Pandas 3.0

(pandas.pydata.org)

267 points by jonbaer 6 days ago|117 comments

•

edschofield 22 hours ago

The design of Pandas is inferior in every way to Polars: API, memory use, speed, expressiveness. Pandas has been strictly worse since late 2023 and will never close the gap. Polars is multithreaded by default, written in a low-level language, has a powerful query engine, supports lazy, out-of-memory execution, and isn’t constrained by any compatibility concerns with a warty, eager-only API and pre-Arrow data types that aren’t nullable.

It’s probably not worth incurring the pain of a compatibility-breaking Pandas upgrade. Switch to Polars instead for new projects and you won’t look back.

•

data-ottawa 19 hours ago

Pandas deserves a ton of respect in my opinion. I built my career on knowing it well and using it daily for a decade, so I’m biased.

Pandas created the modern Python data stack when there were not really any alternatives (except R and closed source). The original split-apply-combine paradigm was well thought out, simple, and effective, and the built-in tools to read pretty much anything (including all of your awful csv files and excel tables) and deal with timestamps easily made it fit into tons of workflows. It pioneered a lot, and basically still serves as the foundation and common format for the industry.

I always recommend every member of my teams read Modern Pandas by Tom Augspurger when they start, as it covers all the modern concepts you need to get data work done fast and with high quality. The concepts carry over to polars.

And I have to thank the pandas team for being a very open and collaborative bunch. They’re humble and smart people, and every PR or issue I’ve interacted with them on has been great.

Polars is undeniably great software, it’s my standard tool today. But they did benefit from the failures and hard edges of pandas, pyspark, dask, the tidyverse, and xarray. It’s an advantage pandas didn’t have, and they still pay for.

I’m not trying to take away from polars at all. It’s damn fast — the benchmarks are hard to beat. I’ve been working on my own library and basically every optimization I can think of is already implemented in polars.

I do have a concern with their VC funding/commercialization with cloud. The core library is MIT licensed, but knowing they’ll always have this feature wall when you want to scale is not ideal. I think it limits the future of the library a lot, and I think long term someone will fill that niche and the users will leave.

•

neves 17 hours ago

Is this the Modern Pandas reference you recommend?

https://tomaugspurger.net/posts/modern-1-intro/

•

data-ottawa 16 hours ago

Yes it is

•

nothrowaways 16 hours ago

Very well articulated.

•

sampo 20 hours ago

Historically 18 years ago, Pandas started as a project by someone working in finance to use Python instead of Excel, yet be nicer than using just raw Python dicts and Numpy arrays.

For better or worse, like Excel and like the simpler programming languages of old, Pandas lets you overwrite data in place.

Prepare some data

    import pandas as pd
    import polars as pl
    df_pandas = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [10, 20, 30, 40, 50]})
    df_polars = pl.from_pandas(df_pandas)

And then

    df_pandas.loc[1:3, 'b'] += 1

    df_pandas
       a   b
    0  1  10
    1  2  21
    2  3  31
    3  4  41
    4  5  50

Polars comes from a more modern data engineering philosophy, and data is immutable. In Polars, if you ever wanted to do such a thing, you'd write a pipeline to process and replace the whole column.

    df_polars = df_polars.with_columns(
        pl.when(pl.int_range(0, pl.len()).is_between(1, 3))
        .then(pl.col("b") + 1)
        .otherwise(pl.col("b"))
        .alias("b")
    )

If you are just interactively playing around with your data, and want to do it in Python and not in Excel or R, Pandas might still hit the spot. Or use Polars, and if need be then temporarily convert the data to Pandas or even to a Numpy array, manipulate, and then convert back.

P.S. Polars has an optimization to overwrite a single value

    df_polars[4, 'b'] += 5
    df_polars
    ┌─────┬─────┐
    │ a   ┆ b   │
    │ --- ┆ --- │
    │ i64 ┆ i64 │
    ╞═════╪═════╡
    │ 1   ┆ 10  │
    │ 2   ┆ 21  │
    │ 3   ┆ 31  │
    │ 4   ┆ 41  │
    │ 5   ┆ 55  │
    └─────┴─────┘

But as far as I know, it doesn't allow slicing or anything.

•

richardbachman 8 minutes ago

`row_index()` was also recently added.

  df.with_columns(pl.col.b + pl.row_index().is_between(1, 3))
  # shape: (5, 2)
  # ┌─────┬─────┐
  # │ a   ┆ b   │
  # │ --- ┆ --- │
  # │ i64 ┆ i64 │
  # ╞═════╪═════╡
  # │ 1   ┆ 10  │
  # │ 2   ┆ 21  │
  # │ 3   ┆ 31  │
  # │ 4   ┆ 41  │
  # │ 5   ┆ 50  │
  # └─────┴─────┘

> Polars has an optimization to overwrite a single value

I believe it is just "syntax sugar" for calling `Series.scatter()`[1]

> it doesn't allow slicing

I believe you are correct:

  df_polars[1:3, "b"] += 1
  # TypeError: cannot use "slice(1, 3, None)" for indexing

You can do:

  df_polars[list(range(1, 4)), "b"] += 1

Perhaps nobody has requested slice syntax? It seems like it would be easy to add.

[1]: https://github.com/pola-rs/polars/blob/9079e20ae59f8c75dcce8...

•

goatlover 16 hours ago

The Polars code puts me off as being too verbose and requiring too many steps. I love the broadcasting ability that Pandas gets from Numpy. It's what scientific computing should look like in my opinion. Maybe R, Julia or some array-based language does it a bit better than Numpy/Pandas, but it's certainly not like the Polars example.

•

thijsn 14 hours ago

Polars is indeed more verbose when coming from pandas, but in my experience it is an advantage for when you're reading that same code after not having touched it for months.

pandas is write-optimized, so you can quickly and powerfully transform your data. Once you're used to it, it allows you to quickly get your work done. But figuring out what is happening in that code after returning to it a while later is a lot harder compared to Polars, which is more read-optimized. This read-optimized API coincidentally allows the engine to perform more optimizations because all implicit knowledge about data must be typed out instead of kept in your head.

•

goatlover 14 hours ago

I don't agree that more verbose code is necessarily more readable when the shorter code looks like familiar math. All you have to do is learn how operators broadcast across array-like structures, how slicing and filtering works. Perhaps with more complicated examples the shorter code becomes harder to read after months away? Mathematicians are able to handle a lot of compact equations.

No doubt some of this comes down to preference as to what's considered readable. I never really bought that argument that regular expressions create more problems than they're worth. Perhaps I side on the expressivity end of the readability debate.
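
For readers who have not used it, the compact style being defended here looks roughly like the following in NumPy. This is only an illustrative sketch with made-up values (the array `b` and the `offsets` vector are assumptions, not code from the thread); it shows slicing, boolean filtering, and broadcasting in a few lines.

```python
import numpy as np

b = np.array([10, 20, 30, 40, 50])

# Slicing: NumPy slices are half-open, so 1:4 touches positions 1, 2 and 3
# (pandas .loc[1:3] is label-based and inclusive, hence the different bounds).
b[1:4] += 1

# Filtering: a comparison yields a boolean mask that can index the array.
mask = b > 25
large = b[mask]

# Broadcasting: a (5, 1) column combined with a (2,) row gives a (5, 2) grid.
offsets = np.array([0, 100])
grid = b[:, None] + offsets
```

Whether this reads better than an explicit expression pipeline after a few months away is exactly the preference question raised above.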

•

thereisnospork 14 hours ago

Likewise, I was considering trying Polars until I saw that example. The pandas example is a good approximation of how I think and want to transform/process data even if it is ugly under the hood. I do occasionally find numpy and pandas annoying wrt when they return a view vs a copy but the cure seems worse than the disease.

•

satvikpendem 19 hours ago

"If I have seen further, it is by standing on the shoulders of giants" - Isaac Newton

Polars is great, but it is better precisely because it learned from all the mistakes of Pandas. Don't besmirch the latter just because it now has to deal with the backwards compatibility of those mistakes, because when it first started, it was revolutionary.

•

crystal_revenge 17 hours ago

Can one criticize pandas by comparing to R's native DataFrames that have existed since R's inception in the 90s?

I (and many others) hated Pandas long before Polars was a thing. The main problem is that it's a DSL that doesn't really work well with the rest of Python (that and multi-index is awful outside of the original financial setting). If you're doing pure data science work it doesn't really come up, but as soon as you need to transform that work into a production solution it starts to feel quite gross.

Before Polars my solution was (and still largely remains) to do most of the relational data transformations in the data layer, and then use dicts, lists and numpy for all the additional downstream transformations. This made it much easier to break out of the "DS bubble" and incorporate solutions into main products.

•

vegabook 19 hours ago

"revolutionary"? It just copied and pasted the decades-old R (previously "S") dataframe into Python, including all the paradigms (with worse ergonomics since it's not baked into the language).

•

data-ottawa 18 hours ago

No other modern language will compete with R on ergonomics because of how it allows functions to read the context they’re called in, and S expressions are incredibly flexible. The R manual is great.

To say pandas just copied it but worse is overly dismissive. The core of pandas has always been indexing/reindexing, split-apply-combine, and slicing views.

It’s a different approach than R’s data tables or frames.

•

aidos 15 hours ago

> allows functions to read the context they’re called in

Can you show an example? Seems interesting considering that code knowing about external context is not generally a good pattern when it comes to maintainability (security, readability).

I’ve lived through some horrific 10M line coldfusion codebases that embraced this paradigm to death - they were a whole other extreme where you could _write_ variables in the scope of where you were called from!

•

condwanaland 15 hours ago

Say I have a dataframe called 'penguins'

I can write code like: penguin_sizes <- select(penguins, weight, height)

Here, weight and height are columns inside the dataframe. But I can refer to them as if they were objects in the environment (i.e., without quotes) because the select function looks for them inside the penguins dataframe (its first argument)

This is a very simple example but it's used extensively in some R paradigms

•

data-ottawa 14 hours ago

Yes, this exactly.

And it's why you can do plot(x, sin) and get properly labelled graphs. It also powers the formula API that made caret and glm modules so easy to use.

•

sampo 17 hours ago

This is an interesting question.

Dataframes first appeared in S-PLUS in 1991-1992. Then R copied S, and from 1995-1996-1997 onwards R started to grow in popularity in statistics. As free and open source software, R started to take over the market among statisticians and other people who were using other statistical software, mainly SAS, SPSS and Stata.

Given that S and R existed, why were they mostly not picked up by data analysts and programmers in 1995-2008, and only Python and Pandas made dataframes popular from 2008 onwards?

•

xtracto 18 hours ago

Exactly. I was programming in R in 2004 and Pandas didn't exist. I remember trying Pandas once and it felt unergonomic for data analysis and it lacked the vast library of statistical analysis packages.

•

BeetleB 18 hours ago

It was revolutionary to Python. Without NumPy and Pandas, ML in Python would never have been a thing.

(Yes, yes - I know some people wish that were the case!)

•

Xunjin 19 hours ago

Indeed, even Rust was created by learning from the mistakes of memory management and known patterns like the famous RAII.

•

bicepjai 17 hours ago

With all great observations made, the quote still stands. "If I have seen further, it is by standing on the shoulders of giants" - Isaac Newton. When people say they feel a sense of community, this is exactly what it means in software philosophy: we do something, others learn from it, and make better ones. In no way is the inspiration’s origin below what it inspired.

•

v3ss0n 22 hours ago

Sounds too much like an advertisement. Also, we need to watch out when diving into Polars. Polars is a VC-backed open-source project with a cloud offering, which may become an open-core project - we know how those go.

•

gkbrk 22 hours ago

> we know how those go

They get forked and stay open source? At least this is what happens to all the popular ones. You can't really un-open-source a project if users want to keep it open-source.

•

stingraycharles 22 hours ago

Depends on your definition of popular; plenty of examples where the business interests don't align well with open source.

•

v3ss0n 2 hours ago

Not many can maintain a complex project full time.

•

quentindanjou 17 hours ago

I was also thinking that this comment looks like an ad. Pandas does not have any paid option and isn't made directly for profit.

•

disgruntledphd2 15 hours ago

To be fair, as someone who's fought pandas for many years I agree with basically everything they said. The API design for Polars is much, much more intuitive. It's a base R to dplyr level change.

•

rdedev 19 hours ago

While polars is better if you work with predefined data formats, pandas is imo still better as a general purpose table container.

I work with chemical datasets and this always involves converting SMILES string to Rdkit Molecule objects. Polars cannot do this as simply as calling .map on pandas.

Pandas is also much better to do EDA. So calling it worse in every instance is not true. If you are doing pure data manipulation then go ahead with polars
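
A minimal sketch of the "general purpose table container" point, assuming a toy `Molecule` class and `parse_smiles` function as hypothetical stand-ins for RDKit's molecule objects and parser (RDKit itself is not used here). It shows `.map` putting arbitrary Python objects into a pandas column and then reading attributes back off them.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass(frozen=True)
class Molecule:
    """Hypothetical stand-in for an RDKit molecule object."""

    smiles: str

    @property
    def n_chars(self) -> int:
        # Toy property; a real molecule object would expose atoms, bonds, etc.
        return len(self.smiles)


def parse_smiles(smiles: str) -> Molecule:
    # Hypothetical parser standing in for the real SMILES-to-molecule step.
    return Molecule(smiles)


df = pd.DataFrame({"smiles": ["CCO", "c1ccccc1", "CC(=O)O"]})

# .map calls the function once per element; the result is an object column.
df["mol"] = df["smiles"].map(parse_smiles)

# Later steps can call methods or read attributes on the stored objects.
df["n_chars"] = df["mol"].map(lambda m: m.n_chars)
```

The cost is that an object column is just a column of Python references, so nothing in it is vectorized.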

•

data-ottawa 18 hours ago

Map is one operation pandas does nicely that most other “wrap a fast language” dataframe tools do poorly.

When it feels like you’re writing some external udf that's executed in another environment, it does not feel as nice as throwing in a lambda, even if the lambda is not ideal.

•

vegabook 16 hours ago

you have map_elements in polars which does exactly this.

https://docs.pola.rs/api/python/dev/reference/expressions/ap...

You can also iter_rows into a lambda if you really want to.

https://docs.pola.rs/api/python/stable/reference/dataframe/a...

Personally I find it extremely rare that I need to do this given Polars expressions are so comprehensive, including when.then.otherwise when all else fails.

•

data-ottawa 14 hours ago

That one has a bit more friction than pandas because of the return schema requirement -- pandas lets you get away with this bad practice.

It also does batches when you declare scalar outputs, but you can't control the batch size, which usually isn't an issue, but I've run into situations where it is.

•

rich_sasha 22 hours ago

I almost fully agree. I would add that Pandas API is poorly thought through and full of footguns.

Where I certainly disagree is the "frame as a dict of time series" setting, and general time series analysis.

The feel is also different. Pandas is an interactive data analysis container, poorly suited for production use. Polars I feel is the other way round.

•

thelastbender12 21 hours ago

I think that's a fair opinion, but I'd argue against it being poorly thought out - pandas HAS to stick with older api decisions (dating back to before data science was a mature enough field, and it has pandas to thank for much of it) for backwards compatibility.

•

ohyoutravel 20 hours ago

Well this is like saying Python must maintain backwards compatibility with Python 2 primitives for all time. It’s simply not true. It’s not easy to deprecate an old API, but it’s doable and there are playbooks for it. Pandas is good, I’ve used it extensively, but agree it’s not fit for production use. They could catch up to the state of the art, but that requires them being very opinionated and willing to make some unpopular decisions for the greater good.

•

cruffle_duffle 19 hours ago

Why though? polars sounds like the rewrite! It’s okay to cycle into a new library. Let pandas do its thing and polars slowly take over as new projects overtake. There is nothing wrong with this and it happens all the time.

Like jquery, which hasn’t fundamentally changed since I was a wee lad doing web dev. They didn’t make major changes despite their approach to web dev being replaced by newer concepts found on angular, backbone, mustache, and eventually react. And that is a good thing.

What I personally don’t want is something like angular that basically radically changed between 1.0 and 2.0. Might as well just call 2.0 something new.

Note: I’ve never heard of polars until this comment thread. Can’t wait to try it out.

•

ptman 20 hours ago

3.0 is the perfect place to break compat

•

sirfz 21 hours ago

I think that's a sane take. Indeed, I think most data analysts find it much easier to use pandas over polars when playing with data (mainly the bracket syntax is faster and mostly sensible)

•

lairv 20 hours ago

I would agree if not for the fact that polars is not compatible with Python multiprocessing when using the default fork method, the following script hangs forever (the pandas equivalent runs):

    import polars as pl
    from concurrent.futures import ProcessPoolExecutor

    pl.DataFrame({"a": [1,2,3], "b": [4,5,6]}).write_parquet("test.parquet")

    def read_parquet():
        x = pl.read_parquet("test.parquet")
        print(x.shape)

    with ProcessPoolExecutor() as executor:
        futures = [executor.submit(read_parquet) for _ in range(100)]
        r = [f.result() for f in futures]

Using thread pool or "spawn" start method works but it makes polars a pain to use inside e.g. PyTorch dataloader
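
A minimal sketch of the "spawn" workaround, using only the standard library. The helper names `spawn_context` and `make_spawn_executor` are made up for illustration; `multiprocessing.get_context` and the `mp_context` parameter of `ProcessPoolExecutor` are the real pieces.

```python
import multiprocessing as mp
from concurrent.futures import ProcessPoolExecutor


def spawn_context():
    # "spawn" starts each worker as a fresh interpreter instead of fork()ing
    # the parent, so workers do not inherit locks held by the parent's threads.
    return mp.get_context("spawn")


def make_spawn_executor(max_workers: int = 2) -> ProcessPoolExecutor:
    # Drop-in replacement for ProcessPoolExecutor() that pins the start method
    # for this pool only, without changing the process-wide default.
    return ProcessPoolExecutor(max_workers=max_workers, mp_context=spawn_context())
```

With "spawn", the submitted function has to be importable from a module (a top-level `def`, not a lambda), and the code that submits work should sit under `if __name__ == "__main__":` so that workers re-importing the script do not start pools of their own.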

•

skylurk 19 hours ago

You are not wrong, but for this example you can do something like this to run in threads:

  import polars as pl
  
  pl.DataFrame({"a": [1, 2, 3]}).write_parquet("test.parquet")
  
  
  def print_shape(df: pl.DataFrame) -> pl.DataFrame:
      print(df.shape)
      return df
  
  
  lazy_frames = [
      pl.scan_parquet("test.parquet")
      .map_batches(print_shape)
      for _ in range(100)
  ]
  pl.collect_all(lazy_frames, comm_subplan_elim=False)

(comm_subplan_elim is important)

•

ritchie46 19 hours ago

Python 3.14 "spawns" by default.

However, this is not a Polars issue. Using "fork" can leave ANY MUTEX in the system process invalid (a multi-threaded query engine has plenty of mutexes). It is highly unsafe and assumes that none of the libraries in your process hold a lock at that time. That's not an assumption for PyTorch dataloaders to make.

•

lairv 18 hours ago

Default to "spawn" is definitely the right thing, it avoids many footguns

That said, for PyTorch DataLoader specifically, switching from fork to spawn removes copy-on-write, which can significantly increase startup time and more importantly memory usage. It often requires non-trivial refactors; many training codebases aren't designed for this and will simply OOM. So in practice for this use case, I've found it more practical to just use pandas rather than doing a full refactor

•

schmidtleonard 20 hours ago

I can't believe parallel processing is still this big of a dumpster fire in python 20 years after multi-core became the rule rather than the exception.

Do they really still not have a good mechanism to toss a flag on a for loop to capture embarrassing parallelism easily?

•

ritchie46 19 hours ago

Polars does that for you.

•

skylurk 19 hours ago

This is one of the reasons I use polars.

•

lairv 19 hours ago

Well I think ProcessPoolExecutor/ThreadPoolExecutor from concurrent.futures were supposed to be that
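
Roughly what that looks like for an embarrassingly parallel loop, as a sketch: `slow_square` is a made-up placeholder for independent per-item work.

```python
from concurrent.futures import ThreadPoolExecutor


def slow_square(x: int) -> int:
    # Placeholder for independent per-item work (an I/O call, a C extension).
    return x * x


items = list(range(8))

# The plain sequential loop.
sequential = [slow_square(x) for x in items]

# The same loop fanned out over a pool; map() yields results in input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(slow_square, items))
```

On a standard GIL build, threads only help when the work releases the GIL (I/O, many native libraries); for CPU-bound pure Python the same `map` call works with `ProcessPoolExecutor`, at the price of pickling and the start-method issues discussed above.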

•

datsci_est_2015 16 hours ago

Might be cool once PySpark integrates with Polars, but for now like many others I’m stuck with dropping into pandas for non-vectorized operations

•

jvican 15 hours ago

Is there any plan for this?

•

devin-petersohn 15 hours ago

Funny enough, I actually just (2 weeks ago) added support for streaming from Pyspark to Polars/DuckDB/etc through Arrow PyCapsule. By streaming, I mean actually streaming, not collecting all data at once. It won't be released probably until May/June but it's there: https://github.com/apache/spark/commit/ecf179c3485ba8bac72af...

•

datsci_est_2015 15 hours ago

Not that I’m aware of. The Spark ecosystem seems a little too “stable” to be putting effort into that kind of development.

Edit: hah, based on the sibling comment, I stand corrected

•

bikelang 8 hours ago

All of this is true and I agree with you - but this comment comes off a bit disrespectful.

•

bovermyer 17 hours ago

As someone who just encountered Pandas for the first time as part of an Intro to Data Visualization course a few weeks ago, I am now very curious about Polars.

The professor doesn't actually care which tool we use as long as we produce nice graphs, so this is as good a time as any to experiment.

•

mharrison 15 hours ago

"every way" is strong words.

Pandas is better for plotting and third party integration.

•

vaylian 15 hours ago

> The design of Pandas is inferior in every way to Polars

I used Pandas a lot with Jupyter notebooks. I don't have any experience with Polars. Is it also possible to work with Polars dataframes in Jupyter notebooks?

•

disgruntledphd2 15 hours ago

Yes. Most things just work with Polars. The one issue for me is the need for geopandas.

•

torcete 18 hours ago

I didn't know about polars, and I can see that they also have a library for R. However, in R, they have fiercer competition. I wonder how it compares to tidyverse, which is the established data analysis library.

•

bhadass 20 hours ago

why not just go full bore to duckdb?

•

data-ottawa 18 hours ago

A dataframe API allows you to write code in Python, with native syntax highlighting and your LSP can complete it, in one analysis file. Inlined SQL is not as nice, and has weird ergonomics.

UDFs in most dataframe libraries tend to feel better than writing udfs for a sql engine as well.

Polars specifically has lazy mode which enables a query optimizer, so you get predicate push down and all the goodies of SQL, with extra control/primitives (sane pivoting, group_by_dynamic, etc)

I do use ibis on top of duckdb sometimes, but the UDF situation persists and the way they organize their docs is very difficult to use.

•

vegabook 19 hours ago

because method chaining in Polars is much more composable and ergonomic than SQL once the pipeline gets complex which makes it superior in an exploratory "data wrangling" environment.

•

data-ottawa 18 hours ago

Duckdb does support pipe operators as an extension, which is a welcome addition to sql engines for me.

But I do agree with you.

•

pelasaco 16 hours ago

are many of the mentioned issues not just some vibe-code sessions away from done?

•

noitpmeder 7 hours ago

Give it a shot and report back when you get them merged

•

pelasaco 3 hours ago

not my circus not my monkeys

•

noo_u 20 hours ago

Polars took a lot of ideas from Pandas and made them better - calling it "inferior in every way" is all sorts of disrespectful :P

Unfortunately, there are a lot of third party libraries that work with Pandas that do not work with Polars, so the switch, even for new projects, should be done with that in mind.

•

skylurk 20 hours ago

Luckily, polars has .to_pandas() so you can still pass pandas dataframes to the libraries that really are still stuck on that interface.

I maintain one of those libraries and everything is polars internally.

•

adolph 18 hours ago

> pandas dataframes

Didn't Pandas move to Arrow, matching Polars, in version 2?

•

noo_u 19 hours ago

to_pandas has a dependency on pandas - it is not the biggest of deals, but worth keeping in mind.

•

postalcoder 23 hours ago

I've migrated off of pandas to polars for my workflows to reap the benefit of, in my experience, a 10-20x speedup on average. I can't imagine anything bringing me back short of a performance miracle. LLMs have made syntax almost a non-barrier.

•

lvl155 22 hours ago

Went from pandas to polars to duckdb. As mentioned elsewhere SQL is the most readable for me and LLM does most of the coding on my end (quant). So I need it at the most readable and rudimentary/step-wise level.

OT, but I can’t imagine data science being a job category for too long. It’s got to be one of the first to go in AI age especially since the market is so saturated with mediocre talents.

•

data-ottawa 18 hours ago

As a long time DS I sadly feel we filled the field with people who don’t do any actual data science or engineering. A lot of it is glorified BI users who at most pull some averages and run half baked AB tests.

I don’t think the field will go away with AI, frankly with LLMs I’ve automated that bottom 80% of queries I used to have to do for other users and now I just focus on actual hard problems.

That “build a self serve dashboard” or number fetching is now an agentic tool I built.

But the real meat of “my business specializes in X, we need models to do this well” has not yet been replaceable. I think most hard DS work is internal so isn’t in training sets (yet).

•

claytonjy 15 hours ago

Even before LLMs, Data Science was being replaced by more specialization, IME.

Data Engineers took over the plumbing once they moved on from Scala and Spark. ML Engineers took over the modeling (and LLMs are now killing this job too, as it’s rare to need model training outside of big labs). Data analysts have to know SQL and python these days, and most DS are now just this, but with a nicer title and higher pay.

Once upon a time I thought DS would be much more about deeper statistics and causal inference, but those have proven to be rare, niche needs outside soft science academia.

•

datsci_est_2015 14 hours ago

Reading a comment like this makes me realize how broad the title “Data Scientist” is, especially this tidbit:

> as it’s rare to need model training outside of big labs

Do you think there are pre-trained models for e.g. process optimization for the primary metallurgy process for steel manufacturing? Industrial engineers don’t know anything about machine learning (by trade), and there are companies that bring specialized Data Science know-how to that industry to improve processes using modern data-driven methods, especially model building.

It’s almost like 99% of comments on this topic think that DS begins at image classification and ends at LLMs, with maybe a little bit of landing page A/B testing or something. Wild.

> Once upon a time I thought DS would be much more about deeper statistics and causal inference, but those have proven to be rare, niche needs outside soft science academia.

This is my entire career lol.

•

datsci_est_2015 16 hours ago

> It’s got to be one of the first to go in AI age especially since the market is so saturated with mediocre talents.

Depends what your definition of “to go” means. Responsibilities swallowed by peers? Sure, and new job titles might pop up like Research & Development Engineer or something.

The discipline of creating automated systems to extract insights from data to create business value? I can’t really see that going anywhere. I mean, why tf would we be building so many data centers if there’s no value in the data they’re storing.

•

iugtmkbdfil834 20 hours ago

<< It’s got to be one of the first to go in AI age especially since the market is so saturated with mediocre talents.

This is interesting. I wanted to dig into it a little since I am not sure I am following the logic of that statement.

Do you mean that AI would take over the field, because by default most people there are already not producing anything that a simple 'talk to data' LLM won't deliver?

•

mynameisash 20 hours ago

Not GP, but as a data engineer who has worked with data scientists for 20 years, I think the assessment is unfortunately true.

I used to work on teams where DS would put a ton of time into building quality models, gating production with defensible metrics. Now, my DS counterparts are writing prompts and calling it a day. I'm not at all convinced that the results are better, but I guess if you don't spend time (=money) on the work, it's hard to argue with the ROI?

•

datsci_est_2015 16 hours ago

In what field do you work?

> writing prompts and calling it a day

What does this mean? They’re not creating pull requests and maintaining learning / analytics systems?

This kind of vagueposting gets on my nerves.

•

mynameisash 14 hours ago

> They’re not creating pull requests and maintaining learning / analytics systems?

Sure, they check prompts into git. And there are a few notebooks that have been written and deployed, but most of that is collecting data and handing it off to ChatGPT. No, they're not maintaining learning/analytics systems. My team builds our data processing pipelines, and we support everything in production.

> This kind of vagueposting gets on my nerves.

What is vague about my comment?

Whereas in the past, the DS teams I worked with would do feature engineering and rigorous evaluation of models with retraining based on different criteria, now I'm seeing that teams are being lazy and saying, "We'll let the LLM do things. It can handle unstructured data, and we can give it new data without additional work on our part." Hence, they're simply writing a prompt and not doing much more.

•

datsci_est_2015 14 hours ago

I have never heard of this. What kind of insights are being generated? What kind of data? Am I unaware that we’re at the point that I can give a CSV of e.g. industrial measurement data to an LLM and it provides reliable and repeatable output? Are people making decisions based on the LLM output? Do the people making those decisions based on that output know that it might be completely hallucinated and the only response they’ll get from the “Data Scientists” is a shoulder shrug?

So many questions. That’s why I called it vague. I don’t know how any data scientist could read this and not have a million follow up questions. Is this offline learning? Online learning? What are the guardrails? Are there guardrails? Mostly, wtf?

•

mritchie712 23 hours ago

also migrated, but to duckdb.

It's funny to look back at the tricks that were needed to get gpt3 and 3.5 to write SQL (e.g. "you are a data analyst looking at a SQL database with table [tables]"). It's almost effortless now.

•

wodenokoto 18 hours ago

Do you use it from within Python or just ingest straight into duckdb.exe or duckdb UI?

•

howling 23 hours ago

Same. I don't even use LLM normally as I found polars' syntax to be very intuitive. I just searched my ChatGPT history and the only times I used it are when I'm dealing with list and struct columns that were not in pandas.

•

postalcoder 23 hours ago

iirc part of pandas’ popularity was that it modeled some of R’s ergonomics. What a time in history, when such things mattered! (To be clear, I’m not making fun of pandas. It was the bridge I crossed that moved me from living in Excel to living in code.)

•

iugtmkbdfil834 20 hours ago

I learned about pandas with R in my class way back when. At the time, it seemed like magic. In a sense, it still does, but things evolve.

•

gHA5 23 hours ago

Do you not experience LLM generated code constantly trying to use Pandas' methods/syntax for Polars objects?

•

edschofield 23 hours ago

Yes, ChatGPT 5.2 Pro absolutely still does this. Just ask it for a pivot table using Polars and it will probably spit out code with Pandas arguments that doesn’t work.

•

postalcoder 23 hours ago

There were some growing pains in gpt-3.5 to gpt-4 era, but not nowadays (shoutout to the now-defunct Phind, which was a game changer back then).

•

crimsoneer 23 hours ago

The fact they pivoted away from their very compelling core offering (AI stack overflow) to compete with Lovable etc in the "AI generated apps" giant fight continues to baffle me. Though I guess model updates ate their lunch.

•

postalcoder 22 hours ago

My guess is that their pivot came after distress, and was not the cause of it. It'd be great to have @rushingcreek write a post-mortem. I think it'd benefit a lot of people because I honestly don't have a Monday-morning playbook of what could have saved them.

Like you said, perhaps the demise of Phind was inevitable, with large models displacing it kind of like how Spotify displaced music piracy.

•

thibaut_barrere 22 hours ago

Polars being so fast, and embeddable into other languages, has made it a no brainer for me to adopt it.

I have integrated Explorer https://github.com/elixir-explorer/explorer, which leverages it, into many Elixir apps, so happy to have this.

•

alex7o 23 hours ago

Same, also polars works on typescript, which I used at some point to move my data from backend to frontend

•

thegabriele 21 hours ago

" 10-20x speedup on average. "

Is this everyone's experience?

•

OGWhales 19 hours ago

It depends on the specifics, but I converted a couple of scripts recently that would take minutes to run with Pandas that only took seconds to run with Polars. I was pretty impressed.

•

mynameisash 19 hours ago

That was probably about what I got when I migrated some heavy number crunching code from Pandas to Polars a few years ago. Maybe even better than that.

•

mjhay 16 hours ago

It’s a typical experience. Polars is fast, and Pandas is very slow and memory-hungry. It would be one thing if Pandas had a good API, but it doesn’t.

•

OutOfHere 23 hours ago

The speedup you claim is going to be contingent on how you use Pandas, with which data types, and which version of Pandas.

•

kayson 18 hours ago

Are there any pandas alternatives that offer stronger column typing? Ideally something where I can have the schema defined in advance, validate the data, then have the type checker be smart enough to know that df.foo exists and is float and df.bar doesn't.

I tried pandera and it left a lot to be desired. Static frame [1] seems promising but doesn't appear to be popular for some reason.

1. https://static-frame.readthedocs.io/en/latest/

•

teekert 19 hours ago

I have deep respect for Pandas. It and JupyterLab were my intro to programming, and that route worked much better for me. I did some "intro to Python" courses, but they were all about strs and ints. And yes, you can add strs together! Wow, magic... Not for me. For me it all clicked when I first looped through a pile of Excel files (pd.read_excel()), extracted the info I needed and wrote a new Excel file... Mind blown.

From there, of course, you slowly start to learn about types etc, and slowly you start to appreciate libraries and IDEs. But I knew tables, and statistics and graphs, and Pandas (with the visual style of Notebooks) led me to programming via that familiar world. At first with some frustration about Pandas and needing to write to Excel, do stuff, and read again, but quickly moving into the opposite flow, where Excel itself became the limiting factor and I got annoyed whenever I had to use it.

I offered some "Programming for Biologists" courses, to teach people like me to do programming in this way, because it would be much less "dry" (pd.read_excel().plot.bar() and now you're programming). So far, wherever I offered the courses they said they prefer to teach programming "from the base up". Ah well! I've been told I'm not a programmer, I don't care. I solve problems (and that is the only way I am motivated enough to learn, I can't sit down solving LeetCode problems for hours, building exactly nothing).

(To be clear, I now do the Git, the Vim, the CI/CD, the LLM, the Bash, The Linux, the Nix, the Containers... Just like a real programmer, my journey was just different, and suited me well, I believe others can repeat my journey and find joy in programming, via a different route.)

•

jtrueb 21 hours ago

That timestamp resolution discrepancy is going to cause so many problems

•

EForEndeavour 19 hours ago

Do you mean the new default datetime resolution of microseconds instead of the previous nanosecond resolution? Obviously this will require adjustments to any code that requires ns resolution, but I'd bet that's a tiny minority of all pandas code ever written. Do you have a particular use case in mind for the problems this will cause?

•

jtrueb 16 hours ago

I would describe it as the huge majority, reflecting on my pandas use over the years. Pretty much all of the data worth exploring in pandas over excel, some data gui, or polars involves timestamps.

•

ciupicri 15 hours ago

Yeah, but is nanosecond-level resolution necessary? In many scenarios, a resolution of one second is adequate.

•

jtrueb 14 hours ago

I don't need nanosecond accuracy. I just know there are a lot of scripts expecting it.
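
For scripts like that, a minimal sketch of one way to stop depending on the default: cast to nanoseconds explicitly before taking an integer view. This assumes the script only needs epoch nanoseconds as int64; the variable names are made up for illustration, and the sample timestamp is arbitrary.

```python
import pandas as pd

# Parse a timestamp. The inferred unit depends on the pandas version:
# nanoseconds on older releases, microseconds under the new default.
s = pd.Series(pd.to_datetime(["2024-01-01 00:00:01"]))

# Pin the resolution explicitly instead of relying on the default.
s_ns = s.astype("datetime64[ns]")

# The integer view is now nanoseconds since the Unix epoch either way.
epoch_ns = s_ns.astype("int64").iloc[0]
print(s_ns.dtype, epoch_ns)
```

Without the explicit cast, the same integer view would be 1000x smaller wherever the column ends up in microseconds, which is the kind of silent breakage that is hard to spot.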

•

alexcasalboni 20 hours ago

Haven't used pandas in a while, but Copy-on-Write sounds pretty cool! Is there any public benchmark I can check in 2026?

•

gku 17 hours ago

The need to upgrade Pandas, combined with emerging AI tools, might accelerate Polars adoption, let’s see what happens.

•

QuadmasterXLII 19 hours ago

Ugh, I'm still recovering from numpy breaking changes with 2.0

•

optimalsolver 23 hours ago

How soon will the leading LLMs ingest the updated documentation? Because I'm certainly not going to.

•

uncletoxa 23 hours ago

Use context7 mcp. It'll do the trick

•

leadingthenet 18 hours ago

I've been sleeping on this, works like a charm!

•

esafak 18 hours ago

You could create skills out of the docs if you use it a lot. https://agentskills.io/

•

g-mork 20 hours ago

This is the most misunderstood aspect of how marketing has changed recently

•

OutOfHere 23 hours ago

In my experience, it would take a year to ingest it natively, and two years to also ingest enough coding examples.

•

swyx 17 hours ago

it's not even about the ingest: every major semver change is now a problem, because LLMs will need to contextually distinguish whether they are expected to output Pandas 2 or 3, unless ofc you explicitly prompt them.

•

OutOfHere 16 hours ago

I wouldn't worry about it because over a longer period, this automatically leans toward the more recent versions. There are multiple forces that exist to make this happen.

The main exception is legacy code that still needs maintenance when its owners are unwilling to upgrade Pandas.

•

swyx 16 hours ago

yes but a lot of legacy code will still need to be maintained and written. you don't see how this can be confusing/annoying?

•

OutOfHere 23 hours ago

s/impactfull/impactful

•

Havoc 19 hours ago

Regex is great when one is communicating with machines
