Python is not a great language for data science

blog.genesmindsmachines.com

57 points by speckx 4 hours ago

RobinL 2 hours ago

I think a lot of this comes down to the question: Why aren't tables first class citizens in programming languages?

If you step back, it's kind of weird that there's no mainstream programming language that has tables as first class citizens. Instead, we're stuck learning multiple APIs (polars, pandas) which are effectively programming languages for tables.

R is perhaps the closest, because it has data.frame as a 'first class citizen', but most people don't seem to use it, and use e.g. tibbles from dplyr instead.

The root cause seems to be that we still haven't figured out the best language to use to manipulate tabular data yet (i.e. the way of expressing this). It feels like there's been some convergence on some common ideas. Polars is kind of similar to dplyr. But no standard, except perhaps SQL.

FWIW, I agree that Python is not great, but I think it's also true R is not great. I don't agree with the specific comparisons in the piece.

RodgerTheGreat 2 hours ago

There are a number of dynamic languages to choose from where tables/dataframes are truly first-class datatypes: perhaps most notably Q[0]. There are also emerging languages like Rye[1] or my own Lil[2].
I suspect that in the fullness of time, mainstream languages will eventually fully incorporate tabular programming in much the same way they have slowly absorbed a variety of idioms traditionally seen as part of functional programming, like map/filter/reduce on collections.
[0] https://en.wikipedia.org/wiki/Q_(programming_language_from_K...
[1] https://ryelang.org/blog/posts/comparing_tables_to_python/
[2] http://beyondloom.com/tools/trylil.html
kelipso 2 hours ago

People use data.table in R too (my favorite among those but it’s been a few years). data.table compared to dplyr is quite a contrast in terms of language to manipulate tabular data.
paddleon 2 hours ago

> R is perhaps the closest, because it has data.frame as a 'first class citizen', but most people don't seem to use it, and use e.g. tibbles from dplyr instead.
You're forgetting R's data.table, https://cran.r-project.org/web/packages/data.table/vignettes...,
which is amazing. Tibbles only wins because they fought the docs/onboarding battle better, and dplyr ended up getting industry buy-in.
kevinhanson 2 hours ago

this is my biggest complaint about SAS--everything is either a table or text.
most procs use tables as both input and output, and you better hope the tables have the correct columns.
you want a loop? you either get an implicit loop over rows in a table, write something using syscalls on each row in a table, or you're writing macros (all text).
jna_sh 2 hours ago

I know the primary data structure in Lua is called a table, but I’m not very familiar with them and if they map to what’s expected from tables in data science.
- Jtsummers 2 hours ago
  
  Lua's tables are associative arrays, at least fundamentally. There's more to it than that, but it's not the same as the tables/data frames people are using with pandas and similar systems. You could build that kind of framework on top of Lua's tables, though.
  https://www.lua.org/pil/2.5.html
- TheSoftwareGuy 2 hours ago
  
  IIRC those are basically hash tables, which are first-class citizens in many languages already
nextos 2 hours ago

I don't think this is the real problem. In R and Julia tables are great, and they are libraries. The key is that these languages are very expressive and malleable.
Simplifying a lot, R is heavily inspired by Scheme, with some lazy evaluation added on top. Julia is another take at the design space first explored by Dylan.
CivBase 2 hours ago

What is a table other than an array of structs?
- thom 2 hours ago
  
  It’s not that you can’t model data that way (or indeed with structs of arrays), it’s just that the user experience starts to suck. You might want a dataset bigger than RAM, or that you can transparently back by the filesystem, RAM or VRAM. You might want to efficiently index and query the data. You might want to dynamically join and project the data with other arrays of structs. You might want to know when you’re multiplying data of the wrong shapes together. You might want really excellent reflection support. All of this is obviously possible in current languages because that’s where it happens, but it could definitely be easier and feel more of a first class citizen.
- RobinL 2 hours ago
  
  I would argue that's about how the data is stored. What I'm trying to express is the idea of the programming language itself supporting high level tabular abstractions/transformations such as grouping, aggregation, joins and so on.
  - camdenreslink an hour ago
    
    Sounds a lot like LINQ in .NET (which is usually compatible with ORMs actually querying tables).
  - CivBase 2 hours ago
    
    Ah, that makes more sense. Thanks for the clarification.
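To make the distinction in this subthread concrete, here is a minimal sketch using only the Python standard library. The `Sale` record, the sample data, and the helper names `total_by_region` and `join_managers` are all hypothetical. It shows the two storage layouts mentioned above (array of structs, struct of arrays) and then the kind of grouping and joining that has to be written by hand when the language has no built-in tabular operations.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Sale:
    region: str
    product: str
    amount: float


# Array-of-structs layout: one record per row (hypothetical sample data).
sales = [
    Sale("north", "widget", 10.0),
    Sale("south", "widget", 4.0),
    Sale("north", "gadget", 6.0),
]

# Struct-of-arrays layout: one list per column, holding the same data.
columns = {
    "region": [s.region for s in sales],
    "product": [s.product for s in sales],
    "amount": [s.amount for s in sales],
}


def total_by_region(rows):
    """Group rows by region and sum the amounts (a hand-written group-by)."""
    totals = defaultdict(float)
    for row in rows:
        totals[row.region] += row.amount
    return dict(totals)


def join_managers(rows, managers):
    """Inner-join rows to a region -> manager lookup (a hand-written join)."""
    return [
        (row.region, row.product, managers[row.region])
        for row in rows
        if row.region in managers
    ]


totals = total_by_region(sales)
joined = join_managers(sales, {"north": "Ada"})
```

Either layout can hold the table; the grouping and join logic is what libraries like pandas, polars, or dplyr package up as a vocabulary.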

jakobnissen 2 hours ago

Excellent article - except that the author probably should not have gated their substantiation of the claim behind a cliffhanger, as other commenters have mentioned.

The author's priorities are sensible, and indeed with that set of priorities, it makes sense to end up near R. However, they're not universal among data scientists. I've been a data scientist for eight years, and have found that this kind of plotting and dataframe wrangling is only part of the work. I find there is usually also some file juggling, parsing, and what the author calls "logistics". And R is terrible at logistics. It's also bad at writing maintainable software.

If you care more about logistics and maintenance, your conclusion is pushed towards Python - which still does okay in the dataframes department. If you're ALSO frequently concerned about speed, you're pushed towards Julia.

None of these are wrong priorities. I wish Julia was better at being R, but it isn't, and it's very hard to be both R and useful for general programming.

Edit: Oh, and I should mention: I also teach and supervise students, and I KEEP seeing students use pandas to solve non-table problems, like trying to represent a graph as a dataframe. Apparently some people are heavily drawn to use dataframes for everything - if you're one of those people, reevaluate your tools, but also, R is probably for you.
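As an illustration of the graph example, here is a minimal sketch of a non-table representation: an adjacency mapping plus a breadth-first search, using only the standard library. The sample graph and the `reachable` helper are hypothetical, and this is just one common alternative to storing edges as dataframe rows.

```python
from collections import deque

# Hypothetical undirected graph stored as an adjacency mapping,
# rather than as rows of (source, target) in a dataframe.
graph = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a"],
    "d": ["b"],
    "e": [],
}


def reachable(graph, start):
    """Return the set of nodes reachable from start via breadth-first search."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


component = reachable(graph, "a")
```

Looking up a node's neighbors is a single dictionary access here, whereas an edge-list dataframe would need a filter over all rows at every step.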

whyenot 2 hours ago

What makes Python a great language for data science, is that so many people are familiar with it, and that it is an easy language to read. If you use a more obscure language like Clojure, Common Lisp, Julia, etc., many people will not be familiar with the language and unable to read or review your code. Peer review is fundamental to the scientific endeavor. If you only optimize on what is the best language for the task, there are clearly better languages than Python. If you optimize on what is best for science then I think it is hard not to argue that Python (and R) are the best choices. In science, just getting things done is not enough. Other people need to be able to read and understand what you are doing.

BTW AI is not helping and in fact is leading to a generation of scientists who know how to write prompts, but do not understand the code those prompts generate or have the ability to peer review it.

iLemming 35 minutes ago

I can't speak for Julia - never used it; never used Common Lisp for analyzing data (I don't think it's very "data-oriented" for the modern age and the shape of data), but Clojure is really not "obscure" - it only looks weird for the first fifteen minutes or so; once you start using it - it is one of the most straightforward and reasonable languages out there - it is in fact simpler than Python and Javascript. Immutable-by-default makes it far easier to reason about the code. And OMG, it is so much more data-oriented - it's crazy that more people don't use it. Most have never even heard of it.

forgotpwd16 3 hours ago

The article is well written but fails to address its own thesis, postponing it to a sequel article. In its current state it only suggests that Python is not great because it requires specialized packages. (And the counterexample is R, for which the author also used a package.)

stevenpetryk 2 hours ago

Totally agree. The author's most significant example is two code snippets that are quite similar and both pretty nice.

pacbard 2 hours ago

When you think about a data science pipeline, you really have three separate steps:

[Data Preparation] --> [Data Analysis] --> [Result Preparation]

Neither Python nor R does a good job at all of these.

The original article seems to focus on challenges in using Python for data preparation/processing, mostly pointing out challenges with Pandas and "raw" Python code for data processing.

This could be solved by switching to something like duckdb and SQL to process data.
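A minimal sketch of what SQL-based data preparation could look like. Note that it uses Python's built-in `sqlite3` module as a stand-in, not duckdb itself, so that it runs without extra packages; the `measurements` table and its contents are made up for illustration.

```python
import sqlite3

# In-memory SQLite database, used here only as a stand-in for duckdb.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (site TEXT, value REAL)")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?)",
    [("a", 1.0), ("a", 3.0), ("b", 5.0), ("b", None)],
)

# Data preparation in SQL: drop missing values, then aggregate per site.
rows = conn.execute(
    """
    SELECT site, COUNT(*) AS n, AVG(value) AS mean_value
    FROM measurements
    WHERE value IS NOT NULL
    GROUP BY site
    ORDER BY site
    """
).fetchall()
conn.close()
```

The cleaning and aggregation happen inside the query, so the Python side only receives the prepared result for the analysis step.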

As far as data analysis, both Python and R have their own niches, depending on field. Similarly, there are other specialized languages (e.g., SAS, Matlab) that are still used for domain-specific applications.

I personally find result preparation somewhat difficult in both Python and R. Stargazer is ok for exporting regression tables but it's not really that great. Graphing is probably better in R within the ggplot universe (I'm aware of the python port).

niemandhier 2 hours ago

Python is just a language that:

1. Is easy to read

2. Was easy to extend in languages that people who work with scientific data happen to like.

When I did my masters we hacked around in the numpy source and contributed here and there while doing astrophysics.

Stuff existed in Java and R, but we had learned C in the first semester and python was easier to read and contrary to MATLAB numpy did not need a license.

When data science came into the picture, the field was full of physicists that had done similar things. They brought their tools as did others.

mushufasa 2 hours ago

Languages inherently have network effects; most people around the world learn English so they can talk with other professionals who also know English, not because they are passionate about Charles Dickens.

My take (and my own experience) is that python won because the rest of the team knows it. I prefer R but our web developers don't know it, and it's way better for me to write code that the rest of our team can review, extend, and maintain.

rdtsc 2 hours ago

They basically advocate using R. I think it depends what they mean by "data science" and if the person will be doing just data science. If that's the case then R may be better. As in their whole career is going to be built on that domain. But let's say they are on a general computer science track, now they'll probably benefit from learning Python more than R, simply because they can use it for other purposes.

> Either way, I’ll not discuss it further here. I’ll also not consider proprietary languages such as Matlab or Mathematica, or fairly obscure languages lacking a wide ecosystem of useful packages, such as Octave.

I feel, to most programming folks R is in the same category. R is to them what Octave is to the author. R is nice, but do they really want to learn a "niche" language, even if it has some better features than Python? Is holding a whole new paradigm, syntax, library ecosystem in your head worth it?

drchaim 43 minutes ago

Python was a great language for data science when data science became a mainstream thing.

It was easy to think about the structures (iterators), it was easy to extend, and it had a good community.

And because of that, people started extending it via libraries.

There are plenty more alternatives now.

huherto 2 hours ago

For what it's worth, the Kotlin folks have been adding some cool features and tools for data analysis. https://kotlinlang.org/docs/data-analysis-overview.html

solatic an hour ago

Shell is the best language for data science. Pick the best tools for each of getting data, cleaning data, transforming data, and visualizing data, then stitch them together by sheer virtue of the fact that text is the universal interoperable protocol and files are the universal way of saving intermediate stages of data.

Best part is, write a --help, and you can load them into LLMs as tools to help the LLMs figure it out for you.

Fight me.

iLemming 2 hours ago

From many practical points, Clojure is great for data. And you can even leverage python libs via clj-python.

phforms 2 hours ago

In the past few years I have seen some serious efforts from the Clojure community to make Clojure more attractive for data science. Check out the Scicloj[1] group and their data science stack/toolkit Noj[2] (still in beta) as well as the high-performance tabular data processing library tech.ml.dataset (TMD)[3].
- [1] https://scicloj.github.io
- [2] https://scicloj.github.io/noj
- [3] https://github.com/techascent/tech.ml.dataset

programmertote 2 hours ago

Disclaimer: I have nothing against R or Python and I'm not partial to either.

Python, the language itself, might not be a great language for data science. BUT the author can use Pandas or Polars or another data-science-related library/framework in Python to get the job done that s/he was trying to write in R. I could read both her R and Pandas code snippets and understand them equally.

This article reads just like, "Hey, I'm cooking everything by making all ingredients from scratch and see how difficult it is!".

yeahwhatever10 2 hours ago

A little late for this

ASalazarMX 2 hours ago

"Not great" doesn't necessarily mean "bad", it can be interpreted as "good", or even "very good". An honest title would have explicitly qualified how suitable the author found it was.
That the author avoided saying Python was a bad language outright speaks a great deal to its suitability. Well, that, and the majority of data science in practice.

kasperset 2 hours ago

R data science people generally come to the data science field from the life sciences or stats. Python data science people generally originate from other fields that are mostly engineering focused. Again this may not apply to all the cases but that is my general observation.

Recently I am seeing that Python is heavily pushed for all data science related things. Sometimes objectively Python may not be the best option especially for stats. It is hard to change something after it becomes the "norm" regardless of its usability.

spicybbq an hour ago

Part 2 is here:

https://blog.genesmindsmachines.com/p/python-is-not-a-great-...

paulfharrison 2 hours ago

R is so good in part because of the efforts of people like Di Cook, Hadley Wickham, and Yihui Xie to create a software environment that they like working in.

It also helps that in R any function can completely change how its arguments are evaluated, allowing the tidyverse packages to do things like evaluate arguments in the context of a data frame or add a pipe operator as a new language feature. This is a very dangerous feature to put in the hands of statisticians, but it allows more syntactic innovation than is possible in Python.
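For contrast, Python evaluates arguments before the call happens, so a library has to receive the expression in an already-deferred form, such as a lambda (or a string, or an expression object). A minimal sketch, where `filter_rows` is a hypothetical helper and not a pandas or polars API:

```python
def filter_rows(rows, predicate):
    """Keep the rows (dicts) for which predicate(row) is true."""
    return [row for row in rows if predicate(row)]


rows = [
    {"name": "x", "score": 3},
    {"name": "y", "score": 8},
]

# The condition must be wrapped in a lambda. A bare `score > 5` would be
# evaluated in the caller's scope before the call, where `score` is not
# defined, instead of being evaluated against each row.
high = filter_rows(rows, lambda row: row["score"] > 5)
```

This explicit wrapping is the price of eager evaluation; in R, the called function can decide to evaluate the unquoted condition against the data frame's columns.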

cb321 2 hours ago

Like Python, R is a 2 (+...) language system. C/Fortran backends are needed for performance as problems scale up.
Julia and Nim [1] are dynamic and static approaches (respectively) to 1 language systems. They both have both user-defined operators and macros. Personally, I find the surface syntax of Julia rather distasteful and I also don't live in PLang REPLs / emacs all day long. Of course, neither Julia nor Nim are impractical enough to make calling C/Fortran all that hard, but the communities do tend to implement in the new language without much prompting.
[1] https://nim-lang.org/

serjester 2 hours ago

Seems like their critique boils down to two areas - pandas limitations and fewer built ins to lean on.

Personally I've found polars has solved most of the "ugly" problems that I had with pandas. It's way faster, has an ergonomic API, seamless pandas interop and amazing support for custom extensions. We have to keep in mind Pandas is almost 20 years old now.

I will agree that Shiny is an amazing package, but I would argue it's less important now that LLMs will write most of your code.

thom 21 minutes ago

I think this expectation that data science code is a thing you write basically top to bottom to get some answers out, put them in a graph and move on with your life is not a useful lens through which to evaluate two programming languages. R definitely is an efficient DSL for doing stats this way, but it’s a painful way to build a durable piece of software. Python is nowhere near perfect but I’ve seen fewer codebases that made my eyes bleed, however pretty the graphs might look.

NuSkooler 2 hours ago

You could end it with "Python is not a great language".

Now, is Python a SUCCESSFUL language? Very.

huherto 2 hours ago

Isn't the author saying that Python + Pandas is almost as good as R, but Python without Pandas is less powerful than R?

I can't help but conclude that Python is as good as R because I still have the choice of using Pandas when I need it. What did I get wrong?

paddleon 2 hours ago

you missed the "almost as" in your first sentence.
also, we didn't define "good".

exabrial 2 hours ago

The problem is there's so much momentum behind it that it's hard to course correct. PyTorch is now a goliath.

jswelker 2 hours ago

Inherited Python code is a mixed bag. Inherited R code is a nightmare.

lenerdenator 3 hours ago

> I think people way over-index Python as the language for data science. It has limitations that I think are quite noteworthy. There are many data-science tasks I’d much rather do in R than in Python. I believe the reason Python is so widely used in data science is a historical accident, plus it being sort-of Ok at most things, rather than an expression of its inherent suitability for data-science work.

Python doesn't need to be the best at any one thing; it just has to be serviceable for a lot of things. You can take someone who has expertise in a completely different domain in software (web dev, devops, sysadmin, etc.) and introduce them to the data science domain without making them learn an entirely new language and toolchain.

dmurray 3 hours ago

That's not why it's used in data science though. Lots of data scientists use Python all day and have no concept of ever working in a different field.
It's used in data science because it's used in data science.
- vkazanov 2 hours ago
  
  It's used in data science because no other language has this level of library support.
  And it got this unprecedented level of support because right from the start it made its focus clear syntax and (perceived) simplicity.
  There is also a sort of cumulative effect from being nice for algorithmic work.
  Guido's long-term strategy won over numerous other strong candidates for this role.
  - passivegains 26 minutes ago
    
    I think the key thing not obvious to most data scientists is they're not using python because it meets their needs, it's because we've failed them. twice.
    1. data scientists aren't programmers, so why do they need a programming language? the tools they should be using don't exist. they'd need programmers to make them, and all we have to offer is... more programming languages.
    2. the giant problem at the heart of modern software: the most important feature of a modern programming language is being easy to read and write. this feature is conspicuously absent from most important languages.
    they're trapped. they can't do what they need without a programming language but there are only a handful they can possibly use. the real reason python ended up with such good library support is they never really had a choice.
- mohaine 2 hours ago
  
  But data science usually isn't an island.
  Use whatever you want on your one off personal projects but use something more non-data science friendly if you ever want your model to run directly in a production workflow.
  Productionizing R models is quite painful. The normal way is to just rewrite it not in R.
  - dmurray 8 minutes ago
    
    I've soured a lot on directly productionizing data science code. It's normally an unmaintainable mess.
    If you write it in R and then rewrite it in C (better: rewrite it in English with the R as helpful annotations, then have someone else rewrite it in C), at least there is some chance you've thought about the abstractions and operations that are actually necessary for your problem.
- lenerdenator 2 hours ago
  
  That's probably true now, but at one point, they were looking for people to start doing data science, and were pulling people from other domains.

Lyngbakr 2 hours ago

I was a bit disappointed to discover that this was essentially an R vs. Python article, which is a data science trope. I've been in the field for 20+ years now and while I used to be firmly on team R, I now think that we don't really have a good language for data science. I had high hopes for Julia and even Clojure's data landscape looks interesting, but given the momentum of Python I don't see how it could be usurped at this point.

vkazanov 2 hours ago

It is EVERYWHERE. I recently had to interview a bunch of data scientists, and only one of them knew SQL. Surely, all of them worked with python. I bet none of them had even heard of R.
- garciasn 2 hours ago
  
  SAS > R > Python.
  SAS and R were primarily focused on data science-related fields; however, Python is a far more general-purpose programming language, so far more people are exposed to it, and the hiring pool of those who come in knowing Python is FAR LARGER than the SAS/R pool ever was, even when SAS was actively taught/utilized in undergraduate/graduate programs.
  As a hiring leader in the Data Science and Engineering space, I have extensive experience with all of these + SQL, among others. Hiring has become much easier to go cross-field/post-secondary experience and find capable folks who can hit the ground running.
  - username135 2 hours ago
    
    you beat me to it. i understand why sas gets hate but I think that comes with simply not understanding how powerful it is.
    
    - garciasn 2 hours ago
      
      It was a great language, but it was/is extremely cost-prohibitive plus it simply fell out of favor in academia, for many of the same reasons, and thus was supplanted by free alternatives.
- Lyngbakr 2 hours ago
  
  Yikes. Were they experienced data scientists or straight out of school? I find it very odd (and a bit scary) that they didn't know SQL.
  - garciasn 2 hours ago
    
    Experienced Data Scientists and/or those straight out of school are EXTREMELY lacking in valuable SQL experience and always have been. Take a DS with 25 years experience in SAS, many of them are great with DATAstep, but have far less experience using PROC SQL for querying the data in the most effective way--even if they were pulling the data down with pass-through via SAS/ACCESS.
    Often they'd be doing very simplistic querying and then manipulating via DATAstep prior to running whatever modeling and/or reporting PROCs later, rather than pushing it upstream into a far faster native database SQL pull via pass-through.
    Back in 2008/2009, I saved 30h+ runtime on a regular report by refactoring everything in SQL via pass-through as opposed to the data scientists' original code that simply pulled the data down from the external source and manipulated it in DATAstep. Moving from 30h to 3m (Oracle backend) freed up an entire FTE to do more than babysit a long-running job 3x a week to multiple times per day.
SiempreViernes 2 hours ago

What would it even mean to be a "good language for data science"?
In the first place, data science is more a label someone put on a bag full of cats than a vast field covered by similarly sized boxes.
username135 2 hours ago

SAS has entered the chat