

On the growth of R and Python for data science

A recent article by Matt Asay claims that "Python is displacing R as the language for data science". Python has certainly made great strides in recent years, evolving from a data processing tool (an area where Python excels) into a data analysis tool. The Pandas project, in particular, has greatly expanded Python's ability to handle statistical data sets (introducing an object akin to R's data frame) and has added some time series handling tools. But Python is still a long, long way from supporting the range of statistical procedures in the core R language, let alone those provided by the 5,000 community-contributed packages on CRAN.

Asay's article is heavy on anecdote but light on actual data to support its claim. (ComputerWorld's Sharon Machlis does a great job pointing out the irony there.) Nonetheless, data do exist on R and Python usage; while there's no user-registration data for open-source projects, secondary sources can provide intelligence on how open-source projects are being used. RStudio's Hadley Wickham uses data from the developer Q&A site StackOverflow to chart the number of open questions asked per month about R and Python, as a proxy for active usage:

[Chart: StackOverflow questions asked per month about R and about Python]

Because Python is a general-purpose data processing tool, it's no surprise that it has more activity than the domain-specific analytics language R. But it's clear that both are growing explosively (Wickham describes the growth as "very close to being exponential"). Looking closer, though, we see that the number of R questions, as a fraction of Python questions, is also growing rapidly:

[Chart: R questions as a fraction of Python questions on StackOverflow, by month]

This belies the claim that Python is displacing R. In fact, this chart suggests the reverse is true, and that R usage is growing at a faster rate than Python usage.

More data points come from user surveys. In the 2013 KDnuggets poll of top languages for analytics, data mining and data science, R was the most-used software for the third year running (60.9%), with Python in second place (38.8%). More tellingly, R's usage grew more than three times as fast as Python's in 2013 versus 2012 (8.4 percentage points for R, compared to 2.7 percentage points for Python).

It's a similar story on the community side. R has more than 125 active user groups worldwide, and the number of user group meetings has increased by 41% in the last year. Python has around 400 user groups (I couldn't find stats on the growth rate), but RedMonk's Stephen O'Grady compares the communities devoted to data science:

"At RedMonk, we typically bet on the bigger community, but that’s not as easy here. Python’s total community is obviously much larger, but it seems probable that R’s community, which is more or less strictly focused on data science, is substantially larger than the subset of the Python community specifically focused on data."

My personal take is that there's more than enough room for both Python and R. As the data science boom continues, both will continue to grow as more and more practitioners enter the world of statistical computing. Some (especially those who come from a computer-science background) will choose Python. Those who come from a statistics or data science background will choose R (or will have already learned R in their studies). And even some who come from the die-hard developer community will end up loving R. But both communities will continue to advance the art of data science and, as open-source communities, will inevitably cross-pollinate each other. R has already influenced Python in the realm of data analysis, and it would be no bad thing if Python were to influence R in other areas. That, after all, is the beauty of open-source software.
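For readers who want to check the survey arithmetic quoted above, here is a minimal Python sketch. It uses only the four KDnuggets figures cited in the post. The 2012 shares are not quoted in the post; they are inferred here by subtraction, which is an assumption, and the helper names are purely illustrative.

```python
# Share of poll respondents using each language in 2013, in percent (quoted in the post).
share_2013 = {"R": 60.9, "Python": 38.8}

# Change versus 2012, in percentage points (quoted in the post).
change_pp = {"R": 8.4, "Python": 2.7}


def implied_2012_share(language):
    """Assumption: the 2012 share is the 2013 share minus the quoted change."""
    return round(share_2013[language] - change_pp[language], 1)


def relative_growth(language):
    """Growth relative to the implied 2012 share, as a fraction."""
    return change_pp[language] / implied_2012_share(language)


# Ratio of the two absolute gains, measured in percentage points.
pp_ratio = change_pp["R"] / change_pp["Python"]

if __name__ == "__main__":
    print(f"Gain in percentage points, R vs Python: {pp_ratio:.1f}x")
    for language in share_2013:
        print(
            f"{language}: implied 2012 share {implied_2012_share(language)}%, "
            f"relative growth {relative_growth(language):.1%}"
        )
```

Measured in percentage points, R's gain is a little over three times Python's. Measured relative to the implied 2012 base, the gap narrows to roughly two to one, so the size of the headline multiple depends on which measure is used.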


More Stories By David Smith

David Smith is Vice President of Marketing and Community at Revolution Analytics. He has a long history with the R and statistics communities. After graduating with a degree in Statistics from the University of Adelaide, South Australia, he spent four years researching statistical methodology at Lancaster University in the United Kingdom, where he also developed a number of packages for the S-PLUS statistical modeling environment. He continued his association with S-PLUS at Insightful (now TIBCO Spotfire), overseeing the product management of S-PLUS and other statistical and data mining products.

David Smith is the co-author (with Bill Venables) of the popular tutorial manual, An Introduction to R, and one of the originating developers of the ESS: Emacs Speaks Statistics project. Today, he leads marketing for REvolution R, supports R communities worldwide, and is responsible for the Revolutions blog. Prior to joining Revolution Analytics, he served as vice president of product management at Zynchros, Inc. Follow him on Twitter at @RevoDavid.