I'm glad to see the quick convergence on Python. I've evaluated Python every ~2 years since 2006 for "data science" tasks (machine learning, statistics, data munging, and visualization). I'd argue that Python only properly covered this full data science stack 1.5-2 years ago. R covered this stack adequately probably around 2011-2012. Matlab had this before 2005.
What makes Python a superior language to Matlab and R is the ease of software development. It's an easy, pleasant language to work in, and I trust it for production tasks (I've written production R code and it's fairly hard to read and fragile).
What's even better is that data science is moving 100% onto Python 3 (from 2) by 2020: http://www.python3statement.org/
If you work in a team mixing data engineers and data scientists, then Python has been the superior choice for a decade, as the different team members are all using the same language both to build the platform and to use it.
We've had a lot of success weaning people off R and MATLAB in finance (so replace "data scientist" with "quant"). Of course, if you work in a field that doesn't have the libraries and can't build them in house, then your mileage will certainly vary.
This is a great point. Python has been a solid data engineer/pipeline language for years now. It makes it easy for data scientists and data engineers (or people who wear both hats) to work in the same environment and codebase.
In my team, "data engineer" refers to someone who deals with the ETL process, but also some feature engineering (i.e., designing and extracting features). There's a lot of crossover here with the work data scientists do, but often the tools are different: for example, in the Python world the data engineer will use PyCharm while the data scientist will use Jupyter.
At some point, people realized that 90% of data science work was building pipes and keeping them clean. And then data engineering was invented so that the data scientists wouldn't have to get their hands as dirty. :)
A snappier version which gives the flavour: Data scientists use {TensorFlow, scikit-learn, Parquet, Hadoop, Spark, scrapers, etc}. Data engineers write {TensorFlow, scikit-learn, Parquet, Hadoop, Spark, scrapers, etc}.
> What makes Python a superior language to Matlab and R is the ease of software development.
In practice I think the trends we're seeing have more to do with the fact that most universities now teach CS and data science using Python.
Given that Python is approximately as good as R, and it's becoming much easier to find good people to hire, there's very little reason not to be a Python shop.
"R is a shockingly dreadful language for an exceptionally useful data analysis environment. The more you learn about the R language, the worse it will feel. The development environment suffers from literally decades of accretion of stupid hacks from a community containing, to a first-order approximation, zero software engineers."
The first time I tried to read R code I was immediately disgusted with how '.' is a valid variable name character. Not only valid, but inexplicably also in common use.
My theory is that they might have been trying to protest against the non-mathematical theft of = (equality) to mean assignment, as = is used in most other languages. (Not that that's an excuse for violating the shit out of the Principle of Least Surprise in a trillion different ways.)
It's immature to rail on R for things like how its syntax doesn't look like that other language you already know, but yes, it's extremely quirky. And its unchecked flexibility, while great for experts providing DSLs and such, enables non-programmers to produce garbage implementations. The vast majority of R's myriad community packages (R's biggest strength!) seem to be written by a grad student who read half a tutorial.
Julia is great, but IMO, it needs to reach 1.0 first before making a splash. It would be frustrating if you found your tools developed on v0.4 stopped working on v0.5 released a year later. API/ABI stability is critical to the adoption of new programming languages/libraries.
+ The versatility of Python. It is one of the "funnest" languages to work with because I'm always surprised by the number and variety of packages available to do all kinds of tasks.
The issue is what is "data science" really? In what respect is it different from traditional statistics and data analysis and not just a new buzzphrase?
Probably many jobs using SAS could be considered "data science" but don't use the specific buzz words and phrases that the author specifies in his methodology to identify "data science" jobs. Thus, the headline that "R Passes SAS" could be inaccurate, except in the sense that R is more popular among statistics and data analysis jobs that use "data science" buzzwords and phrases.
SAS is also a huge iceberg, without much in the way of open source culture that lends itself to visibility. A lot of SAS happens behind closed doors at megacorps.
Even some of the mega corps want to get away from SAS. A friend of mine who does data analytics at the power company told me they want to bring on more Python developers despite their regulators wanting things done in SAS.
Also notable that so many new shops are foregoing it entirely. SAS, and its price tag, is a holdover from the days when 'analytics' was an afterthought for companies looking to maximize profit.
Now that much less mature companies are realizing the value of 'analytics' (I hate that word) SAS's cost doesn't really make sense.
My company uses SAS extensively, and a large part of that is legacy (for the same reason we still have a heap of Fortran code). SAS has been in use here since the 80's. I have gone as far as porting some (non-performance-sensitive) C++ models into SAS because people in my org understand the language better.
I don't actually hate SAS and I'm very productive using it but at the same time I do feel that not knowing R limits my opportunities if I ever want to change jobs and work for an outside organisation.
We don't strictly use SAS for analytics. A big part of our SAS use is the "BI" side; I don't know if that acronym is still in vogue, but I'm talking about ad-hoc querying, reporting, etc. The kind of stuff one step above what you'd use a spreadsheet for, if that makes sense.
I think where SAS excels is they have made it very easy for non experts to be productive with it. Kind of similar to MATLAB in engineering world if people are familiar with that.
A lot of non-statisticians and non-programmers use it at my work (my background is engineering). Accountants, managers, mechanical engineers, etc. are all pretty capable of using Enterprise Guide to run ad-hoc queries and generate reports and the like. The only other similar tool I'm aware of is IBM's Cognos. We used to use both packages (as well as Microsoft Access), but about 10 years ago the business agreed to standardize around SAS. I've heard there is a similar tool to Enterprise Guide in the R world (RStudio, I think?), but I'm not all that familiar with it; I've heard it behaves more like an IDE rather than a drag-and-drop way to construct queries, graphs/reports, etc.
If anyone has made the transition from SAS (or Cognos) to R (especially for a large org) I'd be keen to hear what tools you'd recommend and how the business found it?
RStudio is an IDE. If you want a GUI for R, check out Rattle.
My shop is going the opposite direction, from R/excel to SAS+Cognos. I'm not happy about that decision, but the pay is good and the problems are still interesting.
Definitely agreed. But the word analytics means too many different things to too many different people at this point. I'm in favor of more words to differentiate each part.
I recently finished grad school and accepted a data scientist job a couple weeks ago in part because the job description mentioned expertise with Julia as one of the preferred qualifications (it's rare to see a listing that mentions Julia). It's the language I used for the bulk of my research over the last four years, and has been improving rapidly since it was released. I like Julia a lot more than Python, and I hope it continues rising in popularity. I think once it hits v1.0, we'll begin to see a lot more companies adopting its usage for data science, statistics, and machine learning.
> I think once it hits v1.0, we'll begin to see a lot more companies adopting its usage for data science, statistics, and machine learning.
I don't. It doesn't have the incumbency of R, the use in other areas of programming that Python has, or a company actively marketing it like MATLAB. It's not 5x or 10x better than the alternatives, or whatever it would take to assert itself in the playing field.
If it means you get your work done, by all means use it. But I think it will stay around Clojure levels of use in data science, statistics, and machine learning.
Once you can compile a Julia app (front end, back end, and probabilistic programming/ML) to WebAssembly and have it run in the browser and on mobile, it will skyrocket in popularity.
Why? The only current benefit of Julia, AFAICT, is the tracing JIT. If you run the tracing JIT on WebAssembly, it gives up most of its performance benefits. And Python could be built on WebAssembly as well.
But who knows. Weird things seem to become popular despite all the negative points.
Forgive me, it's not a tracing JIT but just LLVM's JIT.
Precompiling in Julia is far from straightforward. You would think you could just use --compile and it would work, but it doesn't at all.
Also, at ~850kb, Python's runtime is not that hefty. It's intended to be embedded, and while it's quite a bit larger than Lua's 200kb, it's smaller than libjulia's 16mb.
I'm surprised that the author combines "the C languages" saying that most adverts that mention any of C/C++/C# mention all three. In my experience there is a large difference between companies searching for C# developers and those searching for C/C++. After blurring this distinction he concludes that R and Python are "very different languages" while I consider them to be largely overlapping.
Mostly agree, but I've also seen a fair bit of C/C++ skills associated with jobs involving *nix development environments. I've rarely seen C++ associated with embedded jobs (even though I've read that it can certainly be used if care is taken to avoid things like dynamic dispatch, etc)
I hate these sorts of comparisons. You've got R, Python, SAS... okay, those are sort of similar. Then you've got Java and "C, C++ or C#," and man, including C# with C/C++ is... fraught. Then you've got Hadoop, Spark, Hive... okay, those are all kind of different from what we've had so far. Now you've got Tableau and RapidMiner. Uh. In the second chart, "Microsoft" is included as a keyword. Okay. It's just... comparing apples, oranges, bananas, grapes and pears. What's it supposed to tell us?
"Microsoft was a difficult search since it appears in data science ads that mention other Microsoft products such as Windows or SQL Server. To eliminate such over-counting, I treated Microsoft different from the rest by including product names such as Azure Machine Learning and Microsoft Cognitive Toolkit. So there’s a good chance I went from over-emphasizing Microsoft to under-emphasizing it with only 157 jobs."
Okay, but what does that mean? Microsoft Cognitive Toolkit is like Tensorflow, which is included as an item on the list. Azure Machine Learning is something you can script with R or Python, and allows you to create APIs for predictive use on Azure. What good does lumping those together do? And what does it tell us that some things that are basically an R/Python library are less popular than R or Python themselves? It's this weird, uneven mix of things. Some are programming languages, some are libraries or frameworks, some are end-user products like Tableau.
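The over- and under-counting problem the quoted methodology wrestles with is easy to reproduce. A minimal sketch, with made-up job-ad text (none of this is the article's actual data), of why naive keyword matching misbehaves:

```python
import re

# Hypothetical job-ad snippets, invented for illustration.
ads = [
    "Data scientist: R, Python, SQL Server on Windows required",
    "ML engineer: TensorFlow and Azure Machine Learning experience",
    "Analyst: Excel, Tableau, and the Microsoft Office suite",
]

def count_ads(keyword, ads):
    """Count ads containing the keyword, naive case-insensitive match."""
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    return sum(1 for ad in ads if pattern.search(ad))

# A bare "Microsoft" search counts the Office ad, which says nothing
# about Microsoft's data science tooling, while a product-name search
# like "Azure Machine Learning" misses ads that only say "SQL Server"
# or "Windows".
print(count_ads("Microsoft", ads))               # 1 (the Office ad)
print(count_ads("Azure Machine Learning", ads))  # 1 (the ML ad)
```

Either keyword choice measures something, but not the same something, which is exactly the apples-and-oranges complaint above.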
This is pretty much what data science in the real world is like.
Define the question you are interested in (in this case, a somewhat reasonable attempt to compare R/Python/SAS) and then put other things in blobs with a note that says "this is what this blob is." Enjoy.
Just playing around with the search terms from that second linked article is also interesting: for many terms ("machine learning", "data science", "predictive modeling", some others), Amazon has the largest number of job listings from a single company. For "machine learning", Amazon shows 1706 listings out of 12499, or almost 14% of all listings. The way Amazon pops out in other data science term searches is also interesting; at least in their job listings, Amazon seems to really be attempting to slurp up candidates with deeper data and stats skills.
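The quoted share checks out; a quick arithmetic sanity check on the two figures from the comment above:

```python
amazon_ml = 1706   # Amazon's "machine learning" listings (from the comment)
total_ml = 12499   # all "machine learning" listings
share = amazon_ml / total_ml
print(f"{share:.1%}")  # 13.6%, i.e. "almost 14%"
```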
For some time I have been somewhat cynical about data science. My impression has been that much of what is pushed as data science jobs is thinly veiled data reporting gigs (just plain old business intelligence). While I still think data science is over-hyped, I think I need to reconsider just how critical it will be as a knowledge base or skill set. While there may not be a large number of deep learning jobs out there, the expectation that a data hacker can perform a linear or logistic regression against a set of gathered and cleaned data may be closer to fizz buzz than I previously assumed.
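For a sense of how low that bar is, here is a minimal sketch of the fit-a-logistic-regression task on synthetic stand-in data (assumes NumPy and scikit-learn are installed; the data is invented, not from any real job):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for "gathered and cleaned" data: two features,
# a binary label, linearly separable by construction.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# The whole "fizz buzz"-level exercise: fit the model and report accuracy.
model = LogisticRegression().fit(X, y)
print(f"training accuracy: {model.score(X, y):.2f}")
```

In practice the gathering and cleaning is the hard part; the fit itself really is a couple of lines.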
I am teaching an introductory programming class (using Python) this semester and students are definitely focused on data science as a career track.
I once worked on a piece of software communicating with a SAS instance (doing live decision management), but everything about it seemed sketchy. No one using it really liked it, and its internals always appeared like a black box to me. Also, either it is terribly engineered or the devs working with it were just bad: we wanted a JSON-based REST API, but they said that it's "not possible with SAS", so we fell back to badly organized HTTP calls with XML.
Does anybody have some insights about internal quality and code "health" in SAS?
From experience I've found the "core" parts (SAS Base, SAS/GRAPH, OR, ETS, etc.) are very stable. Other parts we have had issues with in the past: stored processes, and LSF for scheduled jobs.
SAS tech support have been pretty good though I believe my workplace pays a lot for the privilege of being able to contact them direct.
One thing I have found is you can usually throw something together in a pretty "hackish" manner, but typically there is a better, more optimal method of doing it which will be much more stable. Sometimes it doesn't hurt to ask tech support "What is the recommended method for doing...?" Code which abuses the SAS macro language is especially notorious and a good candidate for asking this.
A quick google tells me first-class JSON support is pretty new (Dec 2016).
Nothing exciting to report. It was a resource-sizing problem and they didn't adequately separate prod and dev environments. Someone choked the dev and the prod went down with it.