I'm glad to see the quick convergence on Python. I've evaluated Python every ~2 years since 2006 for "data science" tasks (machine learning, statistics, data munging, and visualization). I'd argue that Python only properly covered this full data science stack 1.5-2 years ago. R covered this stack adequately probably around 2011-2012. Matlab had this before 2005.
What makes Python a superior language to Matlab and R is the ease of software development. It's an easy, pleasant language to work in, and I trust it for production tasks (I've written production R code and it's fairly hard to read and fragile).
What's even better is that data science is moving 100% onto Python 3 (from 2) by 2020: http://www.python3statement.org/
If you work in a team mixing data engineers and data scientists, then Python has been the superior choice for a decade, as the different team members are all using the same language both to build the platform and to use it.
We've had a lot of success weaning people off R and MATLAB in finance (so replace "data scientist" with "quant"). Of course, if you work in a field that doesn't have the libraries and can't build them in house, then your mileage will certainly vary.
This is a great point. Python has been a solid data engineer/pipeline language for years now. It makes it easy for data scientists and data engineers (or people who wear both hats) to work in the same environment and codebase.
In my team, "data engineer" refers to someone who deals with the ETL process, but also some feature engineering (i.e., designing and extracting features). There's a lot of crossover here with the work data scientists do, but often the tools are different: for example, in the Python world the data engineer will use PyCharm while the data scientist will use Jupyter.
At some point, people realized that 90% of data science work was building pipes and keeping them clean. And then data engineering was invented so that the data scientists wouldn't have to get their hands as dirty. :)
A snappier version which gives the flavour: Data scientists use {TensorFlow, scikit-learn, Parquet, Hadoop, Spark, scrapers, etc}. Data engineers write {TensorFlow, scikit-learn, Parquet, Hadoop, Spark, scrapers, etc}.
> What makes Python a superior language to Matlab and R is the ease of software development.
In practice I think the trends we're seeing have more to do with the fact that most universities now teach CS and data science using Python.
Given that Python is approximately as good as R, and it's becoming much easier to find good people to hire, there's very little reason not to be a Python shop.
"R is a shockingly dreadful language for an exceptionally useful data analysis environment. The more you learn about the R language, the worse it will feel. The development environment suffers from literally decades of accretion of stupid hacks from a community containing, to a first-order approximation, zero software engineers."
The first time I tried to read R code I was immediately disgusted with how '.' is a valid variable name character. Not only valid, but inexplicably also in common use.
My theory is that they might have been trying to protest against the non-mathematical theft of = (equality) to mean assignment, as = is used in most other languages. (Not that that's an excuse for violating the shit out of the Principle of Least Surprise in a trillion different ways.)
It's immature to rail on R for things like how its syntax doesn't look like that other language you already know, but yes, it's extremely quirky. And its unchecked flexibility, while great for experts providing DSLs and such, enables non-programmers to produce garbage implementations. The vast majority of R's myriad community packages (R's biggest strength!) seem to be written by a grad student who read half a tutorial.
Julia is great, but IMO, it needs to reach 1.0 first before making a splash. It would be frustrating if you found your tools developed on v0.4 stopped working on v0.5 released a year later. API/ABI stability is critical to the adoption of new programming languages/libraries.
+ The versatility of Python. It is one of the "funnest" languages to work with because I'm always surprised by the number and variety of packages available to do all kinds of tasks.
The issue is what is "data science" really? In what respect is it different from traditional statistics and data analysis and not just a new buzzphrase?
Probably many jobs using SAS could be considered "data science" but don't use the specific buzz words and phrases that the author specifies in his methodology to identify "data science" jobs. Thus, the headline that "R Passes SAS" could be inaccurate, except in the sense that R is more popular among statistics and data analysis jobs that use "data science" buzzwords and phrases.
SAS is also a huge iceberg, without much in the way of open source culture that lends itself to visibility. A lot of SAS happens behind closed doors at megacorps.
Even some of the mega corps want to get away from SAS. A friend of mine who does data analytics at the power company told me they want to bring on more Python developers despite their regulators wanting things done in SAS.
Also notable that so many new shops are foregoing it entirely. SAS, and its price tag, is a holdover from the days when 'analytics' was an afterthought for companies looking to maximize profit.
Now that much less mature companies are realizing the value of 'analytics' (I hate that word) SAS's cost doesn't really make sense.
My company uses SAS extensively, and a large part of that is legacy (for the same reason we still have a heap of Fortran code). SAS has been in use here since the 80's. I have gone as far as porting some (non-performance-sensitive) C++ models into SAS because people in my org understand the language better.
I don't actually hate SAS and I'm very productive using it but at the same time I do feel that not knowing R limits my opportunities if I ever want to change jobs and work for an outside organisation.
We don't strictly use SAS for analytics. A big part of our SAS use is the "BI" side; I don't know if that acronym is still in vogue, but I'm talking about ad-hoc querying, reporting, etc. The kind of stuff one step above what you'd use a spreadsheet for, if that makes sense.
I think where SAS excels is they have made it very easy for non experts to be productive with it. Kind of similar to MATLAB in engineering world if people are familiar with that.
A lot of non-statisticians and non-programmers use it at my work (my background is engineering). Accountants, managers, mechanical engineers, etc. are all pretty capable of using Enterprise Guide to run ad-hoc queries and generate reports and the like. The only other similar tool I'm aware of is IBM's Cognos. We used to use both packages (as well as Microsoft Access), but about 10 years ago the business agreed to standardize around SAS. I've heard there is a similar tool to Enterprise Guide in the R world (RStudio, I think?), but I'm not all that familiar with it; I've heard it behaves more like an IDE rather than a drag-and-drop way to construct queries, graphs/reports, etc.
If anyone has made the transition from SAS (or Cognos) to R (especially for a large org) I'd be keen to hear what tools you'd recommend and how the business found it?
RStudio is an IDE. If you want a GUI for R, check out Rattle.
My shop is going the opposite direction, from R/excel to SAS+Cognos. I'm not happy about that decision, but the pay is good and the problems are still interesting.
Definitely agreed. But the word analytics means too many different things to too many different people at this point. I'm in favor of more words to differentiate each part.
I recently finished grad school and accepted a data scientist job a couple weeks ago in part because the job description mentioned expertise with Julia as one of the preferred qualifications (it's rare to see a listing that mentions Julia). It's the language I used for the bulk of my research over the last four years, and has been improving rapidly since it was released. I like Julia a lot more than Python, and I hope it continues rising in popularity. I think once it hits v1.0, we'll begin to see a lot more companies adopting its usage for data science, statistics, and machine learning.
> I think once it hits v1.0, we'll begin to see a lot more companies adopting its usage for data science, statistics, and machine learning.
I don't. It doesn't have the incumbency of R, the use in other areas of programming that Python has, or a company actively marketing it like MATLAB. It's not 5x or 10x better than the alternatives, or whatever it would take to assert itself in the playing field.
If it means you get your work done, by all means use it. But I think it will stay around Clojure levels of use in data science, statistics, and machine learning.
Once you can compile a Julia app (front end, back end, and probabilistic programming/ML) to WebAssembly and have it run in the browser and on mobile, it will skyrocket in popularity.
Why? The only current benefit of Julia, AFAICT, is the tracing JIT. If you run the tracing JIT on WebAssembly, it gives up most of its performance benefits. And Python could be built on WebAssembly as well.
But who knows. Weird things seem to become popular despite all the negative points.
Forgive me, it's not a tracing JIT but just LLVM's JIT.
Precompiling in Julia is far from straightforward. You would think you could just use --compile and it would work, but it doesn't at all.
Also, at ~850kb, Python's runtime is not that hefty. It's intended to be embedded, and while it's quite a bit larger than Lua's 200kb, it's smaller than libjulia's 16mb.
I'm surprised that the author combines "the C languages" saying that most adverts that mention any of C/C++/C# mention all three. In my experience there is a large difference between companies searching for C# developers and those searching for C/C++. After blurring this distinction he concludes that R and Python are "very different languages" while I consider them to be largely overlapping.
Mostly agree, but I've also seen a fair bit of C/C++ skills associated with jobs involving *nix development environments. I've rarely seen C++ associated with embedded jobs (even though I've read that it can certainly be used if care is taken to avoid things like dynamic dispatch, etc)
I hate these sorts of comparisons. You've got R, Python, SAS... okay, those are sort of similar. Then you've got Java and "C, C++ or C#," and man, including C# with C/C++ is... fraught. Then you've got Hadoop, Spark, Hive... okay, those are all kind of different from what we've had so far. Now you've got Tableau and RapidMiner. Uh. In the second chart, "Microsoft" is included as a keyword. Okay. It's just... comparing apples, oranges, bananas, grapes and pears. What's it supposed to tell us?
"Microsoft was a difficult search since it appears in data science ads that mention other Microsoft products such as Windows or SQL Server. To eliminate such over-counting, I treated Microsoft different from the rest by including product names such as Azure Machine Learning and Microsoft Cognitive Toolkit. So there’s a good chance I went from over-emphasizing Microsoft to under-emphasizing it with only 157 jobs."
Okay, but what does that mean? Microsoft Cognitive Toolkit is like Tensorflow, which is included as an item on the list. Azure Machine Learning is something you can script with R or Python, and allows you to create APIs for predictive use on Azure. What good does lumping those together do? And what does it tell us that some things that are basically an R/Python library are less popular than R or Python themselves? It's this weird, uneven mix of things. Some are programming languages, some are libraries or frameworks, some are end-user products like Tableau.
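The over- and under-counting problem the quoted methodology wrestles with is easy to reproduce. A minimal sketch, with made-up job-ad text (none of this is the article's actual data), of why naive keyword matching misbehaves:

```python
import re

# Hypothetical job-ad snippets, invented for illustration.
ads = [
    "Data scientist: R, Python, SQL Server on Windows required",
    "ML engineer: TensorFlow and Azure Machine Learning experience",
    "Analyst: Excel, Tableau, and the Microsoft Office suite",
]

def count_ads(keyword, ads):
    """Count ads containing the keyword, naive case-insensitive match."""
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    return sum(1 for ad in ads if pattern.search(ad))

# A bare "Microsoft" search counts the Office ad, which says nothing
# about Microsoft's data science tooling, while a product-name search
# like "Azure Machine Learning" misses ads that only say "SQL Server"
# or "Windows".
print(count_ads("Microsoft", ads))               # 1 (the Office ad)
print(count_ads("Azure Machine Learning", ads))  # 1 (the ML ad)
```

Either keyword choice measures something, but not the same something, which is exactly the apples-and-oranges complaint above.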
This is pretty much what data science in the real world is like.
Define the question you are interested in (in this case, a somewhat reasonable attempt to compare R/Python/SAS) and then put other things in blobs with a note that says "this is what this blob is." Enjoy.
Just playing around with the search terms from that second linked article is also interesting: for many terms ("machine learning", "data science", "predictive modeling", some others), Amazon has the largest number of job listings from a single company. For "machine learning", Amazon shows 1706 listings out of 12499, or almost 14% of all listings. The way Amazon pops out in other data science term searches is also interesting; at least in their job listings, Amazon seems to really be attempting to slurp up candidates with deeper data and stats skills.
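The quoted share checks out; a quick arithmetic sanity check on the two figures from the comment above:

```python
amazon_ml = 1706   # Amazon's "machine learning" listings (from the comment)
total_ml = 12499   # all "machine learning" listings
share = amazon_ml / total_ml
print(f"{share:.1%}")  # 13.6%, i.e. "almost 14%"
```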
For some time I have been somewhat cynical about data science. My impression has been that much of what is pushed as data science jobs is thinly veiled data reporting gigs (just plain old business intelligence). While I still think data science is over-hyped, I think I need to reconsider just how critical it will be as a knowledge base or skill set. While there may not be a large number of deep learning jobs out there, the expectation that a data hacker can perform a linear or logistic regression against a set of gathered and cleaned data may be closer to fizz buzz than I previously assumed.
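For a sense of how low that bar is, here is a minimal sketch of the fit-a-logistic-regression task on synthetic stand-in data (assumes NumPy and scikit-learn are installed; the data is invented, not from any real job):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for "gathered and cleaned" data: two features,
# a binary label, linearly separable by construction.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# The whole "fizz buzz"-level exercise: fit the model and report accuracy.
model = LogisticRegression().fit(X, y)
print(f"training accuracy: {model.score(X, y):.2f}")
```

In practice the gathering and cleaning is the hard part; the fit itself really is a couple of lines.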
I am teaching an introductory programming class (using Python) this semester and students are definitely focused on data science as a career track.
I once worked on a piece of software communicating with a SAS instance (doing live decision management), but everything about it seemed sketchy. No one using it really liked it, and its internals always appeared like a black box to me. Also, either it is terribly engineered or the devs working with it were just bad: we wanted a JSON-based REST API, but they said that it's "not possible with SAS", so we fell back to badly organized HTTP calls with XML.
Does anybody have some insights about internal quality and code "health" in SAS?
From experience I've found the "core" parts (SAS Base, SAS/GRAPH, OR, ETS, etc.) are very stable. Other parts we have had issues with in the past: stored processes, and LSF for scheduled jobs.
SAS tech support have been pretty good though I believe my workplace pays a lot for the privilege of being able to contact them direct.
One thing I have found is you can usually throw something together in a pretty "hackish" manner, but typically there is a better, more optimal method of doing it which will be much more stable. Sometimes it doesn't hurt to ask tech support "What is the recommended method for doing...?" Code which abuses the SAS macro language is especially notorious and a good candidate for asking this.
A quick google tells me first-class JSON support is pretty new (Dec 2016).
Nothing exciting to report. It was a resource-sizing problem and they didn't adequately separate prod and dev environments. Someone choked the dev and the prod went down with it.