This article was originally adapted from a podcast, which you can check out here.
Two weeks ago for Five-Minute Friday, I covered the highest-paying programming languages for data scientists based on the results of O’Reilly’s 2021 Data/AI Salary Survey. Last week we used Five-Minute Friday to get our definitions of data tools and data frameworks straight so that today we could dig into the highest-paying data tools, while next week, in turn, we’ll tackle the highest-paying data platforms. If you get through today’s episode and don’t feel 100% clear about what a data tool is, then consider popping back to Episode #522 to clarify.
The most widely used tool in the survey, used by nearly a third of respondents, was Microsoft’s Excel program for working with data in spreadsheets. Despite its popularity, Excel, along with other point-and-click tools in the survey, was associated with a below-average salary. Specifically, the mean across all respondents was $146k, but those who indicated that they used Excel were paid on average $8k/year less, at $138k.
The three next-most popular tools after Excel were the Python-based software libraries scikit-learn, TensorFlow, and PyTorch. More specifically, scikit-learn is used by a little over a quarter of respondents while TensorFlow and PyTorch are each used by about a fifth. In contrast with Excel, however, all three of these popular machine learning-focused libraries were associated with above-average salaries. PyTorch and TensorFlow in particular were associated with a juicy salary pop of about $20k above the overall mean, coming in at $166k for PyTorch and $164k for TensorFlow. The scikit-learn jump was about half as large, at an average of $11k above the $146k overall mean.
Interestingly, expertise with almost any data tool was associated with above-average pay, the exceptions being Excel, Stata, and tools provided by the once-prestigious computing giant IBM. Since these tools are all commercial, a general conclusion we can draw across all of these results is that familiarity with commercial tools tends to be associated with below-average salaries, while familiarity with open-source tools tends to be associated with above-average salaries.
Ok, so we’ve covered the most popular tools now as well as the ones associated with below-average salaries. On the flipside, the highest-paying tools of all were relatively unpopular, which makes sense because the average of a smaller group is noisier, so it can more easily land far from the overall mean across all groups.
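To make the point about smaller groups concrete, here is a minimal simulation sketch. It is not based on the survey's raw data: the $50k spread of individual salaries and the group sizes of 30 and 300 are assumptions picked purely for illustration, and salaries are assumed to be normally distributed. Every simulated person is drawn from the same distribution, so any gap between a group's average and $146k is pure chance.

```python
import random
import statistics

OVERALL_MEAN_K = 146  # overall mean salary from the survey, in $k
ASSUMED_SD_K = 50     # hypothetical spread of individual salaries, in $k


def spread_of_group_means(group_size, trials=2000, seed=0):
    """Standard deviation of the average salary across many simulated groups."""
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        # Everyone comes from the same distribution, whatever "tool" they use.
        salaries = [rng.gauss(OVERALL_MEAN_K, ASSUMED_SD_K) for _ in range(group_size)]
        means.append(statistics.fmean(salaries))
    return statistics.stdev(means)


small = spread_of_group_means(group_size=30)   # stand-in for a niche tool
large = spread_of_group_means(group_size=300)  # stand-in for a popular tool
print(f"typical drift of the group average: small ${small:.1f}k, large ${large:.1f}k")
```

Under these assumptions the small groups' averages wander several times further from $146k than the large groups' averages do, even though nobody is actually paid differently. That is a reason to read the salary figures for niche tools with some caution.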
The tool associated with the highest pay of all was H2O, an open-source machine learning tool, which is used by only 3% of respondents. Those respondents had an average pay of $183k, a whopping $37k above the overall mean salary. It’s a similar story for second-place KNIME, an open-source analytics tool that is used by only 2% of respondents but has an average pay of $180k.
The third- and fourth-ranked tools for pay are both from the Apache Spark ecosystem, and Spark is a framework we’ll talk about more next week. Specifically, Spark NLP is used by only 5% of respondents and was just a grand behind KNIME, with average compensation of $179k, while Spark MLlib is used by nearly a tenth of respondents, who had average compensation of $175k.
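If you want to see the figures quoted so far side by side, the short sketch below tabulates each tool's average salary against the $146k overall mean. The numbers are the ones mentioned in this episode; the scikit-learn value of $157k is derived from the stated $11k premium rather than quoted directly, and the `premium_k` helper name is just an illustrative choice.

```python
OVERALL_MEAN_K = 146  # overall mean salary across all respondents, in $k

# Average salary by tool, in $k, as quoted in this episode.
mean_salary_k = {
    "Excel": 138,
    "scikit-learn": 157,  # derived: $146k overall mean + $11k premium
    "TensorFlow": 164,
    "PyTorch": 166,
    "Spark MLlib": 175,
    "Spark NLP": 179,
    "KNIME": 180,
    "H2O": 183,
}


def premium_k(tool):
    """Difference between a tool's average salary and the overall mean, in $k."""
    return mean_salary_k[tool] - OVERALL_MEAN_K


# List the tools from highest to lowest average salary.
for tool, salary in sorted(mean_salary_k.items(), key=lambda kv: kv[1], reverse=True):
    diff = premium_k(tool)
    print(f"{tool:<13} ${salary}k  ({diff:+d}k, {diff / OVERALL_MEAN_K:+.1%})")
```

These are associations between tool usage and pay across survey respondents, so the premiums should not be read as the raise you would get from learning a given tool.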
Honorable mention goes to spaCy, a Python library for working with natural language data, which came sixth in the survey. That puts it ahead of the more popular Python libraries scikit-learn, TensorFlow, and PyTorch, though it wasn’t associated with salaries quite as high as H2O, KNIME, or the Spark tools.
Overall, similar to the programming languages we looked at back in Episode #520, the general conclusion to draw is that employers seem to be willing to pay a premium for expertise with relatively new open-source software tools that are generating a lot of buzz, especially if people who are already familiar with these tools are hard to come by.
Ok, so we’ve now covered the highest-paying programming languages and the highest-paying data tools. Next week we’ll conclude the series of Five-Minute Fridays on this compensation topic by covering the highest-paying data platforms — like Spark, which we already mentioned, as well as others like Kafka, Hadoop, and Dask.
If you’d like to check out the full salary report from O’Reilly in the meantime we’ve included a link in the show notes. We’ve also included links to all of the data tools mentioned in today’s episode.
All right, that’s it for today. Keep on rockin’ it out there folks and I’ll catch you on another round of SuperDataScience very soon.