This article was originally adapted from a podcast, which you can check out here.
Three weeks ago for Five-Minute Friday, I covered the highest-paying programming languages for data scientists based on the results of O’Reilly’s 2021 Data/AI Salary Survey. Two weeks ago, we used Five-Minute Friday to get our definitions of data tools and data frameworks straight so that last week we could dig into the highest-paying data tools. This week, we’ll wrap up this series on compensation by covering the highest-paying data frameworks. If you get through today’s episode and don’t feel 100% clear about what a data framework is, then consider popping back to Episode #522 to clarify.
Right off the bat, there are two general trends with data frameworks that I’d like to highlight. The first is that, similar to what we observed with data tools last week, familiarity with any data framework, open-source or commercial, is associated with higher pay.
The second big general trend is that, similar to what we observed with data tools last week and programming languages three weeks ago, the highest salaries of all are associated with relatively new software that does not yet have a lot of users but does have a lot of buzz associated with it. It appears that employers are willing to pay a lot more when they do find one of the rare few who have expertise with these fashionable new software packages. (It’s also worth mentioning that the average of a small group can easily stray further from the overall mean than the average of a large group, so we shouldn’t draw overly strong conclusions from the rarest data frameworks.)
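To make that small-sample caveat concrete, here is a minimal simulation sketch. It is not based on the survey data: the $50k standard deviation, the normal distribution, and the group sizes of 20 and 500 are all assumptions I picked purely for illustration. Every simulated respondent is drawn from the same salary distribution centred on the $146k mean, so any gap between a group's average and $146k is down to chance alone.

```python
import random
import statistics

def spread_of_group_means(group_size, n_groups=1000, mean_k=146.0, sd_k=50.0, seed=0):
    """Standard deviation (in $k) of the average salary across many simulated groups.

    mean_k is the survey-wide mean quoted in the article; sd_k and the normal
    distribution are hypothetical assumptions, not survey figures.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    group_means = [
        statistics.fmean(rng.gauss(mean_k, sd_k) for _ in range(group_size))
        for _ in range(n_groups)
    ]
    return statistics.stdev(group_means)

# Hypothetical group sizes: a niche framework versus a popular one.
small_group_spread = spread_of_group_means(group_size=20)
large_group_spread = spread_of_group_means(group_size=500)

print(f"Spread of group averages with 20 users:  {small_group_spread:.1f}k")
print(f"Spread of group averages with 500 users: {large_group_spread:.1f}k")
```

Under these assumptions, the averages of the 20-person groups scatter several times more widely around $146k than the averages of the 500-person groups, even though nobody in the simulation is actually paid more for knowing a particular framework.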
With those general trends and the glaring small-sample-size caveat out of the way, the four data frameworks associated with the highest pay are all indeed used by fewer than 1% of survey respondents. ContentSquare, a company that has raised half a billion dollars from big-name venture capital firms like SoftBank to create an analytics platform for tracking customers’ digital experiences, came out on top: folks who use it have an average salary of $225k, which is a whopping $79k above the $146k mean across all respondents.
Michelangelo, a platform developed by the ride-sharing company Uber to deploy and operate machine learning models in production, came second. On average, people familiar with it also enjoy an enormous bump in pay: $218k versus the $146k overall mean.
ContentSquare and Michelangelo were head and shoulders above Ray (an open-source project for scaling computationally intensive Python code) and Amundsen (an open-source catalog, this time from ride-sharing giant Lyft, for storing metadata). Despite being in third and fourth amongst data frameworks, Ray and Amundsen are nevertheless both associated with average pay higher than any of the data tools covered last week or any of the programming languages covered three weeks ago. Ray came in at $191k while Amundsen was $189k.
This seems like a good juncture to reiterate that all four of the frameworks covered so far (ContentSquare, Michelangelo, Ray, and Amundsen) are used by fewer than 1% of survey respondents, so their massive salary bumps do suffer from small-sample-size issues.
In contrast, more popular data frameworks like Kafka, Spark, Google BigQuery, and Dask, which are used by between 5% and 19% of all respondents, do not suffer from small-sample-size problems but are nevertheless associated with salaries considerably above the $146k mean. Kafka leads this pack with a $179k average while Spark, BigQuery, and Dask are all around $170k.
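If you’d like to see those premiums side by side, here is a small sketch of the arithmetic. The salary figures are the averages quoted in this article, in thousands of dollars; the variable and function names are my own, and I’ve left out Spark, BigQuery, and Dask because I only have approximate figures for them.

```python
OVERALL_MEAN_K = 146  # mean salary across all survey respondents, in $k

# Average salaries (in $k) quoted in this article for each framework.
avg_salary_k = {
    "ContentSquare": 225,
    "Michelangelo": 218,
    "Ray": 191,
    "Amundsen": 189,
    "Kafka": 179,
}

def premium(salary_k, baseline_k=OVERALL_MEAN_K):
    """Return (absolute premium in $k, premium as a percentage of the baseline)."""
    diff_k = salary_k - baseline_k
    return diff_k, 100 * diff_k / baseline_k

premiums = {name: premium(salary) for name, salary in avg_salary_k.items()}

for name, (diff_k, pct) in premiums.items():
    print(f"{name:<14} +${diff_k}k ({pct:.0f}% above the overall mean)")
```

Keep in mind that these are differences between group averages, not a guarantee of what learning any one framework would add to an individual’s pay.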
Recalling that all the data frameworks are associated with at least some increase in salary relative to the mean, the relatively low performers in the bunch included older frameworks like Hadoop as well as commercial ones like Tableau, Oracle BI, and Google Analytics.
So what are the takeaway messages from all of this? In my view, relatively widely adopted but nevertheless greatly in-demand open-source frameworks for handling large-scale, distributed data streaming like Kafka, Spark, and Dask are your best bet for frameworks to consider learning next. You could also take a peek at the open-source Ray and Amundsen projects, or read up on Uber’s Michelangelo platform, to see if these still-relatively-niche frameworks are useful to any projects you’re currently tackling.
That’s the end of this four-part series of Five-Minute Fridays on the highest-paying programming languages for data scientists as well as the highest-paying data tools and data frameworks. If you’d like to check out the full salary report from O’Reilly that I based these episodes on or any of the data frameworks mentioned in this episode, we’ve included links in the show notes.