When working with written natural language data as we do with many natural language processing models, a step we typically carry out while preprocessing the data is tokenization. In a nutshell, tokenization is the conversion of a long string of characters into smaller units that we call tokens.
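To make that idea concrete, here is a minimal sketch of one very simple tokenization scheme, assuming we treat each word and each punctuation mark as its own token. The `tokenize` function name and the regular expression are illustrative assumptions on my part; production NLP models typically rely on more sophisticated (often subword-based) tokenizers.

```python
import re


def tokenize(text: str) -> list[str]:
    """Split a string into word and punctuation tokens (illustrative scheme only)."""
    # \w+ matches a run of letters, digits, or underscores (a "word");
    # [^\w\s] matches a single character that is neither word nor whitespace (punctuation).
    return re.findall(r"\w+|[^\w\s]", text)


tokens = tokenize("Tokenization splits text into tokens.")
print(tokens)  # ['Tokenization', 'splits', 'text', 'into', 'tokens', '.']
```

Whitespace is discarded here and punctuation is kept as separate tokens; other schemes make different choices, such as splitting on characters or on subword units.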
Analyzing Blockchain Data and Cryptocurrencies
As real-time, publicly-available ledgers of transactions, blockchains provide exciting new data analytics opportunities. Kimberly Grauer leads us through the tools and approaches for blockchain analytics.
Kim:
• Is Director of Research at Chainalysis Inc., the world’s leading crypto analytics firm.
• Previously worked in an economic research and analysis group for NYC.
• Holds a Masters in Political Theory from the University of Oxford and a Master of Public Administration from the London School of Economics, and completed the General Assembly Data Science bootcamp.
Today’s episode will appeal primarily to folks who are interested in blockchains and cryptocurrencies, particularly those keen to perform data analysis on blockchain data.
In this episode, Kim details:
• The unique real-time economic-data analytics opportunities that blockchains provide.
• Examples of her own research on blockchain data, such as analyses of illegal activity and global crypto adoption.
• The tools and approaches she uses daily to analyze and report on blockchain data.
• Where the evolutions of crypto, blockchains, and data science are going together.
• Why a data science bootcamp could be exactly the right thing for you if you’re looking to break into the field.
The SuperDataScience show's available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
Imagen Video: Incredible Text-to-Video Generation
For today’s Five-Minute Friday episode, it’s my pleasure to introduce you to the Imagen Video model, published just a few weeks ago by researchers from Google.
Data Analyst, Data Scientist, and Data Engineer Career Paths
Keen to become a Data Analyst? Get promoted to Sr Data Analyst? Or explore Data Engineer/Scientist options? Shashank, a YouTube expert on these questions (>100k subscribers!), tackles them in today's episode.
Shashank:
• Has an exceptional YouTube channel focused on helping people break into a data analyst career.
• Works as a Senior Data Engineer at digital sports platform Fanatics, Inc.
• Was previously a Data Analyst at luxury retailer Nordstrom and other firms.
• Holds a degree in chemistry from Emory University in Atlanta.
Today’s episode will appeal primarily to folks who are interested in becoming a data analyst, or who are interested in transitioning from a data analyst role into a data science or data engineering role.
In this episode, Shashank details:
• How you can land an entry-level data analyst role in just a few weeks, regardless of your educational and professional background.
• The hard and soft skills you need to progress from a junior data analyst to a senior data analyst position.
• What it takes to transition from data analyst to a typically more lucrative role as a data scientist or data engineer.
• His favorite resources for learning the essential skills for data scientists.
• What he looks for when he’s interviewing candidates.
Burnout: Causes and Solutions
What really is Burnout? What causes it? And how can you prevent or treat it? Prof. Christina Maslach — world-leading researcher and author on Burnout — joins me for today's episode to unpack these questions.
Blockchains and Cryptocurrencies: Analytics and Data Applications
Today's episode introduces what Blockchains are, what Crypto is, and Data Science applications of these technologies. Philip Gradwell of globally-renowned Chainalysis Inc. is our brilliant guide.
Philip:
• Is Chief Economist at Chainalysis, the world’s leading crypto analytics firm — their analysis is regularly featured by major news outlets.
• Previously worked as Principal at Vivid Economics, where he helped grow the consulting firm to 40 people, eventually culminating in its acquisition by consulting giant McKinsey & Company.
• Holds a Master’s in Economics from UCL and a PPE degree — that’s Philosophy, Politics, and Economics — from the University of Oxford.
Today’s episode will appeal to anyone looking for an introduction to the blockchain and cryptocurrencies. It’ll hold special appeal for people keen to do data science with these technologies.
In this episode, Philip details:
• Similarities and differences between analyzing cryptocurrencies and the established fiat currencies.
• His crypto data analytics pipeline.
• How he develops data products for a wide range of users, including businesses, banks, governments, and law enforcement.
• How the blockchain facilitates innovative computing and machine learning technologies.
• What he looks for in the data scientists he hires.
OpenAI Whisper: General-Purpose Speech Recognition
One of the challenges holding machines back from approaching human-level speech recognition, as Whisper does, has been acquiring sufficiently large amounts of high-quality, labeled training data. “Labeled” in this case means audio of speech that has corresponding text associated with it. With enough of these labeled data, a machine learning model can learn to take in speech audio as an input and then output the correct corresponding text.
Tools for Deploying Data Models into Production
Today's guest is mighty Erik Bernhardsson — creator of Spotify's music recommender, prolific open-source developer, world-leading technical blogger, and now model-deployment-tool entrepreneur via Modal Labs.
Erik:
• Is the Founder and CEO of Modal Labs, a startup building innovative tools and infrastructure for data teams.
• Previously was CTO of the real estate startup Better, where he grew the engineering team from the size of 1 — himself — to 300 people.
• Was also previously an Engineering Manager at Spotify, where he created their now-ubiquitous music-recommendation algorithm.
• Is a prolific open-sourcer, having created the popular Luigi and Annoy libraries, among several others.
• Is an industry-leading blogger with posts that frequently feature on the front page of Hacker News.
Today’s episode gets deep into the weeds at points, so it will be particularly appealing to practicing data scientists, ML engineers, and the like, but much of the fascinating, wide-ranging conversation in this episode will appeal to any curious listener.
In this episode, Erik details:
• How the Spotify music recommender he built works so well at scale.
• The litany of new data science and engineering tools he’s excited about and thinks you should be excited about too.
• What open-source library he would develop next.
• Why he founded Modal Labs and how their tools empower data teams.
• Having interviewed more than 2000 candidates for engineering roles, his top tips for succeeding both as an interviewer and as an interviewee.
The Joy of Atelic Activities
You might think to yourself “I could be spending this time productively!” But pushing past these inner calls for productivity and leaning into the initial discomfort of atelic activities is likely to be rewarding. When you’re consumed by telic activities, by always pursuing outcomes, you’re missing out on being, on appreciating being alive for the fleeting moments that you have.
Causality in Sequential Data
Inferring Causality is uniquely powerful when done with Sequential Data: data unfolding over time. Forecasting guru Dr. Sean Taylor — renowned for Prophet and now Motif Analytics co-founder — leads us through the topic.
Sean:
• Is Co-Founder and Chief Scientist of Motif Analytics, a startup that blends his deep expertise in causal modeling with sequential analytics.
• Previously worked as a Data Science Manager at Lyft.
• Also worked as a Research Scientist Manager at Facebook, where he led the development of the renowned open-source forecasting tool, Prophet.
• Holds a PhD in Information Systems from New York University and a BS in Economics from the University of Pennsylvania.
Today’s episode gets deep into the weeds on occasion, particularly when discussing making causal inferences, but most of the episode will resonate with any curious listener.
In this episode, Sean:
• Publicly unveils his new venture, filling us in on why now was the right time for him to co-found and lead data science at an ML startup.
• Details what causal modeling is, why every data scientist should be familiar with it, and how it can make a real-world impact, with many illustrative examples from his time at Lyft.
• Fills us in on the infrastructure and teams required for large-scale causal experimentation.
• Covers how causal modeling and forecasting can’t be fully automated today as it requires humans to make assumptions, but also how humans can make these assumptions in a more informed manner thanks to data visualizations.
• Explains what the field of Information Systems is and, having conducted several hundred interviews, what he looks for in the data scientists he hires.
Data Science Interviews with Nick Singh
For an episode all about tips for crushing interviews for Data Scientist roles, our guest is Nick Singh — author of the bestselling "Ace the Data Science Interview" book and creator of the DataLemur SQL interview platform.
Nick:
• Co-authored “Ace the Data Science Interview”, an interview-question guide that has sold over 16,000 copies since it was released last year.
• Created the DataLemur platform for interactively practicing interview questions involving SQL queries.
• Worked as a software engineer at Facebook, Google, and Microsoft.
• Holds a BS in engineering from the University of Virginia.
Today's episode is ideal for folks who are looking to land a data science job for the first time, level-up into a more senior data science role, or perhaps land a data science gig at a new firm.
In this episode, Nick details:
• His top tips for success in data science interviews.
• Common misconceptions about data science interviews.
• How to become comfortable with self-promotion and increase your chances of landing your dream job.
• Strategies for when interviewers ask if you have any questions for them.
• The subject areas and skills you should master before heading into a data science interview.
Causal Machine Learning
Causal ML is today's focus with Dr. Emre Kiciman — Senior Principal Researcher at Microsoft, developer of the DoWhy causal modeling library for Python, and a leader in applying causal research to social sciences.
Emre:
• Has worked within prestigious Microsoft Research for over 17 years.
• Leads Microsoft’s research on Causal Machine Learning.
• Leads development of the DoWhy open-source causal modeling library for Python (part of the PyWhy GitHub project).
• Pioneered the use of social media data to answer causal questions in the social sciences, such as with respect to physical and mental health.
• Has published 100+ papers and been cited 8000+ times.
• Holds a PhD in Computer Science from Stanford University.
Today’s episode is relatively technical, so will probably appeal primarily to folks with technical backgrounds like data scientists, ML engineers, and software developers.
In this episode, Emre details:
• What Causal ML is and how it’s different from "correlational" ML.
• The four key steps of causal inference and how they impact ML.
• The types of data that are most amenable to causal methods and those that aren’t yet… but may be soon.
• Exciting real-world applications of Causal ML.
• The software tools he most highly recommends.
• What he looks for in the data science researchers he hires.
More Guests on Fridays
Going forward, we are still going to have short, five-minute-ish episodes on Friday that feature me solo, but we will increasingly be interspersing inspiring guests. And I won’t be making an effort to have these Friday guest episodes be anywhere near five minutes long. To start, I’m thinking of having them typically be 20 to 30 minutes long, but we’ll see how it goes with the guests and what the reception is like from you.
Open-Ended A.I.: Practical Applications for Humans and Machines
In today's remarkable episode, Dr. Kenneth Stanley uses evidence from his machine learning research on Open-Ended A.I. and evolutionary algorithms to inform how you as a human can achieve great life outcomes.
Ken:
• Co-authored the book "Why Greatness Cannot be Planned", a genre-defying book that leverages his ML research to redefine how a human can optimally achieve extraordinary outcomes over the course of their lifetime.
• Was until recently Open-Endedness Team Leader at OpenAI, one of the world’s top A.I. research organizations.
• Led Core A.I. Research for Uber A.I.
• With Prof. Gary Marcus and others, founded A.I. startup Geometric Intelligence, which was acquired by Uber.
• Was Professor of Computer Science at the University of Central Florida.
• Holds a dozen patents for ML innovations, including open-ended and evolutionary (especially neuroevolutionary) approaches.
Today’s episode does get fairly deep into the weeds of ML theory at points so may be best-suited to technical practitioners. That said, the broad strokes of the episode could be not only informative but, again, could indeed be life-perspective-altering for any curious listener.
In this episode, Ken details:
• What genetic ML algos are and how they work effectively in practice.
• How the Objective Paradox — that you fail to achieve an objective you seek — is common across ML and human pursuits.
• How an approach called Novelty Search can lead to better outcomes than pursuing an explicit objective, again both for machines and humans.
• What Open-Ended A.I. is and its intimate relationship with AGI, a machine with the same learning potential as a human.
• His vision for how A.I. could transform life for humans in the coming decades.
Who Dares Wins
Even if we don’t achieve what we originally set out to achieve, by having dared to achieve it, by having taken action in the direction of the achievement, we learn from the experience and we gain invaluable information about ourselves and the world. Having dared, we find ourselves at a new, enriched vantage point that we otherwise would never have ventured to. From there, whether we achieved the original goal or not, we can iterate — dare again — perhaps to achieve success at the original objective or perhaps we identify some entirely new objective that would have otherwise been inconceivable without having dared.
Data Mesh
"Data Mesh" may be the trendiest term in data science. What is it and how will its Distributed A.I. transform your organization? The founder of the Data Mesh concept herself, Zhamak Dehghani, explains in this episode.
Zhamak:
• Authored the O'Reilly Media book "Data Mesh" and also co-authored an O’Reilly book on software architecture.
• Is newly the CEO and founder of a stealth tech startup reimagining the future of the data developer experience through the Data Mesh.
• Previously worked as a software engineer, software architect, and as a technology incubation director.
• Holds a Bachelor of Engineering degree in Computer Software from the Shahid Beheshti University in Iran and a Masters in Information Technology Management from the University of Sydney in Australia.
Today’s episode should be broadly interesting to anyone who’s keen to get a glimpse of the future of how organizations will work with data and A.I.
In this episode, Zhamak details:
• What a data mesh is.
• Why data meshes are essential today and will be even more so in the coming years.
• The biggest challenges of distributed data architectures.
• Why now was the right time for her to launch her own data mesh startup.
• Her tricks for keeping pace with the rapid pace of tech progress.
Daily Habit #11: Assigning Deliverables
To ensure that deliverables are assigned, if you’re running the meeting you can formally set the final meeting agenda item to be something like “assign deliverables”. If you’re not running the meeting, you can suggest having this final agenda item to the meeting organizer at the meeting’s outset or even as the meeting begins to wrap up. By assigning deliverables in this way, we not only make the best use of everyone’s time going forward, but we also maximize the probability that all of the essential action items are actually delivered upon.
Inferring Causality with Jennifer Hill
Inferring causal direction — as opposed to merely identifying correlations — is central to all real-world data science applications. World-leading expert and author on causality, Prof. Jennifer Hill, is our guest this week.
Jennifer:
• Is Professor of Applied Statistics at New York University, where she researches causality and practical applications of causal research, such as those that are vital to scientific development and government policies.
• Co-directs the NYU Masters in Applied Statistics and directs PRIISM (a center focused on impactful social applications of data science).
• With the renowned statistician Andrew Gelman, wrote the book "Data analysis using regression and multilevel/hierarchical models", an iconic textbook that has been cited over 15k times.
• Holds a PhD in Statistics from Harvard University.
Intended audience:
• Today’s episode largely contains content that will be of interest to anyone who’s keen to better understand the critical concept of causality.
• It also contains technical parts that will appeal primarily to practicing data scientists.
In this episode, Jennifer details:
• How causality is central to all applications of data science.
• How correlation does not imply causation.
• How to design research in order to confidently infer causality from the results.
• Her favorite Bayesian and machine learning tools for making causal inferences within code.
• ThinkCausal, her new graphical user interface for making causal inferences without the need to write code.
Upskilling in Data Science and Machine Learning
This week, iconic Stanford University Deep Learning instructor and entrepreneur Kian Katanforoosh details how ML powers his EdTech platform Workera, enabling you to systematically fill gaps in your data science skills.
Kian:
• Is Co-Founder and CEO of Workera, a Bay Area education technology company that has raised $21m in venture capital to upskill workers, with a particular early focus on upskilling technologists like data scientists, software developers, and machine learning specialists.
• Is a lecturer of computer science at Stanford University (specifically, he teaches the extremely popular CS230 Deep Learning course alongside Prof. Andrew Ng, one of the world’s best-known data scientists).
• Was awarded Stanford’s highest teaching award.
• Is also a founding member of DeepLearning.AI, a platform through which he’s taught over three million students deep learning.
• Holds a Masters in Math and Computer Science from CentraleSupélec.
• Holds a Masters in Management Science and Engineering from Stanford.
By and large, today’s episode will appeal to any listener who’s keen to understand the latest in education technology, but there are parts here and there that will specifically appeal to practicing technologists like data scientists and software developers.
In this episode, Kian details:
• What a skills intelligence platform is.
• Four ways that machine learning drives his skills intelligence platform.
• What frameworks and software languages they selected for building their platform and why.
• What he looks for in the data scientists and software engineers he hires.
Geospatial Data and Unconventional Routes into Data Careers
This week, the remarkably well-read Christina Stathopoulos details open-source software for working with geospatial data... as well as how you can navigate your data-career path, no matter what your background.
Christina:
• Has worked at Google for nearly five years in several data-centric roles.
• For the past year, she’s worked as an Analytical Lead for Waze, the popular crowdsourced navigation app owned by Google.
• Is also an adjunct professor at IE Business School in Madrid, where she teaches courses on business analytics, machine learning, data visualization, and data ethics.
• Previously worked as a data engineer at media analytics giant Nielsen.
• Holds a Master’s in Business Analytics and Big Data from IE Business School and a Bachelor’s in Science, Tech, and Society from North Carolina State University.
Today’s episode will appeal to a broad audience of technical and non-technical listeners alike.
In this episode, Christina details:
• Geospatial data and open-source packages for working with it.
• Her tips for getting a foothold in a data career if you come from an unconventional background.
• Guidance to help women and other underrepresented groups thrive in tech.
• The hard and soft skills most essential to success in a data role today.
• Her #bookaweekchallenge and her top data book recommendations.