Starting today and running for four consecutive weeks, Five-Minute Friday episodes of SuperDataScience feature Ben Taylor as my guest. Each week, he answers a specific ML commercialization or education question.
The SuperDataScience show's available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
Simulations and Synthetic Data for Machine Learning
Running Simulations and generating Synthetic Data in order to create more-powerful Machine Learning models is this week's topic. Bewilderingly interesting two-time book author Mars Buttfield-Addison is our guest.
Mars:
• Is co-author of two O'Reilly Media books, "Practical Simulations for Machine Learning" and "Practical Artificial Intelligence with Swift".
• Is pursuing a PhD in computer engineering from the University of Tasmania, focused on writing high-performance software to track space objects.
• Teaches courses on A.I. and data science at the University of Tasmania.
• Is a regular speaker at top tech conferences around the world.
• Holds a bachelor’s degree in software development and data modeling.
Today’s episode should be equally fascinating to technical and non-technical folks alike.
In this episode, Mars details:
• What simulations and synthetic data are, and why they can be invaluable for real-life applications.
• How simulated bots can solve any problem by representing the problem as a 3D visualization.
• Why the mobile operating system language Swift is interesting for A.I.
• How much junk there is in space and why it’s critical we track it.
• What it’s like creating video games in a “secret” Tasmanian games lab.
• Whether programming or statistical skills are more important in data science.
• Why you might want to do a data science internship in industry if you’re thinking of having a career in academia.
Thanks to Suzanne Huston for introducing me to Mars :)
Narrative A.I. with Hilary Mason
Hilary Mason, one of the world's best-known data scientists, fills us in on A.I. systems that generate interactive story narratives and on building a thriving early-stage A.I. company. This episode was filmed live on stage — so fun!
Hilary:
• Is Co-Founder and CEO of Hidden Door, a start-up that leverages narrative A.I. to generate unique, customized dialog and graphics in real time, thereby delivering a groundbreakingly immersive video game experience.
• Was previously Founder and CEO of Fast Forward Labs, an emerging-tech research company that was acquired by Cloudera.
• Was Data-Scientist-in-Residence at Accel, a leading venture capital firm.
• Co-founded several iconic tech communities in New York such as DataGotham and HackNY.
• Studied computer science at Brown University and Grinnell College.
• Is known for sharing useful data science knowledge with the public; she has over 120k followers on Twitter and over 160k followers on LinkedIn.
The first half of today’s episode contains some technical elements but by and large the episode should be appealing to anyone who’s keen to be on the cutting edge of machine learning application and commercialization.
In today’s episode, Hilary details:
• How narrative A.I. can assist creativity.
• How to build ML products with no quantitative error function to optimize.
• How to prevent A.I. systems from outputting nonsense or explicit content.
• The emerging ML technique she’s most excited about.
• What it takes to be successful as CEO of an early-stage A.I. company.
• How she’s hopeful A.I. will transform our lives for the better in the future.
Thank you to Jared Lander and Nicole DelGiudice of the New York R Conference for providing us with an amazing live forum to host a live SDS episode and for the exceptional footage. And thanks to Claudia Perlich for introducing me to Hilary!
Data Engineering for Data Scientists
Prolific data science content creator 🎯 Mark Freeman details what Data Engineering is and why it's a critically useful subject area for data scientists to be proficient in. Hear all about it in this week's episode.
Mark:
• Is a Senior Data Scientist, with a Data Engineering specialization, at Humu (a startup that has raised $100m in venture capital).
• Posts data science and software engineering tips daily on LinkedIn.
• Previously was a data scientist at Verana Health and a data analyst at the Stanford University School of Medicine.
• Also holds a Master’s in Community Health and Prevention Research from the Stanford medical school.
Today’s episode is geared toward listeners who are already in a technical role such as data scientists, data engineers, ML engineers, or software engineers — as well as to folks who’d like to grow into these kinds of roles.
In today’s episode, Mark details:
• The differences between junior, senior, and staff data scientists.
• What it takes to get promoted into more senior data science roles.
• How data engineering differs from data science.
• His top tools for data extraction, modeling, and pipeline engineering.
• His top tip for getting hired at a fast-growing VC-backed startup.
• How behavioral nudges can drastically improve workplace experiences.
• Why all data scientists should be interested in web3.
PyMC for Bayesian Statistics in Python
Learn how Bayesian Statistics can be more powerful and interpretable than any other data modeling approach from Dr. Thomas Wiecki, a Core Developer of PyMC — the leading Bayesian software library for Python.
Thomas:
• Has been a Core Developer of PyMC for over eight years.
• Is Co-Founder and CEO of PyMC Labs, which solves commercial problems with Bayesian data models.
• Previously, he worked as VP Data Science at Quantopian Inc.
• Holds a PhD in Computational Neuroscience from Brown University.
Today’s episode is more on the technical side so will appeal primarily to practicing data scientists.
In this episode, Thomas details:
• What Bayesian statistics is.
• Why Bayesian statistics can be more powerful and interpretable than any other data modeling approach.
• How PyMC was developed and how it trains models so efficiently.
• Case studies of large-scale Bayesian stats applied commercially.
• The extra flexibility of *hierarchical* Bayesian models.
• His top resources for learning Bayesian stats yourself.
• How to build a successful company culture.
The State of Natural Language Processing
As the LaMDA "sentience" hubbub highlights, Natural Language Processing is perhaps the most exciting and rapidly accelerating area of Machine Learning. Hear all about NLP from the deep expert Rongyao HUANG.
(LaMDA is definitely not sentient, by the way... but it is an impressive display of state-of-the-art conversational machine capabilities.)
Rongyao:
• Is Lead Data Scientist at CB Insights, a market intelligence platform.
• Previously she worked as a data scientist at a number of other New York start-ups and as a quantitative research assistant at Columbia University.
• She holds a master’s in research methodology and quantitative methods from Columbia University in the City of New York.
Today’s episode is more on the technical side so will appeal primarily to practicing data scientists; however, the second half of the episode does contain general sage guidance for anyone seeking to navigate career options as well as to balance personal and professional obligations.
In today’s episode, Rongyao details:
• The evolution of NLP techniques over the past decade through to the large transformer models of today.
• The practical implications of this dramatic NLP evolution.
• How the “scaling law” will impact NLP model capabilities over the coming decade.
• The major limitations of today’s NLP approaches and how we might overcome them.
• Her Bauhaus-inspired model for effective data science.
• Her pathfinding model for making effective career choices.
• Her top tips for staying sane while juggling career and family.
Live podcast recording with Hilary Mason at New York R Conference
Thanks to data science legend Hilary Mason and the engaging audience at the New York R Conference for making Friday's live-filmed episode of the SuperDataScience podcast an exhilarating and illuminating success ⚡️
Look out for Hilary's episode as #589, which will be released on July 5th.
Bayesian, Frequentist, and Fiducial Statistics in Data Science
Harvard stats prof Xiao-Li MENG founded the trailblazing Harvard Data Science Review. We cover that and why BFFs (Bayesians, frequentists and fiducial statisticians) should be BFFs (best friends forever).
Xiao-Li:
• Is the Founding Editor-in-Chief of the Harvard Data Science Review, a new publication in the vein of the renowned Harvard Business Review.
• Has been a full professor in Harvard’s Dept of Statistics for 20+ years.
• Chaired the Harvard Stats Dept for 7 years.
• Was Dean of Harvard’s Grad School of Arts and Sciences for 5 years.
• Has published 200+ journal articles on statistics, machine learning, and data science, and been cited over 25,000 times.
• Holds a PhD in Statistics from — yep! — Harvard.
Today’s episode will be of interest to anyone who’s keen to better understand the biggest challenges and most fascinating applications of data science today.
In the episode, Xiao-Li details:
• What the Harvard Data Science Review is, why he founded it, and the most popular topics covered by the Review so far.
• The concept of “data minding”.
• Why there’s no “free lunch” with data — tricky trade-offs abound no matter what.
• The surprising paradoxical downside of having lots of data.
• What the Bayesian, Frequentist, and Fiducial schools of statistics are and when each of them is most useful in data science.
Transforming Dentistry with A.I.
Engineer and computer scientist Dr. Wardah Inam has raised $79m in venture capital to transform dentistry with machine learning. Hear about it, as well as her tips for scaling an A.I. company, in this week's episode.
Wardah:
• Is Co-Founder/CEO of Overjet, which is transforming dentistry with ML.
• Co-founded uLink Technologies, a start-up behind A.I.-driven power grids.
• Served as Lead Product Manager at Q Bio, a healthcare A.I. start-up.
• Was a Postdoc in MIT’s renowned CSAIL (Computer Science and A.I. Lab).
• Holds an MIT PhD in electrical engineering and computer science.
Today’s episode focuses more on practical applications of ML and growing an A.I. company than getting into the nitty-gritty of ML models themselves, so it should be broadly appealing to both technically-oriented and business-oriented folks.
In the episode, Wardah details:
• How Overjet not only classifies images but quantifies dental diagnoses with computer vision, enabling models to answer questions like “how large is this cavity?”
• How natural language processing can be essential for determining the correct dental diagnosis.
• The data-labeling challenges firms like Overjet need to overcome to enable ML models to learn from noisy, real-world data.
• Her tips for building a successful A.I. business.
• What she looks for in the data scientists and software engineers she hires.
Scaling A.I. Startups Globally
Sensational A.I. entrepreneur Husayn Kassai co-founded Onfido while an undergrad and served as its CEO for ten years, raising $200m in venture capital. Hear his tips for scaling your own A.I. firm in this week's episode.
Husayn:
• Co-founded the ML company Onfido in 2010, while he was an undergraduate student at the University of Oxford.
• Served as Onfido’s CEO for ten years, overseeing $200m in venture capital raised, the team growing to over 400 employees, and the client base growing to over 1500 firms.
• Holds a degree in economics and management from Oxford.
• Served as the full-time President of the Oxford Entrepreneurs student society, which is how I got to know him more than a decade ago.
Today’s episode is non-technical and will appeal to anyone who’s interested in hearing tips and tricks for building a billion-dollar A.I. start-up from scratch.
In the episode, Husayn details:
• Tips for deciding on whether you need co-founders.
• How to choose your co-founders if you need them.
• Finding product-market fit.
• How to scale up a company.
• How to identify start-up opportunities.
• Why there’s never been a better time than now to found an A.I. startup.
• A look at his next startup, which is currently in stealth.
Optimizing Computer Hardware with Deep Learning
The polymath Dr. Magnus Ekman joins me from NVIDIA today to explain how machine learning is used to guide *hardware* architecture design and to provide an overview of his brilliant book "Learning Deep Learning".
Magnus:
• Is a Director of Architecture at NVIDIA (he's been there 12 years!)
• Previously worked at Samsung and Sun Microsystems.
• Was co-founder/CTO of the start-up SKOUT (acquired for $55m).
• Authored the epic, 700-page "Learning Deep Learning".
• Holds a Ph.D. in computer engineering from the Chalmers University of Technology and a master’s in economics from Göteborg University.
Today’s episode has technical elements here and there but should largely appeal to anyone who’s interested in hearing the latest trends in A.I., particularly deep learning, across both software and hardware.
In the episode, Magnus details:
• What hardware architects do.
• How ML can be used to optimize the design of computer hardware.
• The pedagogical approach of his exceptional deep learning book.
• Which ML users need to understand how ML models work.
• Algorithms inspired by biological evolution.
• Why Artificial General Intelligence won’t be obtained by increasing model parameters alone.
• Whether transformer models will entirely displace other deep learning architectures such as CNNs and RNNs.
Automating ML Model Deployment
Relative to training a machine learning model, getting it into production typically takes multiple times as much time and effort. Dr Doris Xin, the brilliant co-founder/CEO of Linea, has a near-magical, two-line solution.
In the episode, Doris details:
• How Linea reduces ML model deployment to two lines of Python code.
• The surprising extent of wasted computation she discovered when she analyzed over 3000 production pipelines at Google.
• Her experimental evidence that the total automation of ML model development is neither realistic nor desirable.
• What it’s like being the CEO of an exciting, early-stage tech start-up.
• Where she sees the field of data science going in the coming years and how you can prepare for it.
Today’s episode is more on the technical side so will likely appeal primarily to practicing data scientists, especially those who need to deploy, or are interested in deploying, ML models into production.
Doris:
• Is co-founder and CEO of Linea, an early start-up that dramatically simplifies the deployment of machine learning models into production.
• Her alpha users include the likes of Twitter, Lyft, and Pinterest.
• Her start-up’s mission was inspired by research she conducted as a PhD student in computer science at the University of California, Berkeley.
• Previously she worked in research and software engineering roles at Google, Microsoft, Databricks, and LinkedIn.
Collaborative, No-Code Machine Learning
Emerging tools allow real-time, highly visual collaboration on data science projects — even in ways that allow those who code and those who don't to work together. Tim Kraska fills us in on how ML models enable this.
Tim:
• Is Associate Professor in the revered CSAIL lab at the Massachusetts Institute of Technology.
• Co-founded Einblick, a visual data computing platform that has received $6m in seed funding.
• Was previously a professor at Brown University, a visiting researcher at Google, and a postdoctoral researcher at Berkeley.
• Holds a PhD in computer science from ETH Zürich in Switzerland.
Today’s episode gets into technical aspects here and there, but will largely appeal to anyone who’s interested in hearing about the visual, collaborative future of machine learning.
In this episode, Tim details:
• How a tool like Einblick can simultaneously support folks who code as well as folks who’d like to leverage data and ML without code.
• How this dual no-code/Python code environment supports visual, real-time, click-and-point collaboration on data science projects.
• The clever database and ML tricks under the hood of Einblick that enable the tool to run effectively in real time.
• How to make data models more widely available in organizations.
• How university environments like MIT’s CSAIL support long-term innovations that can be spun out to make game-changing impacts.
A.I. For Crushing Humans at Poker and Board Games
The first SuperDataScience episode filmed with a live audience! Award-winning researcher Dr. Noam Brown from Meta AI was the guest, filling us in on A.I. systems that beat the world's best at poker and other games.
We shot this episode on stage at MLconf in New York. This means that you’ll hear audience reactions in real-time and, near the end of the episode, many great questions from audience members once I opened the floor up to them.
This episode has some moments here and there that get deep into the weeds of machine learning theory, but for the most part today’s episode will appeal to anyone who’s interested in understanding the absolute cutting-edge of A.I. capabilities today.
In this episode, Noam details:
• What Meta AI (formerly Facebook AI Research) is and how it fits into Meta.
• His award-winning no-limit poker-playing algorithms.
• What game theory is and how he integrates it into his models.
• The algorithm he recently developed that can beat the world’s best players at “no-press” Diplomacy, a complex strategy board game.
• The real-world implications of his game-playing A.I. breakthroughs.
• Why he became a researcher at a big tech firm instead of in academia.
Noam:
• Develops A.I. systems that can defeat the best humans at complex games that computers have hitherto been unable to succeed at.
• During his Ph.D. in computer science at Carnegie Mellon University, developed A.I. systems that defeated the top human players of no-limit poker — earning him a Science Magazine cover story.
• Also holds a master’s in robotics from Carnegie Mellon and a bachelor’s degree in math and computer science from Rutgers.
• Previously worked for DeepMind and the U.S. Federal Reserve Board.
Thanks to Alexander Holden Miller for introducing me to Noam and to Hannah Gräfin von Waldersee for introducing me to Alex!
Open-Access Publishing
This week Dr. Amy Brand, the pioneering Director of The MIT Press and executive producer of documentary films, leads discussion of the benefits of — and innovations in — open-access publishing.
In the episode, Amy details:
• What open-access means.
• Why open-access papers, books, data, and code are invaluable for data scientists and anyone else doing research and development.
• The new metadata standard she developed to resolve issues around accurate attribution of who did what for a given academic publication.
• How we can change the STEM fields to be welcoming to everyone, including historically underrepresented groups.
• What it’s like to devise and create an award-winning documentary film.
Amy:
• Leads one of the world’s most influential university presses as the Director and Publisher of the MIT Press.
• Created a new open-access business model called Direct to Open.
• Is Co-Founder of Knowledge Futures Group, a non-profit that provides technology to empower organizations to build the digital infrastructure required for open-access publishing.
• Launched MIT Press Kids, the first university+kids publishers collab.
• Was the executive producer of "Picture A Scientist", a documentary that was selected to premiere at the prestigious Tribeca Film Festival and was recognized with the 2021 Kavli Science Journalism Award.
• She holds a PhD in Cognitive Science from MIT.
Today’s episode is well-suited to a broad audience, not just data scientists.
AGI: The Apocalypse Machine
Jeremie Harris's work on A.I. could dramatically alter your perspective on the field of data science and the bewildering — perhaps downright frightening — impact you and A.I. could make together on the world.
Jeremie:
• Recently co-founded Mercurius, an A.I. safety company.
• Has briefed senior political and policy leaders around the world on long-term risks from A.I., including senior members of the U.K. Cabinet Office, the Canadian Cabinet, as well as the U.S. Departments of State, Homeland Security and Defense.
• Is Host of the excellent Towards Data Science podcast.
• He previously co-founded SharpestMinds, a Y Combinator-backed mentorship marketplace for data scientists.
• He proudly dropped out of his quantum mechanics PhD to found SharpestMinds.
• He holds a Master’s in biological physics from the University of Toronto.
In this episode, Jeremie details:
• What Artificial General Intelligence (AGI) is.
• How the development of AGI could happen in our lifetime and could present an existential risk to humans, perhaps even to all life on the planet as we know it.
• How, alternatively, if engineered properly, AGI could herald a moment called the singularity that brings with it a level of prosperity that is not even imaginable today.
• What it takes to become an AI safety expert yourself in order to help align AGI with benevolent human goals.
• His forthcoming book on quantum mechanics.
• Why almost nobody should do a PhD.
Today’s episode is deep and intense, but as usual it does still have a lot of laughs, and it should appeal broadly, no matter whether you’re a technical data science expert already or not.
Clem Delangue on Hugging Face and Transformers
In today's SuperDataScience episode, Hugging Face CEO Clem Delangue fills us in on how open-source transformer architectures are accelerating ML capabilities. Recorded for yesterday's ScaleUp:AI conference in NY.
How to Rock at Data Science — with Tina Huang
Can you tell I had fun filming this episode with Tina Huang, YouTube data science superstar (293k subscribers)? In it, we laugh while discussing how to get started in data science and her learning/productivity tricks.
Tina:
• Creates YouTube videos with millions of views on data science careers, learning to code, SQL, productivity, and study techniques.
• Is a data scientist at one of the world's largest tech companies (she keeps the firm anonymous so she can publish more freely).
• Previously worked at Goldman Sachs and the Ontario Institute for Cancer Research.
• Holds a Master’s in Computer and Information Technology from the University of Pennsylvania and a bachelor’s in Pharmacology from the University of Toronto.
In this episode, Tina details:
• Her guidance for preparing for a career in data science from scratch.
• Her five steps for consistently doing anything.
• Her strategies for learning effectively and efficiently.
• What the day-to-day is like for a data scientist at one of the world’s largest tech companies.
• The software languages she uses regularly.
• Her SQL course.
• How her science and computer science backgrounds help her as a data scientist today.
Today’s episode should be appealing to a broad audience, whether you’re thinking of getting started in data science, are already an experienced data scientist, or you’re more generally keen to pick up career and productivity tips from a light-hearted conversation.
Thanks to Serg Masís, Brindha Ganesan and Ken Jee for providing questions for Tina... in Ken's case, a very silly question indeed.
Engineering Data APIs
How you design a data API from scratch and how a data API can leverage machine learning to improve the quality of healthcare delivery are topics covered by Ribbon Health CTO Nate Fox in this week's episode.
Ribbon Health is a New York-based API platform for healthcare data that has raised $55m, including from some of the biggest names in venture capital like Andreessen Horowitz and General Catalyst.
Prior to Ribbon, Nate:
• Worked as an Analytics Engineer at the marketing start-up Unified.
• Was a Product Marketing Manager at Microsoft.
• Obtained a mechanical engineering degree from the Massachusetts Institute of Technology and an MBA from Harvard Business School.
In this episode, Nate details:
• What APIs ("application programming interfaces") are.
• How you design a data API from scratch.
• How Ribbon Health’s data API leverages machine learning models to improve the quality of healthcare delivery.
• How to ensure the uptime and reliability of APIs.
• How scientists and engineers can make a big social impact in health technology.
• His favorite tool for easily scaling up the impact of a data science model to any number of users.
• What he looks for in the data scientists he hires.
Today’s episode has some technical data science and software engineering elements here and there, but much of the conversation should be interesting to anyone who’s keen to understand how data science can play a big part in improving healthcare.
GPT-3 for Natural Language Processing
With its human-level capacity on tasks as diverse as question-answering, translation, and arithmetic, GPT-3 is a game-changer for A.I. This week's brilliant guest, Melanie Subbiah, was a lead author of the GPT-3 paper.
GPT-3 is a natural language processing (NLP) model with 175 billion parameters that has demonstrated unprecedented and remarkable "few-shot learning" on the diverse tasks mentioned above (translation between languages, question-answering, performing three-digit arithmetic) as well as on many more (discussed in the episode).
Melanie's paper sent shockwaves through the mainstream media and was recognized with an Outstanding Paper Award from NeurIPS (the most prestigious machine learning conference) in 2020.
Melanie:
• Developed GPT-3 while she worked as an A.I. engineer at OpenAI, one of the world’s leading A.I. research outfits.
• Previously worked as an A.I. engineer at Apple.
• Is now pursuing a PhD at Columbia University in the City of New York specializing in NLP.
• Holds a bachelor's in computer science from Williams College.
In this episode, Melanie details:
• What GPT-3 is.
• Why applications of GPT-3 have transformed not only the field of data science but also the broader world.
• The strengths and weaknesses of GPT-3, and how these weaknesses might be addressed with future research.
• Whether transformer-based deep learning models spell doom for creative writers.
• How to address the climate change and bias issues that cloud discussions of large natural language models.
• The machine learning tools she’s most excited about.
This episode does have technical elements that will appeal primarily to practicing data scientists, but Melanie and I put effort into explaining concepts and providing context wherever we could, so hopefully much of this fun, laugh-filled episode will be engaging and informative to anyone who’s keen to learn about the state of the art in natural language processing and A.I.