Today we're diving into the techno-optimistic vision of Dario Amodei, the CEO of Anthropic. Published in October, Dario’s 15,000-word article, Machines of Loving Grace: How AI Could Transform the World for the Better, is an exciting read particularly if you’re bringing data science and machine learning to life.
Filtering by Category: YouTube
PyTorch Lightning, Lit-Serve and Lightning Studios, with Dr. Luca Antiga
Lightning AI makes tons of tools that speed A.I. model dev and deployment, including the wildly popular open-source library PyTorch Lightning. Today, hear from hands-on CTO Dr. Luca Antiga how all the magic happens ⚡️
More on Luca:
CTO of Lightning AI, which (as one of the world’s hottest startups developing A.I. tools) has raised over $80m in venture capital.
Is also CTO of OROBIX, an A.I. services company that Luca co-founded 15 years ago.
Holds a PhD in biomedical engineering from Politecnico di Milano… and did his postdoc at the Robarts Research Institute in London, Ontario (coincidentally around the same time I was doing brain-imaging research there).
Today’s episode will probably appeal most to hands-on practitioners like data scientists, software developers and ML engineers, but any tech-savvy professional could find it valuable.
In today’s episode, Luca details:
How Lightning AI's suite of tools (in addition to PyTorch Lightning, this includes Lightning Studios, LitServe and the Thunder Compiler) is making A.I. development faster and easier.
The rise of small language models and their potential to rival LLMs.
His journey from biomedical imaging to deep learning pioneer.
How software developers’ work will be transformed by A.I. in the coming years.
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
The “A.I.” Nobel Prizes (in Physics and Chemistry??)
A.I. was center stage at the 2024 Nobel Prizes, with Demis Hassabis sharing the Chemistry prize and Geoff Hinton sharing the Physics prize. Chem and Physics seem weird for A.I. though, no? Today's episode explains.
Neuroscience Fueled by ML, with Prof. Bradley Voytek
Today's guest is the extraordinarily intelligent and well-spoken UC San Diego theoretical neuroscience professor, Bradley Voytek. He reveals how AI/ML is accelerating our understanding of the brain.
More on Brad:
• Professor in UC San Diego's Department of Cognitive Science, Data Science Institute, and the Neurosciences Graduate Program.
• Joined Uber as their first data scientist, when it was a 10-person startup, helping build their data science strategy and team.
• Outreach work has appeared in Scientific American, NPR... and Comic-Con!
• Co-authored the amusing book "Do Zombies Dream of Undead Sheep?"
Today’s episode has some brief exchanges that will appeal most to hands-on practitioners, but should overall be fascinating to anyone.
In today’s episode, Brad details:
• How large-scale data science and machine learning are accelerating neuroscience research.
• Discoveries his lab has recently made that overturn nearly a century of neuroscience doctrine.
• Insights on structuring data science education to balance technical skills with creative, practical problem-solving.
• Lessons from using data science to optimize Uber's early ride-prediction algorithms.
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
Are “Citizen Data Scientists” A Myth? With Keith McCormick
In a recent episode, Nick Elprin and I laughed that "citizen data scientists" don't exist. Keith McCormick joins me today to eloquently rebut us and demonstrate the clear value of low-code/no-code tools.
Keith is:
• Data Science Principal at the enterprise A.I. consultancy Further.
• Creator of dozens of LinkedIn Learning courses on machine learning and A.I. with, in aggregate, over a million students!
• Author of four statistics books.
Today’s short episode should be of interest to just about any listener. In it, Keith details:
• Common circumstances where low-code/no-code data science tools are the best option for you, even if you are a coding whiz.
• Whether citizen data scientists are myth or reality.
• How AutoML fits into the data science workflow - and why it won't replace data science teams.
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
Polars: Past, Present and Future, with Polars Creator Ritchie Vink
Because of its stunningly fast speed, Polars is an extremely popular open-source library for DataFrame operations in Python. Kinda unreal to have Ritchie Vink, Polars' creator, as today's guest!
Ritchie:
• Is CEO and Co-Founder of Polars, Inc., a startup that has raised $4m in seed funding to support his Polars open-source project.
• Previously worked as an ML Engineer, Data Scientist and Data Engineer at companies like adidas and KLM Royal Dutch Airlines.
• Holds a Master’s in Structural Engineering and worked as a civil engineer prior to catching the data-science bug.
Today’s episode will appeal most to hands-on practitioners like data scientists and ML engineers. In it, Ritchie details:
• How Polars regularly achieves 5-20x (sometimes 100x!) speed improvements over pandas for most DataFrame operations.
• The Eager and Lazy execution APIs Polars offers and when you should use one or the other.
• Ritchie's vision for scaling Polars to handle massive distributed datasets.
• How we can continue to make data-processing efficiency gains even as Moore's Law slows down.
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
In Case You Missed It in September 2024
Another month, another set of invaluable conversations on the SuperDataScience Podcast I host. ICYMI, today's episode highlights the most fascinating moments from September.
The specific conversation highlights included in today's episode are:
Posit PBC engineering manager Dr. Julia Silge explains why Positron, the next-generation IDE she's leading development of, is better-suited to data scientists than any existing IDE.
PyTorch expert Luka Anicin provides his top tips for training more accurate and compute-efficient ML models.
Exceptional open-source developer Marco Gorelli on why Polars is anywhere from 10 to 100x faster than pandas, the incumbent Python library for working with DataFrames.
Microsoft's Marck Vaisman on what companies hiring data scientists should be looking for... as opposed to what they typically (and mistakenly!) look for today.
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
Data Contracts: The Key to Data Quality, with Chad Sanderson
Before talking to Chad Sanderson, I had never heard of Data Contracts. Now, I'm a proponent of how critical they are for data quality within any platform. Listen in and you may become a proponent too!
Chad is our guest in today's episode. He's:
• An extremely smooth communicator of technical information.
• CEO and Co-Founder of Gable, a platform for data teams that has raised $7m in seed funding.
• Chief Operator of the non-profit Data Quality Camp.
• Author of the forthcoming O'Reilly book “Data Contracts”.
• Followed by over 80,000 people on LinkedIn alone, thanks to his informative social-media posts on Data Contracts.
Today’s episode will appeal most to folks who work with data hands-on or who are involved in management roles that oversee data flows. In it, Chad details:
• What data contracts are.
• The critical concept of "shifting left" in data quality and governance.
• How data debt accumulates and leads to "spaghetti" data architectures.
• Why data quality is fundamentally a change-management problem.
Thanks to Emily Pastewka for suggesting Chad as a guest on the show!
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
Virtual Humans and AI Clones, with Natalie Monbiot
Today, the clever and astoundingly well-spoken Natalie Monbiot provides a fascinating, mind-expanding episode on virtual humans, A.I. clones and the emerging virtual-human economy.
Natalie:
Is Head of Strategy and a Founding Team member of Hour One, a leader in virtual-human video generation that raised $20m in a Series A led by Insight Partners.
Through her own consultancy, EKLEKTIK, she advises virtual-human and A.I.-clone companies.
Regularly speaks at the world's largest conferences, including Web Summit and SXSW.
Holds a Master's in Languages and Literature from the University of Oxford.
Today's episode will be of interest to everyone. In it, Natalie details:
What virtual humans are.
How virtual humans will buy us time and unleash a virtual-human economy.
The ethical quandaries and challenges associated with creating virtual twins.
What distinguishes virtual humans from deep fakes.
(P.S.: This is the first time we've ever shot an episode with three video cameras... if you watch the video version, let me know if you think it's worth the extra effort and investment!)
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
NotebookLM: Jaw-Dropping Podcast Episodes Generated About Your Documents
Today’s episode topic is on Google’s newly-released (and frankly sensational) product NotebookLM. All you need is a Google login, which is as easy as having a Gmail account. Use of NotebookLM is likewise totally free.
The Skills You Need to Be an Effective Data Scientist, with Marck Vaisman
Based on extensive research and analytical evaluations, in today's episode Marck Vaisman details all the skills that are essential for today's data professional.
Marck:
• Has been at Microsoft for seven years; for 5+ years, he’s been a Senior Cloud Solutions Architect, specializing in data, data science and AI/ML.
• For nearly a decade he’s also been an adjunct professor at both Georgetown University and The George Washington University, teaching graduate-level courses on math, stats, analytics and decision sciences.
• Co-Founded a non-profit in Washington, DC that runs both the Data Science DC and Statistical Programming DC Meetups.
• Holds a Bachelor's in Mechanical Engineering from Boston University and an MBA from Vanderbilt University.
Today’s episode will be of interest to anyone who is, manages, or aspires to be a data professional.
In today’s episode, Marck details:
• The skills, competencies and personas that data scientists and related professionals (such as analysts, data engineers, ML engineers and A.I. engineers) can have.
• The academic research on why “data scientist” is such a difficult job title to define.
• A comprehensive characterization of the essential skills that every data professional needs to be effective and the skills that allow you to specialize as a particular subtype of data scientist.
• The implications of all of this for both folks hunting for a data role and the companies that are looking to hire them.
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
OpenAI's o1 "Strawberry" Models
Today’s episode topic, given the gravity of the event, could of course be none other than OpenAI’s new o1 series of models, which represent a tremendous leap forward in AI capabilities.
PyTorch: From Zero to Hero, with Luka Anicin
Today's episode is on Python's most popular auto-differentiation library, PyTorch, and how you can use it to design, train and deploy deep neural nets, including LLMs. Acclaimed PyTorch instructor Luka Anicin is our guide.
Luka:
Is one of Udemy’s all-time bestselling instructors on A.I.; over 500,000 students have taken his courses.
His latest course, available exclusively at SuperDataScience.com, is called “PyTorch: From Zero to Hero”.
CEO of full-lifecycle A.I. consultancy Datablooz.
Holds a Bachelor’s in Computer Science, a Master’s in Data Science and is nearing completion of his PhD in Applied A.I.
Today’s episode will probably appeal most to hands-on practitioners like data scientists, software developers and ML engineers.
In it, Luka details:
What the popular Python library PyTorch is for.
Why you would select PyTorch over TensorFlow or Scikit-learn.
The tensor building blocks PyTorch provides for designing, training and deploying state-of-the-art deep neural networks, including Large Language Models (LLMs).
His top tips for accurate and efficient deep learning.
Guidance on PyTorch portfolio projects.
Real-world PyTorch case-studies from his experience leading an A.I. consultancy.
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
The Positron IDE, Tidy NLP and MLOps with Dr. Julia Silge
Prepare to have your brain tickled by Dr. Julia Silge. In today's episode, Julia details the IDE she's been developing for data scientists, "Tidy" NLP, and open-source libraries that make MLOps a breeze.
More on Julia:
• Engineering Manager at Posit PBC (makers of RStudio... and the company formerly known as RStudio).
• Authored the bestselling O'Reilly books “Text Mining with R” and “Tidy Modeling with R”.
• Previously worked as a Data Scientist at Stack Overflow and Datassist.
• Prior to joining industry, was an academic researcher and professor at Yale University.
• Holds a PhD in Astronomy from The University of Texas at Austin.
Today’s episode will probably appeal most to hands-on practitioners like data scientists, software developers and ML engineers. In it, Julia details:
• The brand-new IDE Positron (free to use and source-available) that she’s been developing.
• Her favorite LLMs for code generation.
• The open-source software libraries that make MLOps easy.
• Her top tips for effective Natural Language Processing, including when more traditional NLP techniques should be used instead of an LLM.
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
Explaining AGI to a 94-Year-Old
In today's short episode, I explain "data", "data science", "A.I." and AGI to a 94-year-old woman (my brilliant grandmother) who previously had no familiarity with the terms.
Perhaps the episode will be helpful to folks who are unfamiliar with any of these terms themselves, or to folks who'd like ideas for how to explain any of them to laypeople.
("AGI" is Artificial General Intelligence, btw!)
The "Super Data Science Podcast with Jon Krohn" is available on your favorite podcasting platform and the video version (which today is simply an audio waveform!) is on YouTube. Today's episode is #816.
DataFrame Operations 100x Faster than Pandas, with Marco Gorelli
Today's episode is all about Polars — the hot library for Python that offers up to 100x speedups for DataFrame operations relative to pandas. Marco Gorelli, a core Polars developer, is our gifted guide.
Marco is a tremendously talented communicator of complex technical topics, making him the perfect guest for this highly technical episode. He:
• Is a core developer of the popular Python libraries pandas and Polars.
• Is the creator of the Narwhals library.
• Has spoken at several major Python conferences (such as PyData), taught Polars professionally, and written the first complete Polars plugins tutorial.
• Currently works as Senior Software Engineer at Quansight Labs.
• Previously, worked as a data scientist and was one of the prize winners (from amongst >100,000 entrants!) of the M6 forecasting competition.
• Holds a Master’s in Mathematics and the Foundations of Computer Science from the University of Oxford.
Today’s episode will appeal primarily to hands-on technical folks like data scientists, ML engineers and software developers.
In today’s episode, Marco details:
• What the hot, fast-growing Polars library for working with DataFrames in Python is (it already has 65m downloads and 28k GitHub stars).
• How Polars offers up to 100x speed-ups relative to pandas on DataFrame operations.
• How the lightweight, dependency-free Narwhals package he created allows for easy compatibility between different DataFrame libraries such as Polars and pandas.
• How he got addicted to open-source development.
• The simple trick he used to be a prize-winner in super-popular forecasting competitions.
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
Summer Reflections
This week, I’m enjoying the tail end of the northern-hemisphere summer by spending time with my family.
Solving Business Problems Optimally with Data, with Jerry Yurchisin
For many real-world commercial problems, the best approach is not machine learning or statistics; it's Mathematical Optimization. In today's episode, hear all about optimization from the guru Jerome Yurchisin.
Jerry's an extraordinarily clear communicator of complex topics and a world-leading expert on real-world applications of mathematical optimization. He:
• Works as a Data Science Strategist at Gurobi Optimization, a leading decision-intelligence company that provides mathematical optimization solutions to the likes of Uber, Air France and the NFL (indeed, a wild 8 out of 10 Fortune 10 companies use Gurobi!)
• Previously spent eight years as a mathematical consultant where he paired mathematical optimization with machine learning, statistics and simulation to inform decision-making.
• Was previously an instructor at the University of North Carolina at Chapel Hill, where he obtained his Master’s in Operations Research and Statistics.
• Holds an additional Master’s in Applied Math from Ohio University.
Today’s episode may appeal most to hands-on practitioners like data scientists and ML engineers, but it does also have tons of content that will be of interest to anyone who’d like to leverage data to make better commercial decisions or optimize commercial processes.
In this episode, Jerry details:
• What mathematical optimization is.
• The kinds of real-world problems where mathematical optimization is a far better approach than a machine learning or statistics approach.
• The history of mathematical optimization including why it wasn’t popular until recently.
• The cutting-edge hardware and software innovations in mathematical optimization today.
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
The AI Scientist: Towards Fully Automated, Open-Ended Scientific Discovery
A team of researchers from Sakana AI, a Japanese AI startup that was founded last year by Google alumni and was reportedly valued at over $1 billion in June, this week published a paper titled "The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery" that is making big waves and could revolutionize how we conduct scientific research.
Scaling Data Science Teams Effectively, with Nick Elprin
Today's episode with (extremely intelligent and wildly successful ML entrepreneur) Nick Elprin covers efficiently scaling data science teams and ensuring A.I. projects are commercial wins 🥇
Nick:
• Is Co-Founder and CEO of Domino Data Lab, a colossal Bay Area startup that has raised over $200m in venture capital from some of the world’s most prestigious VC firms.
• Prior to co-founding Domino Data Lab 11 years ago, he worked as a technologist at Bridgewater Associates, the well-known hedge fund.
• Holds both a BA and an MS in Computer Science from Harvard University.
Today’s episode may appeal most to technical folks but has tons of content that will be of interest to anyone in or interested in commercializing data science or A.I.
In this episode, Nick details:
• How organizations can leverage enterprise platforms to efficiently scale their data science teams and data science workflows.
• The exact team size at which integrating such a platform becomes worthwhile.
• How to ensure A.I. projects are commercially successful.
• The tech stack they use at Domino to create such a performant platform.
• His top tip for growing your own colossal data science startup.
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.