The Chinchilla Scaling Laws dictate the amount of training data needed to optimally train a Large Language Model (LLM) of a given size. For Five-Minute Friday, I cover this ratio and the LLMs that have arisen from it (incl. the new Cerebras-GPT family).
Pandas for Data Analysis and Visualization
Today's episode is jam-packed with practical tips on using the Pandas library in Python for data analysis and visualization. Super-sharp Stefanie Molin — a bestselling author and sought-after instructor on these topics — is our guide.
Stefanie:
• Is the author of the bestselling book "Hands-On Data Analysis with Pandas".
• Provides hands-on pandas and data viz tutorials at top industry conferences.
• Is a software engineer and data scientist at Bloomberg, the financial data giant, where she tackles problems revolving around data wrangling/visualization and building tools for gathering data.
• Holds a degree in operations research from Columbia University as well as a master's in computer science, with an ML specialization, from Georgia Tech.
Today’s episode is intended primarily for hands-on practitioners like data analysts, data scientists, and ML engineers — or anyone who would like to be in a technical data role like these in the future.
In this episode, Stefanie details:
• Her top tips for wrangling data in pandas.
• The data viz circumstances in which you should use pandas, matplotlib, or Seaborn.
• Why everyone who codes, including data scientists, should develop expertise in Python package creation as well as contribute to open-source projects.
• The tech stack she uses in her role at Bloomberg.
• The productivity tips she honed by simultaneously working full-time, completing a master's degree, and writing a bestselling book.
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
Parameter-Efficient Fine-Tuning of LLMs using LoRA (Low-Rank Adaptation)
Large Language Models (LLMs) are capable of extraordinary NLP feats, but are so large that they're too expensive for most organizations to train. The solution is Parameter-Efficient Fine-Tuning (PEFT) with Low-Rank Adaptation (LoRA).
This discussion follows our recent introduction of models like Alpaca, Vicuña, GPT4All-J, and Dolly 2.0, which demonstrated the power of fine-tuning with thousands of instruction-response pairs.
Training LLMs, even those with tens of billions of parameters, can be prohibitively expensive and technically challenging. One significant issue is "catastrophic forgetting," where a model, after being retrained on new data, loses its ability to perform previously learned tasks. This challenge necessitates a more efficient approach to fine-tuning.
PEFT
By reducing the memory footprint and the number of parameters that need to be trained, PEFT methods like LoRA and AdaLoRA make it feasible to fine-tune large models on standard hardware. These techniques are also space-efficient — the fine-tuned weights they produce require only megabytes of storage — and they avoid catastrophic forgetting, perform better with small data sets, and generalize better to out-of-training-set instructions. They can also be applied to other A.I. use cases beyond NLP, such as machine vision.
LoRA
LoRA stands out as a particularly effective PEFT method. It involves inserting low-rank decomposition matrices into each layer of a transformer model. These matrices represent the weight updates in a lower-dimensional space, simplifying computational processing. The key to LoRA's efficiency is freezing all of the original model weights and training only the new low-rank matrices. This strategy reduces the number of trainable parameters by approximately 10,000 times and lowers the memory requirement for training by about three times. Remarkably, LoRA not only matches but sometimes even outperforms full-model fine-tuning, so this efficiency does not come at the cost of effectiveness — making LoRA an attractive option for fine-tuning LLMs.
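For a sense of how little code LoRA fine-tuning can require, here is a minimal sketch using Hugging Face's transformers and peft libraries (an assumption on my part — the episode doesn't prescribe a specific toolchain; the checkpoint name below is a placeholder, and the target module names depend on the architecture you're adapting):

```python
# Minimal LoRA sketch with the Hugging Face `peft` library (assumed toolchain).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

base_model = "your-org/your-llm"  # hypothetical checkpoint name

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

# LoRA: freeze every original weight and train only small low-rank adapter
# matrices injected into the attention projections.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                  # rank of the low-rank decomposition
    lora_alpha=16,                        # scaling applied to the adapter output
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # names vary by model architecture
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically a small fraction of 1% of weights
```

From here, the wrapped model can be passed to a standard training loop, and only the adapter weights — a few megabytes — need to be saved and shared.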
AdaLoRA
AdaLoRA, a recent innovation by researchers at Georgia Tech, Princeton, and Microsoft, builds on the foundations of LoRA. It differs by adaptively fine-tuning parts of the transformer architecture that benefit most from it, potentially offering enhanced performance over standard LoRA.
These developments in PEFT and the emergence of tools like LoRA and AdaLoRA mark an incredibly exciting and promising time for data scientists. With the ability to fine-tune large models efficiently, the potential for innovation and application in the field of AI is vast and continually expanding.
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
Taipy, the open-source Python application builder
An A.I. expert for nearly 40 years, Vincent Gosselin adores the field's lingua franca, Python. In today's episode, hear how he created the open-source Taipy library so you can easily build Python-based web apps and scalable, reusable data pipelines.
Vincent:
• Is CEO and Co-Founder of taipy.io, the company behind Taipy, an open-source Python library that works up and down the stack to easily build both web applications and back-end data pipelines.
• Having obtained his Master's in CS and A.I. from the Université Paris-Saclay in 1987, he’s amassed a wealth of experience across a broad range of industries, including semiconductors, finance, aerospace, and logistics.
• Has held roles including Director of Software Development at ILOG, Director of Advanced Analytics at IBM, and VP of Advanced Analytics at DecisionBrain.
Today’s episode will appeal primarily to hands-on practitioners who are keen to hear about how they can be accelerating their productivity in Python, whether it’s on the front end (to build a data-driven web-application) or on the back end (to have scalable, reusable and maintainable data pipelines). That said, Vincent’s breadth of wisdom — honed over his decades-long A.I. career — may prove to be fascinating and informative to technical and non-technical listeners alike.
In this episode, Vincent details:
• The critical gaps in Python development that led him to create Taipy.
• How much potential there is for data-pipeline engineering to be improved.
• How shifting toward lower-code environments can accelerate Python development without sacrificing any flexibility.
• The 50-year-old programming language that was designed for A.I. and that he was nostalgic for until Python emerged on the scene.
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
Open-source “ChatGPT”: Alpaca, Vicuña, GPT4All-J, and Dolly 2.0
Want a GPT-4-style model on your own hardware and fine-tuned to your proprietary language-generation tasks? Today's episode covers the key open-source models (Alpaca, Vicuña, GPT4All-J, and Dolly 2.0) for doing this cheaply on a single GPU 🤯
We begin with a retrospective look at Meta AI's LLaMA model, which was introduced in episode #670. LLaMA's 13-billion-parameter variant achieves performance comparable to GPT-3 while being significantly smaller and more manageable. This efficiency makes it possible to fine-tune and run the model on a single GPU, democratizing access to advanced AI capabilities.
The focus then shifts to four models that surpass LLaMA in terms of power and sophistication: Alpaca, Vicuña, GPT4All-J, and Dolly 2.0. Each of these models presents a unique blend of innovation and practicality, pushing the boundaries of what's possible with AI:
Alpaca
Developed by Stanford researchers, Alpaca is an evolution of the 7 billion parameter LLaMA model, fine-tuned with 52,000 examples of instruction-following natural language. This model excels in mimicking GPT-3.5's instruction-following capabilities, offering high performance at a fraction of the cost and size.
Vicuña
Vicuña, a product of collaborative research across multiple institutions, builds on both the 7 billion and 13 billion parameter LLaMA models. It's fine-tuned on 70,000 user-shared ChatGPT conversations from the ShareGPT repository, achieving GPT-3.5-like performance with unique user-generated content.
GPT4All-J
GPT4All-J, released by Nomic AI, is based on EleutherAI's open-source, 6-billion-parameter GPT-J model. It's fine-tuned on an extensive dataset of roughly 800,000 instruction-response pairs, making it an attractive option for commercial applications thanks to its open-source nature and Apache license.
Dolly 2.0
Dolly 2.0, from database giant Databricks, builds upon EleutherAI's 12-billion-parameter Pythia model. It's fine-tuned with 15,000 human-generated instruction-response pairs, offering another open-source, commercially viable option for AI applications.
These models represent a significant shift in the AI landscape, making it economically feasible for individuals and small teams to train and deploy powerful language models. With a few hundred to a few thousand dollars, it's now possible to create proprietary, ChatGPT-like models tailored to specific use cases.
The advancements in AI models that can be trained on a single GPU mark a thrilling era in data science. These developments not only showcase the rapid progression of AI technology but also significantly lower the barrier to entry, allowing a broader range of users to explore and innovate in the field of artificial intelligence.
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
Cloud Machine Learning
As ML models, particularly LLMs, have scaled up to having trillions of trainable parameters, cloud compute platforms have never been more essential. In today's episode, Hadelin and Kirill cover how data scientists can make the most of the cloud.
Kirill:
• Is Founder and CEO of SuperDataScience, an e-learning platform.
• Founded the SuperDataScience Podcast in 2016 and hosted the show until he passed me the reins in late 2020.
Hadelin:
• Was a data engineer at Google before becoming a content creator.
• Took a break from Data Science content in 2020 to produce and star in Bollywood productions.
Together, Kirill and Hadelin:
• Are the most popular data science instructors on the Udemy platform, with over two million students.
• Have created dozens of data science courses.
• Recently returned from a multi-year course-creation hiatus to publish their “Machine Learning in Python: Level 1” course as well as their brand-new course on cloud computing.
Today’s episode is all about the latter, so it will appeal primarily to hands-on practitioners like data scientists who are keen to be introduced to — or brush up on — analytics and ML in the cloud.
In this episode, Kirill and Hadelin detail:
• What cloud computing is.
• Why data scientists increasingly need to know how to use the key cloud computing platforms such as AWS, Azure, and the Google Cloud Platform.
• The key services that AWS, the most popular cloud platform, offers — particularly with respect to databases and machine learning.
*Note that it is a coincidence that AWS sponsored this show with a promotional message about their hardware accelerators. Kirill and Hadelin did not receive any compensation for developing content on AWS nor for covering AWS topics in this episode.
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
LLaMA: GPT-3 performance, 10x smaller
By training (relatively) small LLMs for (much) longer, Meta AI's LLaMA architectures achieve GPT-3-like outputs at as little as one-thirteenth of GPT-3's size. This means cost savings and much faster execution time.
LLaMA, a clever nod to LLMs (Large Language Models), is Meta AI's latest contribution to the AI world. Based on the Chinchilla scaling laws, LLaMA adopts a principle that veers away from the norm. Unlike its predecessors, which boasted hundreds of billions of parameters, LLaMA emphasizes training smaller models for longer durations to achieve enhanced performance.
The Chinchilla Principle in LLaMA
The Chinchilla scaling laws, introduced by Hoffmann and colleagues, postulate that, for a given compute budget, training smaller models on more data can lead to superior performance. LLaMA, with its 7-billion to 65-billion-parameter models, is a testament to this principle. For perspective, GPT-3 has 175 billion parameters, making the smallest LLaMA model just a fraction of its size.
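To make the ratio concrete, a rough rule of thumb from the Chinchilla paper is on the order of 20 training tokens per model parameter for compute-optimal training. The back-of-the-envelope sketch below applies that heuristic to the LLaMA sizes and to GPT-3 for contrast — treat the numbers as indicative only, since the exact ratio depends on the compute budget:

```python
# Back-of-the-envelope Chinchilla-style estimate: ~20 training tokens per
# parameter for compute-optimal training (a heuristic, not an exact law).
TOKENS_PER_PARAM = 20

for params in (7e9, 13e9, 65e9, 175e9):  # LLaMA sizes, plus GPT-3 for contrast
    optimal_tokens = TOKENS_PER_PARAM * params
    print(f"{params / 1e9:>5.0f}B params -> ~{optimal_tokens / 1e12:.1f}T tokens")
```

Note that LLaMA deliberately trains well past this compute-optimal point, trading extra training compute for a smaller model that is cheaper to run at inference time.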
Training Longer for Greater Performance
Meta AI's LLaMA pushes the boundaries by training these relatively smaller models for significantly longer periods than conventional approaches. It also contrasts with recent top models like Chinchilla, GPT-3, and PaLM, which relied on training data that was not publicly disclosed. LLaMA, however, uses entirely open-source data, including datasets like English Common Crawl, C4, GitHub, Wikipedia, and others, adding to its appeal and accessibility.
LLaMA's Remarkable Achievements
LLaMA's achievements are notable. The 13-billion-parameter model (LLaMA 13B) outperforms GPT-3 on most benchmarks, despite having roughly 13 times fewer parameters. This means LLaMA 13B can offer GPT-3-like performance on a single GPU. The largest LLaMA model, 65B, competes with giants like Chinchilla 70B and PaLM — and it arrived even before the release of GPT-4.
This approach signifies a shift in the AI paradigm – achieving state-of-the-art performance without the need for enormous models. It's a leap forward in making advanced AI more accessible and environmentally friendly. The model weights, though intended for researchers, have been leaked and are available for non-commercial use, further democratizing access to cutting-edge AI.
LLaMA not only establishes a new benchmark in AI efficiency but also sets the stage for future innovations. Building on LLaMA's foundation, models like Alpaca, Vicuña, and GPT4All have emerged, fine-tuned on thoughtful datasets to exceed even LLaMA's performance. These developments herald a new era in AI, where size doesn't always equate to capability.
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
Streaming, reactive, real-time machine learning
Real-time, reactive data processing and streaming machine learning: In today's episode, the positively brilliant researcher and entrepreneur Adrian Kosowski, PhD, fills us in on what the future of ML will be like.
Adrian:
• Is Co-Founder and Chief Product Officer at Pathway, a framework for real-time, reactive data processing that is based in Paris.
• Has over 15 years of research experience, including 9 years at Inria (a prestigious French computer science center), leading to the co-authorship of over 100 articles in a range of fields (theoretical computer science, physics, and biology) covering topics like network science, distributed algorithms and complex systems.
• Previously co-founded and led business development for Spoj.com, a competitive programming platform used by millions of software developers.
• Obtained his PhD in Computer Science at the ripe old age of 20.
Adrian has also generously offered to ship a Pathway hoodie (to anywhere in the world!) to the first ten commenters on this post who request one!
Today’s episode will appeal primarily to hands-on practitioners like data scientists, ML engineers, and data engineers. However, we do our best to break down technical terms and provide concrete examples of topics so that anyone can enjoy learning about the cutting edge in training ML models.
In this episode, Adrian details:
• What streaming data processing is and why it’s superior in many ways to the batch training of ML models that historically dominated data science.
• How streaming data processing allows efficient, real-time model training.
• How reactive data processing enables data applications to react instantly and automatically to never-before-seen input data, potentially saving firms vast sums.
• When a computer scientist should become a product leader.
• What programming languages Pathway selected for their platform & why.
• The big up-and-coming opportunity for data and ML start-ups.
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
GPT-4: Apocalyptic stepping stone?
The final episode in our trilogy on GPT-4 is on the risks posed by the model today and the potentially existential risks posed by the models it paves the way for. Our guest for this is Jeremie Harris, a world leader on A.I. safety.
Jeremie:
• Is co-founder of Gladstone AI, an advisor to US and Canadian government entities on A.I. risk.
• Co-hosts "Last Week in A.I.", the premier podcast on ML news.
• Wrote the new (released this week!) book "Quantum Physics Made Me Do It", which covers human consciousness and speculates on the future of A.I.
• Co-founded SharpestMinds, a Y Combinator-backed A.I.-career mentorship platform.
In today's episode, Jeremie details:
• How GPT-4 is a “dual-use technology” — capable of tremendous good but also of being wielded malevolently.
• How RLHF — reinforcement learning from human feedback — has made GPT-4 outputs markedly more aligned with the outputs humans would like to see, but how this doesn’t necessarily mean we’re in the clear with respect to A.I. acting in the broader interest of humans.
• Emerging approaches for how we might ensure A.I. is aligned with humans, not only today but — critically — as machines overtake human intelligence, the “singularity” event that may occur in the coming decades, or even in the coming years.
The SuperDataScience GPT-4 trilogy is comprised of:
• #666 (last Friday): a ten-minute GPT-4 overview by yours truly.
• #667 (Tuesday): world-leading A.I. monetization expert Vin Vashishta on the unprecedented commercial opportunity of GPT-4.
• #668 (today): GPT-4 risks
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
Harnessing GPT-4 for your Commercial Advantage
Episode two in our trilogy on GPT-4 is dedicated to how you can leverage GPT-4 to your commercial benefit. In it, I'm joined by Vin Vashishta — perhaps the best person on the planet for covering A.I. monetization.
Vin:
• Is Founder of V Squared, a consultancy that specializes in monetizing machine learning by helping Fortune 100 companies with A.I. strategy.
• Is the creator of a four-hour course on “GPT Monetization Strategy” which teaches how to build new A.I. products, startups, and business models with GPT models like ChatGPT and GPT-4.
• Is author of the forthcoming book “From Data To Profit: How Businesses Leverage Data to Grow Their Top and Bottom Lines”, which will be published by Wiley.
Today’s episode will be broadly appealing to anyone who’d like to drive commercial value with the powerful GPT-4 model that is taking the world by storm.
In this episode, Vin details:
• What makes GPT-4 so much more commercially useful than any previous A.I. model.
• The levels of A.I. capability that have been unleashed by GPT-4 and how we can automate or augment specific types of human tasks with these new capabilities.
• The characteristics that enable individuals and organizations to best take advantage of foundation models like GPT-4, enabling them to overtake their competitors commercially.
The SuperDataScience GPT-4 trilogy is comprised of:
• #666 (last Friday): a ten-minute GPT-4 overview by yours truly.
• #667 (today): GPT-4 commercial opportunities.
• #668 (this Friday): world-leading A.I.-safety expert Jeremie Harris joins me to detail the (existential!) risks of GPT-4 and the models it paves the way for.
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
One Million SuperDataScience Podcast Listens Per Quarter
We've cracked one million listens per quarter for the first time! No doubt buoyed by the mainstream A.I. fascination, but also thanks to our outstanding recent guests, our show had 1.06 million listens in Q1 2023 🍾
The chart shows episode downloads (on podcasting platforms) plus views (on YouTube) for each quarter since I took over as host of The SuperDataScience Podcast in January 2021.
Thank you for listening and providing thoughtful feedback on how we can improve the show. We have fantastic topics lined up for the coming weeks so I'm hopeful we can continue this growth trend in Q2. We're already off to a good start as the past week was — by some margin — the best week for listens in the show's history.
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
GPT-4 Has Arrived
SuperDataScience episode #666 — an appropriate number for an algorithm that has folks (quixotically) signing a letter to pause all A.I. development. In this first episode of the GPT-4 trilogy, I introduce GPT-4's staggering capabilities in ten minutes.
A Leap in AI Safety and Accuracy
GPT-4 marks a significant advance over its predecessor, GPT-3.5, in terms of both safety and factual accuracy. It is reportedly 82% less likely to respond with disallowed content and 40% more likely to produce factually correct responses. Despite improvements, challenges like sociodemographic biases and hallucinations persist, although they are considerably reduced.
Academic and Professional Exam Performance
The prowess of GPT-4 becomes evident when revisiting queries initially tested on GPT-3.5. Its ability to summarize complex academic content accurately and its human-like response quality are striking. In one test, GPT-4’s output was mistaken for human writing by GPTZero, an AI detection tool, underscoring its sophistication. In another, GPT-4 scored in the 90th percentile on the Uniform Bar Exam, a massive leap from GPT-3.5's 10th percentile.
Multimodality
GPT-4 introduces multimodality, handling both language and visual inputs. This capability allows for innovative interactions, like recipe suggestions based on fridge contents or transforming drawings into functional websites. This visual aptitude notably boosted its performance in exams like the Biology Olympiad, where GPT-4 scored in the 99th percentile.
The model also demonstrates proficiency in numerous languages, including low-resource ones, outperforming other major models in most languages tested. This linguistic versatility extends to its translation capabilities between these languages.
The Secret Behind GPT-4’s Success
While OpenAI has not disclosed the exact number of model parameters in GPT-4, it's speculated that they significantly exceed GPT-3's 175 billion. This increase, coupled with more and better-curated training data, and the ability to handle vastly more context (up to 32,000 tokens), are likely contributors to GPT-4's enhanced performance.
Reinforcement Learning from Human Feedback (RLHF)
GPT-4 incorporates RLHF, a method that refines the model's outputs using human feedback on which responses are preferable, aligning it more closely with the responses people actually want. This approach had already proven effective in previous models like InstructGPT.
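A useful way to see what RLHF optimizes is the reward model at its core, which is trained on human preference comparisons between pairs of candidate responses. Here is a minimal sketch of that pairwise preference loss, assuming PyTorch (the scalar rewards are hypothetical stand-ins for a reward model's outputs):

```python
# Pairwise preference loss for an RLHF reward model: the human-preferred
# ("chosen") response should receive a higher scalar reward than the
# "rejected" one. Loss = -log(sigmoid(r_chosen - r_rejected)).
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_rewards: torch.Tensor,
                      rejected_rewards: torch.Tensor) -> torch.Tensor:
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Hypothetical rewards for a batch of three (chosen, rejected) response pairs
chosen = torch.tensor([1.2, 0.4, 2.0])
rejected = torch.tensor([0.3, 0.9, 1.1])
print(reward_model_loss(chosen, rejected))  # lower loss = rankings better respected
```

The trained reward model then serves as the objective the language model is optimized against (typically with a reinforcement learning algorithm such as PPO), which is what nudges outputs toward the responses human labelers prefer.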
GPT-4 represents a monumental step in AI development, balancing unprecedented capabilities with improved safety measures. Its impact is far-reaching, offering new possibilities in various fields and highlighting the importance of responsible AI development and use. As we continue to explore its potential, the conversation around AI safety and ethics becomes increasingly vital.
The SuperDataScience GPT-4 trilogy is comprised of:
• #666 (today): an introductory overview by yours truly
• #667 (Tuesday): world-leading A.I.-monetization expert Vin Vashishta joins me to detail how you can leverage GPT-4 to your commercial advantage
• #668 (next Friday): world-leading A.I.-safety expert Jeremie Harris joins me to detail the (existential!) risks of GPT-4 and the models it paves the way for
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
MIT Study: ChatGPT Dramatically Increases Productivity
With all of this ChatGPT and GPT-4 news, I was wondering whether these generative A.I. tools actually result in the productivity gains everyone supposes them to. Well, wonder no more…
Astonishing CICERO negotiates and builds trust with humans using natural language
Meta AI's CICERO algorithm — which negotiates and builds trust with humans to perform in the top decile at the game of Diplomacy — is (in my view) the most astounding A.I. feat yet. Hear all about it from Alexander.
As published in the prestigious academic journal Science in November, CICERO is capable of using natural-language conversation to coordinate with humans, develop strategic alliances, and ultimately win in Diplomacy, an extremely complex board game.
Excelling in a game with incomplete information and vastly more possible states of play than games previously conquered by A.I., like chess and Go, would be a wild feat in and of itself, but CICERO’s generative capacity to converse and negotiate in real time with six other human players in order to strategize victoriously is the truly mind-boggling capability.
To detail for you how the game of Diplomacy works, why Meta chose to tackle this game with A.I., and how they developed a model that competes in the top decile of human Diplomacy players without any other players even catching a whiff that CICERO could possibly be a machine, my guest in today's episode is Alexander Holden Miller, a co-author of the CICERO paper.
Alex:
• Has been working in Meta AI’s Fundamental AI Research group, FAIR, for nearly eight years.
• Currently serves as a Senior Research Engineering Manager within FAIR.
• Has supported researchers working in most ML sub-domains but has been especially involved in conversational A.I. research and more recently reinforcement learning and planning.
• Holds a degree in Computer Science from Cornell University.
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
Designing Machine Learning Systems
Mega-bestselling author of the "Designing ML Systems" book, Chip Huyen, joined me to cover her top tips on, well, designing ML systems! ...as well as her burgeoning real-time ML startup. Can you tell we had a ton of fun?
Chip:
• Is Co-Founder of Claypot AI, a platform for real-time machine learning.
• Authored the book “Designing Machine Learning Systems”, which was published by O'Reilly Media and based on the Stanford University course she created and taught on the same topic.
• Also created and taught Stanford's “TensorFlow for Deep Learning” course.
• Previously worked as ML Engineer at data-centric development platform Snorkel AI and as a Senior Deep Learning Engineer at the chip giant NVIDIA.
• Runs an MLOps community on Discord with over 14k members.
• Has earned over 160k followers on LinkedIn thanks to her consistently helpful posts.
Today’s episode will probably appeal most to technical listeners like data scientists and ML engineers, but anyone involved in (or thinking of being involved in) the deployment of ML into real-life systems will learn a ton.
In this episode, Chip details:
• Her top tips for designing production-ready ML applications.
• Why iteration is key to successfully deploying ML models.
• What real-time ML is and the kinds of applications it’s critical for.
• Why Large Language Models like ChatGPT and other GPT series architectures involve limited data science ingenuity but do involve enormous ML engineering challenges.
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
Five Ways to Use ChatGPT for Data Science
Back in Episode #646, we focused on how anyone can extract commercial value from ChatGPT today — whether ye be a technical data science practitioner or not. In today’s episode, it’s exclusively the technical practitioners’ turn: I’ve got five specific ways that ChatGPT can be used for data science.
Use case #1 is code generation. While ChatGPT was designed primarily as a tool for generating natural language (in contrast, OpenAI’s Codex algorithm was designed explicitly for generating code — you can hear all about it in Episode #584), its friendly, conversational UI nevertheless comes in handy for rapidly generating code. And it can do so in all of the primary software languages for data science, including Python, R, and SQL. ChatGPT’s code is not always going to be perfect, but for quick ideas on how you could be extracting features from your data, implementing an algorithm, or creating a data visualization, ChatGPT is a great tool for getting started.
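If you'd rather script this than use the chat UI, here is a minimal sketch using the OpenAI Python client (an assumption on my part — the episode discusses the ChatGPT interface itself; the model name is a placeholder, and API details change over time, so check the current docs):

```python
# Minimal sketch: asking a GPT model for starter feature-engineering code
# via the OpenAI Python client (assumed; always review generated code).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = (
    "I have a pandas DataFrame with columns 'signup_date' and 'last_login'. "
    "Write Python code that adds 'days_since_signup' and 'days_inactive' features."
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # placeholder; use whichever model you have access to
    messages=[{"role": "user", "content": prompt}],
)

print(response.choices[0].message.content)
```

The same pattern works for the use cases that follow — translation, troubleshooting, library suggestions, and summarization are simply different prompts.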
Use case #2 is translating code between programming languages. Not only can ChatGPT convert your natural-language input into code, it can also translate between programming languages. So if you, for example, are an expert at Python but unfamiliar with an R code snippet you found online that you’d like to understand and implement in Python, you could ask ChatGPT to convert the R code into Python for you. Because ChatGPT has training data from many different programming languages, you can convert just about any unfamiliar code you come across into a familiar target programming language of your choice.
Use case #3 is code troubleshooting. Not only can ChatGPT help you with generating code, you can also use it to explain errors that you’re coming across and to suggest how to fix them. You can even ask ChatGPT to rewrite your code for you so that it’s bug-free.
Use case #4 is providing library suggestions. In Python or R, there are countless open-source libraries of code available to you. With ChatGPT, you can now quickly identify which library or libraries are best-suited to a particular task you’d like to perform with your code.
Finally, use case #5 is article summarization. A seemingly endless number of fascinating articles on machine learning innovations are published on arXiv each week. Poring through every article that interests you is likely to be impossible, but with ChatGPT you can instantly have articles summarized and key information extracted, making it much easier to stay on top of the latest data science developments.
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
Open-Source Tools for Natural Language Processing
In today's episode, the brilliant Vincent Warmerdam regales us with invaluable ideas and open-source software libraries for developing A.I. (particularly Natural Language Processing) applications. Enjoy!
Vincent:
• Is an ML Engineer at Explosion, the German software company that specializes in developer tools for A.I. and NLP such as spaCy and Prodigy.
• Is renowned for several open-source tools of his own, including Doubtlab.
• Is behind an educational platform called Calmcode that has over 600 short and conspicuously enjoyable video tutorials about software engineering concepts.
• Was Co-Founder and Chair of PyData Amsterdam.
• Has delivered countless amusing and insightful PyData talks.
• Holds a Master's in Econometrics and Operations Research from Vrije Universiteit Amsterdam (VU Amsterdam).
Today’s episode will appeal primarily to technical listeners as it focuses on ideas and open-source software libraries that are indispensable for data scientists, particularly those developing A.I. or NLP applications.
In this episode, Vincent details:
• The prompt recipes he developed to enable OpenAI GPT architectures to perform tremendously helpful NLP tasks such as data labeling.
• The super-popular open-source libraries he’s developed on his own as well as with Explosion.
• The software tools he uses daily including several invaluable open-source packages made by other folks.
• How linguistics and operations research are extremely useful fields for becoming a better NLP practitioner and ML practitioner, respectively.
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
SuperDataScience Podcast Audience Growth
Since I started hosting the SuperDataScience Podcast in Q1 of 2021, our audience has quadrupled, with episode downloads (plus YouTube views) now approaching one million per quarter. Thank you for listening!
I'm only a small part of the team required to release the high-quality episodes we do 104 times every year. The world-class people making the machine hum along behind the scenes are:
• Ivana Zibert: Podcast Manager
• Natalie Ziajski: Sales, Marketing, and my personal Operations Manager
• Mario Pombo: Audio & Video Production
• Serg Masís: Research
• Dr. Zara Karschay: Writer
• Sylvia Ogweng: Writer
• Kirill Eremenko: Founder, Co-Owner, Former Host
These people all rock and you rock for your support too! Armed with your invaluable ongoing feedback on episodes, I hope I can continue to learn what resonates most with you and that this growth can keep going. It's a great honor to serve you, our wonderful guests, and our episode sponsors.
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
How to Build Data and ML Products Users Love
What makes people latch onto data products and come back for more? In today's episode, Brian T. O'Neill unveils the processes and teams that make data and A.I. products engaging and sticky for users.
Brian:
• Founded and runs Designing for Analytics, a consultancy that specializes in designing analytics and ML products so that they are adopted.
• Hosts the "Experiencing Data" podcast, an entertaining show that covers how to use product-development methodologies and UX design to drive meaningful user and business outcomes with data.
In today's episode, Brian details:
• What data product management is.
• Why so many data projects fail.
• How to develop machine learning-powered products that users love.
• The teams and skill sets required to develop successful data products.
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
How to Learn Data Engineering
As data sets continue to grow exponentially, Data Engineering skills become increasingly essential — standalone or as part of Data Scientists' expertise. In today's episode, Andreas Kretz details how to learn data engineering.
Andreas:
• Is the Founder of Learn Data Engineering, a platform through which he’s taught over a thousand students the theory and practice of data engineering.
• Has provided countless more folks with data engineering tips and tricks through his YouTube channel, which has over 10,000 subscribers.
• Worked for ten years at the German industrial giant Bosch, including as a data engineering team lead and data lab team lead.
• Holds a Computer Science degree from the Technical University of Applied Sciences Würzburg-Schweinfurt (THWS).
• With over 100,000 followers on LinkedIn, has twice been recognized as a Top Voice for Data Science and Analytics on the platform.
Today’s episode will appeal primarily to technical listeners, particularly data scientists who are keen to develop ever-more-critical data engineering skills.
In this episode, Andreas details:
• What data engineering is and how it relates to adjacent fields like data science, software engineering, and machine learning engineering.
• Why data engineering skills become increasingly essential to data scientists and data analysts with each passing year.
• What sets Senior Data Engineers apart from junior ones.
• His general process for tackling data engineering problems.
• The must-know data-engineering tools of today as well as the emerging ones you shouldn’t miss.
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.