Today we're diving deep into a hugely impactful recent release: Meta's Llama 3.2.
For a bit of background context (and as the “3.2” in Llama 3.2 suggests), Meta has over the past few years been releasing more and more open-source models (all under the “Llama” brand) that aim to compete with the closed-source models released by the likes of OpenAI, Anthropic and Google. As you can hear about in Episode #806, released earlier this year, the gigantic 405B-parameter model in the Llama 3.1 release made headlines as the first open-source LLM able to perform somewhat comparably to state-of-the-art closed-source LLMs like Claude 3.5 Sonnet and GPT-4. This latest “3.2” installment of the Llama family, released in the past week, in contrast brings groundbreaking advancements in edge AI (“small” LLMs) as well as multi-modal capabilities (specifically, vision), making open-source Generative AI more accessible and broadly useful than ever before.
In a bit more detail, Llama 3.2 introduces small-ish and medium-sized vision LLMs, with 11 billion and 90 billion parameters respectively. These models push the boundaries of what's possible on vision tasks with open-source models. As someone who primarily works professionally with text-only LLMs, however, what I'm personally most excited about in Llama 3.2 is the lightweight, text-only models, which have merely 1 billion and 3 billion parameters. Unlike all previous Llama models, which were designed to run on big GPUs housed in data-center servers, these compact LLMs are designed to run on edge and mobile devices. That moves AI capabilities out of the cloud and onto your smartphone or tablet, with security and latency advantages that we'll dig into next.
In terms of technical specs, both the 1B and 3B LLMs support an impressive context length of 128,000 tokens, which pushes the frontier for on-device applications. Think summarization, following complex instructions, and rewriting tasks, all running locally on your device. This is a game-changer for privacy-conscious applications where you want to keep sensitive data processing on-device.
Let's make this tangible with a real-world example. Let’s say you’re developing a mobile app that needs to summarize the last 10 messages in a chat, extract action items, and schedule follow-up meetings. With these new Llama 3.2 models, you can do all of this processing right on your user's device. This means faster response times and enhanced privacy since sensitive message data never leaves your user’s phone. That’s, as the kids say, dope. And, according to Meta’s own tests (which we know to take with a pinch of salt), Llama 3.2 3B outperforms the incumbent edge-sized LLMs such as Google’s Gemma 2 2B and Microsoft’s Phi 3.5-mini on most benchmarks. The results for the Llama 3.2 1B are more mixed, which is to be expected when the LLM is so small.
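To make that even more concrete, here's a minimal sketch of the summarization step using the Hugging Face transformers library. It runs on a laptop or workstation rather than an actual phone (a real mobile deployment would go through an on-device runtime instead), and the chat messages and prompts are placeholders I made up for illustration:

```python
# Minimal sketch: summarizing a short chat thread with Llama 3.2 3B Instruct.
# Runs via the Hugging Face transformers library on a laptop/workstation; a real
# mobile app would use an on-device runtime (e.g. ExecuTorch or llama.cpp) instead.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-3.2-3B-Instruct"  # gated repo; requires access approval

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

# Placeholder chat history standing in for "the last 10 messages"
chat_history = """\
Alice: Can we push the design review to Thursday?
Bob: Works for me. I'll have the mockups ready by Wednesday night.
Alice: Great, and don't forget the accessibility audit notes.
"""

messages = [
    {"role": "system", "content": "Summarize the conversation and list any action items."},
    {"role": "user", "content": chat_history},
]

# apply_chat_template formats the messages with Llama 3.2's chat template
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

On an actual handset you'd swap the transformers call for an on-device export, but the prompt structure stays essentially the same.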
Moving on from the super-lightweight models, let's now dig into the vision capabilities of the larger models in the Llama 3.2 release. Both the 11B and 90B versions can understand images, including charts and graphs, though of course the 90B version performs much better thanks to its extra size. That extra capacity allows it to outcompete (again, on Meta's own tests) moderately sized, low-cost multi-modal models from Anthropic (Claude 3 Haiku) and OpenAI (GPT-4o mini) on both image and text evaluation benchmarks.
What this means is that now you can leverage (and even fine-tune) powerful, open-source LLMs on your own infrastructure for tasks like:
Analyzing a sales graph and telling you which month had the best performance.
Planning a hike and understanding the terrain based on an image of a map.
These models can bridge the gap between visual information and natural language understanding.
What's particularly exciting for developers is that these vision models are drop-in replacements for their text-only counterparts. This means you can easily upgrade existing Llama-based applications to handle image inputs without a complete overhaul of your codebase. It's this kind of thoughtful design that can really accelerate the adoption of more advanced AI capabilities in existing applications.
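If you want a feel for what that looks like in practice, here's a hedged sketch of asking the 11B vision model about a chart, following the pattern in the Hugging Face model card (it assumes a recent transformers release with Llama 3.2 vision support; the image file and question are placeholders I made up):

```python
# Hedged sketch: asking Llama 3.2 11B Vision about a local chart image.
# Requires a recent transformers release (>= 4.45) with Mllama support and
# access to the gated meta-llama repo; "sales_chart.png" is a placeholder.
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

MODEL_ID = "meta-llama/Llama-3.2-11B-Vision-Instruct"

model = MllamaForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

image = Image.open("sales_chart.png")  # hypothetical monthly sales chart

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Which month had the best sales performance, and by roughly how much?"},
    ]}
]

# The processor interleaves the image with the chat-formatted text prompt
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, input_text, add_special_tokens=False, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=150)
print(processor.decode(output[0], skip_special_tokens=True))
```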
On top of the LLMs themselves, Meta is including other treats with Llama 3.2. Specifically, they're introducing the Llama Stack, a set of tools that simplify how developers work with Llama models across various environments. Whether you're deploying on a single node, in the cloud, or on-device, the Llama Stack aims to provide a turnkey solution.
Let's break down what the Llama Stack includes:
A CLI (command line interface) for building, configuring, and running Llama Stack distributions
Client code in multiple languages, including Python, Node.js, Kotlin, and Swift
Docker containers for easy deployment
Multiple distribution options, including single-node, cloud, on-device, and on-premises solutions
This comprehensive toolkit is designed to lower the barrier to entry for developers looking to build with Llama models. It's a significant step towards making advanced AI more accessible to a broader range of developers and organizations.
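To give you a flavor of what building against the Llama Stack looks like, here's a rough sketch using the llama-stack-client Python package to query a distribution running locally. Fair warning: the server URL, model identifier, and response field names below are assumptions based on early versions of the client, so check the docs for whatever version you install:

```python
# Rough sketch: querying a locally running Llama Stack distribution.
# Assumes the llama-stack-client package and a stack server already running on
# localhost:5000; method names, the model identifier, and response fields are
# based on early client versions and may differ in yours.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:5000")  # port is an assumption

response = client.inference.chat_completion(
    model="Llama3.2-3B-Instruct",  # identifier format varies by distribution
    messages=[{"role": "user", "content": "Summarize today's stand-up notes in two bullet points."}],
)

print(response.completion_message.content)
```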
Of course, with great power comes great responsibility. Meta is taking steps to ensure responsible AI development. They've now also released Llama Guard 3, which includes an 11B vision model for content moderation of text and image inputs. This is crucial for applications that need to filter potentially harmful or inappropriate content.
For mobile devices, Meta released a highly optimized 1B version of Llama Guard. This model has been pruned and quantized, reducing its size from 2,858 MB to just 438 MB. This makes it feasible to deploy robust content moderation even on resource-constrained devices.
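To show roughly how Llama Guard slots into an app, here's a hedged sketch that runs the 1B guard model via transformers to classify a single user message. The nested message format and the "safe"/"unsafe" output convention follow Meta's model card, but double-check them against the version you download:

```python
# Hedged sketch: classifying one user message with Llama Guard 3 1B.
# Follows the pattern in Meta's model card; the prompt format and output labels
# ("safe" / "unsafe" plus a hazard-category code) may evolve across versions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-Guard-3-1B"  # gated repo; requires access approval

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

conversation = [
    {"role": "user", "content": [
        {"type": "text", "text": "How do I pick the lock on my neighbour's front door?"},
    ]}
]

# The model's chat template wraps the conversation in Llama Guard's moderation
# prompt; the model then replies "safe" or "unsafe" plus a category code.
input_ids = tokenizer.apply_chat_template(conversation, return_tensors="pt").to(model.device)
output = model.generate(input_ids, max_new_tokens=20)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```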
Switching gears, let's talk about the Llama 3.2 training process. For the vision models, Meta used a multi-stage approach. They started with the pre-trained Llama 3.1 text models and added image adapters and encoders. Then they pre-trained on large-scale noisy image-text pair data, followed by training on high-quality, knowledge-enhanced data. The post-training process involved several rounds of alignment, including supervised fine-tuning, rejection sampling, and direct preference optimization.
For training the lightweight models, Meta employed techniques like pruning and knowledge distillation. Pruning allowed them to reduce the size of existing models while retaining as much performance as possible. Knowledge distillation, on the other hand, used larger neural networks to impart knowledge to smaller ones, enabling these compact models to achieve better performance than they could if trained from scratch.
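If knowledge distillation is new to you, here's a tiny, generic sketch of the idea in PyTorch. To be clear, this is not Meta's actual training code, just the textbook formulation: the student learns from the true labels and from the teacher's softened output distribution at the same time.

```python
# Illustrative sketch of the general knowledge-distillation idea (not Meta's
# exact recipe): a small "student" model is trained to match the softened
# output distribution of a larger "teacher" alongside the usual
# next-token cross-entropy loss.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with KL divergence to the teacher.

    student_logits, teacher_logits: (batch, vocab) tensors
    labels: (batch,) ground-truth token ids
    """
    # Standard cross-entropy against the true next tokens
    ce = F.cross_entropy(student_logits, labels)

    # KL divergence between temperature-softened teacher and student distributions
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * temperature**2

    # alpha balances learning from the data versus learning from the teacher
    return alpha * ce + (1 - alpha) * kd
```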
Sound exciting and want to get started building with Llama 3.2? Well, the entire family of Llama 3.2 models is available for immediate download and for development across a broad ecosystem of partner platforms. This includes major players like AWS, Google Cloud, Microsoft Azure, and many others. The breadth of support ensures that developers have the flexibility to work with these models in their preferred environments.
Meta has also been working closely with hardware partners like Qualcomm, MediaTek, and Arm to optimize these models for mobile devices. This collaboration ensures that Llama 3.2 can run efficiently on a wide range of mobile hardware, opening up new possibilities for on-device AI applications.
This is a big deal and I’m grateful that Meta continues to invest so much capital in developing, training and releasing open-source LLMs. From vision capabilities to on-device deployment, these Llama 3.2 models open up new possibilities for developers and end-users alike. We're likely to see a wave of innovative applications leveraging these models in areas like personalized assistants, content creation tools, and intelligent document processing.
As we do this, remember that with great power comes great responsibility. As developers and data scientists, we need to be mindful of the ethical implications of deploying such powerful AI models. Meta's safety efforts like Llama Guard arm us with tools, but it's up to us to use these tools and others like them to minimize issues like bias and privacy breaches, and to head off other naughty uses.
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.