Today, I’m going to do my best to give you a five-minute update on a groundbreaking new open-source Large Language Model called Mixtral 8x22B out of an extremely hot French startup called Mistral.
Given the Mistral company name, you’d be right to assume that the Mixtral model name is related to it. The “mix” part of “Mixtral” comes from the fact that Mixtral 8x22B is a mixture-of-experts model consisting of eight 22-billion-parameter “expert” submodels. I’ve discussed this mixture-of-experts approach before in the context of OpenAI’s GPT-4, which is rumored to also consist of eight expert submodels, each of which specializes in a different kind of natural-language-generation task. It’s probably not as clear-cut as this in reality, but for the sake of an illustrative example, you could think of one expert submodel being called upon to handle code-generation tasks while another specializes in math-related tasks.
The big advantage of this mixture-of-experts approach is that, at inference time, when you actually use the model in production for real-world tasks, only a fraction of the full model needs to be used. Building on my caricature example and applying it to Mixtral 8x22B, if you ask a math question, a small part of the model is used to triage your prompt to the 22-billion-parameter math-expert submodel, while the other seven 22-billion-parameter submodels can remain unused. This means that, to provide a given response to your prompt, the Mixtral 8x22B LLM uses only 39 billion of its 141 billion total model parameters, saving roughly 70% of the compute cost and time relative to using all 141 billion parameters.
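If you like to see ideas in code, here’s a minimal, purely illustrative Python sketch of top-k expert routing using toy NumPy weights. It is not Mixtral’s actual implementation (in the real model, the experts are feed-forward sub-layers inside every transformer block, and each token is routed to its top two of eight experts), but it shows why only a fraction of the parameters are ever touched for a given input.

```python
import numpy as np

# Illustrative top-k mixture-of-experts routing with toy dimensions.
NUM_EXPERTS = 8
TOP_K = 2
HIDDEN_DIM = 16  # toy hidden size, purely for demonstration

rng = np.random.default_rng(0)
router_weights = rng.standard_normal((HIDDEN_DIM, NUM_EXPERTS))
experts = [rng.standard_normal((HIDDEN_DIM, HIDDEN_DIM)) for _ in range(NUM_EXPERTS)]

def moe_forward(token_vec: np.ndarray) -> np.ndarray:
    """Route one token through only its top-k experts and mix their outputs."""
    logits = token_vec @ router_weights                  # router scores, shape (NUM_EXPERTS,)
    top_k = np.argsort(logits)[-TOP_K:]                  # indices of the best TOP_K experts
    gates = np.exp(logits[top_k]) / np.exp(logits[top_k]).sum()  # softmax over chosen experts
    # Only TOP_K of the NUM_EXPERTS expert networks are evaluated here,
    # which is where the inference-time savings come from.
    return sum(g * (token_vec @ experts[i]) for g, i in zip(gates, top_k))

output = moe_forward(rng.standard_normal(HIDDEN_DIM))
print(output.shape)  # (16,)
```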
Previously, Mistral released a 7-billion-parameter model that outperforms larger leading open-source LLMs such as Meta’s Llama 13B and 33B models on the most popular LLM benchmark, MMLU (Massive Multitask Language Understanding). More recently, Mistral released their first mixture-of-experts model, which consists of eight 7-billion-parameter submodels; it eclipsed the much more expensive-to-run Llama 70B to become the most capable open-source LLM yet according to the MMLU benchmark.
With the Mixtral 8x22B release last week, Mistral outdoes their own 8x7B mixture-of-experts model to set new high-water marks across:
- All major natural-language common-sense, reasoning and knowledge benchmarks.
- Multilingual benchmarks in major non-English languages, including French, German, Spanish and Italian.
- Coding benchmarks.
- Math benchmarks.
Unlike other so-called “open-source” models like Llama 2, which have restrictions on their use, Mixtral 8x22B is released under an Apache 2.0 license, one of the most permissive open-source licenses: it allows anyone anywhere to use, modify and commercialize Mixtral 8x22B with essentially no restrictions. It also has a pretty darn solid context window of 64k tokens, and it is natively capable of calling functions, so it can convert natural-language requests into API calls inside a software application, allowing developers to dramatically modernize their software by letting it respond to natural-language requests from users.
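To make the function-calling idea a bit more concrete, here’s a hypothetical Python sketch of the typical flow: the application describes its functions to the model as a JSON schema, the model responds with a structured tool call rather than prose, and the application dispatches it. The tool name, schema format and `handle_model_response` helper below are illustrative assumptions, not any particular client library’s API.

```python
import json

# Hypothetical tool definition the model would be shown alongside the user's prompt.
TOOLS = [{
    "name": "get_order_status",          # hypothetical function in your application
    "description": "Look up the status of a customer order.",
    "parameters": {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    },
}]

def get_order_status(order_id: str) -> str:
    # Stand-in for a real database or API lookup.
    return f"Order {order_id} shipped yesterday."

def handle_model_response(model_output: str) -> str:
    """Dispatch a tool call that the model returned as JSON."""
    call = json.loads(model_output)
    if call["name"] == "get_order_status":
        return get_order_status(**call["arguments"])
    raise ValueError(f"Unknown tool: {call['name']}")

# Pretend the model saw "Where is my order 42?" plus TOOLS and replied with:
simulated_model_output = '{"name": "get_order_status", "arguments": {"order_id": "42"}}'
print(handle_model_response(simulated_model_output))
```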
All in all, Mixtral 8x22B is an important release from Mistral for data scientists and software developers all over the world, delivering state-of-the-art open-source LLM performance across many natural human languages, coding and math-related text-generation tasks, all while being less expensive to run in production than the models like Llama 70B that it overtook. While LLM benchmarks should not be trusted on their own, the anecdotal response of Mixtral 8x22B users online suggests it lives up to its quantitative hype. You can download the model today and adapt it to your own personal or professional uses.
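If you want to kick the tires yourself, a minimal sketch with Hugging Face’s transformers library might look like the following. The exact Hub model id is an assumption (check Hugging Face for the current name), and the full 141-billion-parameter checkpoint needs substantial GPU memory, or a quantized variant, to run.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x22B-Instruct-v0.1"  # assumed Hub id; verify before use
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Generate a short completion from a simple prompt.
inputs = tokenizer("Summarize what a mixture-of-experts model is.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```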
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.