In episode #761, we detailed the public release of Google’s Gemini Ultra, the only LLM that is in the same class as OpenAI’s GPT-4 in terms of capabilities. Well, hot on the heels of that announcement comes the release of Gemini 1.5 Pro.
To recap quickly, Google’s Gemini family contains models of three sizes:
Nano is the smallest and intended for edge devices.
Pro is the mid-size model, intended for most use cases.
Ultra is the largest and intended for cutting-edge capabilities.
So, with that in mind, the first crazy thing about Gemini 1.5 Pro is that this mid-size model has capabilities comparable to the Gemini Ultra 1.0 that was released only a couple of weeks ago. If accurate (and my anecdotal experience with Gemini 1.5 Pro so far suggests it is), this means that the mid-size Gemini 1.5 Pro comes close to the capabilities of GPT-4 while, because it’s only a mid-size model, being faster and more affordable to use than either Gemini Ultra or GPT-4.
How did Google pull this feat off? They used the same Mixture-of-Experts (MoE) approach as OpenAI did for GPT-4, but evidently to more dramatic effect. The way MoE architectures work is that they consist of multiple specialized submodels, or “experts”: one of these submodels might specialize in, say, math, while another specializes in code and a third in literature. Depending on the input you provide, your request is routed to the most relevant of these expert submodels. It’s not known how many experts make up the Gemini 1.5 Pro architecture, but as a rough benchmark to aid your understanding of the MoE concept, GPT-4 is rumored to have eight. It’s also not public how Google is using the MoE approach so effectively, but as the company that originally published on the approach in 2017, it perhaps isn’t surprising that they’ve managed to overtake OpenAI on implementing it well.
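To make the routing idea more concrete, here’s a minimal, purely illustrative sketch in Python (NumPy). None of the sizes or names reflect Gemini’s actual architecture, which Google hasn’t published; it just shows a toy gating network scoring a handful of stand-in “experts” and sending the input to the top-scoring ones.

```python
# Toy illustration of Mixture-of-Experts routing (not Google's actual,
# unpublished implementation): a small gating network scores each expert
# for a given input, and only the top-scoring expert(s) handle the request.
import numpy as np

rng = np.random.default_rng(0)

NUM_EXPERTS = 8   # GPT-4 is rumored to use eight experts
EMBED_DIM = 16    # toy input-embedding size

# Each "expert" here is just a random linear layer standing in for a
# specialized submodel (math, code, literature, ...).
experts = [rng.normal(size=(EMBED_DIM, EMBED_DIM)) for _ in range(NUM_EXPERTS)]

# The gating network is a single linear layer producing one score per expert.
gate_weights = rng.normal(size=(EMBED_DIM, NUM_EXPERTS))

def moe_forward(x: np.ndarray, top_k: int = 1) -> np.ndarray:
    """Route the input embedding x to its top_k experts and mix their outputs."""
    scores = x @ gate_weights                      # one score per expert
    probs = np.exp(scores) / np.exp(scores).sum()  # softmax over experts
    top_experts = np.argsort(probs)[-top_k:]       # indices of the chosen experts
    # Only the selected experts run, which is why MoE models are cheaper to
    # serve than a dense model with the same total parameter count.
    return sum(probs[i] * (x @ experts[i]) for i in top_experts)

output = moe_forward(rng.normal(size=EMBED_DIM))
print(output.shape)  # (16,)
```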
Beyond that high level of capability, the second crazy thing about Gemini 1.5 Pro is that it has a million-token context window. For comparison:
The OpenAI model with the largest context window is GPT-4 Turbo, which has a 128k-token context window.
The foundation model with the largest context window, period, is Anthropic’s Claude 2.1. Its 200k-token context window is just a fifth of Gemini 1.5 Pro’s.
For reference, a context of a million tokens corresponds to about 700,000 words. Given that novels are seldom longer than 100,000 words, this means you could drop text the length of seven or more typical-length novels into Gemini 1.5 Pro and ask questions about them all at once. My friend Allie Miller demonstrated this capability by dropping eight quarters’ worth of Amazon shareholder reports and earnings call transcripts into Gemini 1.5 Pro, and the model provided insightful answers to questions like "What was an Amazon focus for 2022 that is weirdly absent from the 2023 shareholder calls and reports?"
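For the curious, the back-of-the-envelope arithmetic behind that “seven novels” figure looks like this; the words-per-token ratio is a rough rule of thumb for English text, not an exact property of Gemini’s tokenizer.

```python
# Rough arithmetic behind the "seven typical novels" claim.
CONTEXT_TOKENS = 1_000_000
WORDS_PER_TOKEN = 0.7       # rough heuristic for English; varies by text
WORDS_PER_NOVEL = 100_000   # typical upper bound for novel length

words_in_context = CONTEXT_TOKENS * WORDS_PER_TOKEN
print(f"~{words_in_context:,.0f} words "
      f"≈ {words_in_context / WORDS_PER_NOVEL:.0f} typical novels")
# ~700,000 words ≈ 7 typical novels
```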
Ok, implementing a huge context window is one thing, but is it just for show or does it really work? That is, how do you know a model is actually able to attend to the important information from across that huge window? Well, according to Google, they used a so-called “Needle In A Haystack” evaluation, wherein a small piece of text containing a particular fact is placed within a long block of text; Gemini 1.5 Pro was able to find the embedded text within a million-token input context 99% of the time. This attention over huge stretches of text allows Gemini 1.5 Pro to learn new skills from a long prompt; for example, you can provide the model with a grammar manual for a language that is outside its training data and it will be able to translate from English into that new language at a similar level to a human learning the same content. But an LLM like Gemini can do this learning many orders of magnitude more quickly than a person.
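If you wanted to run a rough needle-in-a-haystack check of your own, it could look something like the following sketch; the `ask_llm` helper is hypothetical and stands in for whichever long-context model you’re evaluating.

```python
# Minimal sketch of a "Needle In A Haystack" style check, assuming a
# hypothetical ask_llm(prompt) -> str helper that wraps the model under test.
import random

def build_haystack(needle: str, filler_sentence: str, n_sentences: int) -> str:
    """Bury the needle sentence at a random position inside filler text."""
    sentences = [filler_sentence] * n_sentences
    sentences.insert(random.randrange(n_sentences), needle)
    return " ".join(sentences)

needle = "The secret launch code for the coffee machine is 4417."
haystack = build_haystack(
    needle,
    filler_sentence="The quick brown fox jumps over the lazy dog.",
    n_sentences=50_000,  # scale this up toward the model's context limit
)

prompt = haystack + "\n\nWhat is the secret launch code for the coffee machine?"
# answer = ask_llm(prompt)   # hypothetical call to the long-context model
# print("4417" in answer)    # True if the needle was successfully retrieved
```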
As if all that I’ve mentioned so far wasn’t enough, the third crazy thing about Gemini 1.5 Pro is that it is multimodal. We’ve become accustomed in recent months to the multimodal capabilities of GPT-4V, which accepts both images and text as input, but Gemini 1.5 Pro goes a big step further by accepting audio or video as inputs too. With its million-token context window, this allows Gemini 1.5 Pro to be fed:
An hour of video
11 hours of audio
30,000 lines of code, or
Again, 700,000 words
You can watch my demo of Gemini 1.5 Pro, wherein I upload a 54-minute-long video and ask the model questions about the video’s contents (with initially mixed but ultimately incredible results), from the 7:03 to 13:49 minute marks.
To recap, Gemini 1.5 Pro:
Approximates GPT-4’s capabilities but is much smaller, faster and cheaper.
Has a million-token context window, five times the size of the input accepted by Claude 2.1, the closest contender on the context-window front.
Is multimodal, accepting text, code, images, audio and video as input.
Well, that’s the end of the crazy-feature exposition for today, but looking ahead a bit, Google also reports that the company has 10-million-token context-window models under development. That corresponds to a model that can be prompted with roughly ten hours of video, 100 hours of audio or 70 novels. It’s wild how exponential A.I. developments are! Hope you find it exciting and have the cogs turning in your mind on all the ways you could take advantage of these emerging capabilities at work, in products you’re developing or in your personal life.
If you can’t wait to get started with Gemini 1.5 Pro, you can access it now via Google AI Studio, while enterprises can access it via GCP’s Vertex AI. As of recording, by default you can only access a 128,000-token version of Gemini 1.5 Pro, but you can join the waitlist for the million-token version.
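If you’d rather hit the model programmatically than through the AI Studio interface, a minimal call via the google-generativeai Python SDK looks roughly like this; the model-name string is an assumption and may differ depending on your access tier and region.

```python
# Minimal sketch: calling Gemini 1.5 Pro via the google-generativeai SDK
# (pip install google-generativeai). The model name below is an assumption
# and may vary by access tier and region.
import google.generativeai as genai

genai.configure(api_key="YOUR_GOOGLE_AI_STUDIO_KEY")  # key from Google AI Studio

model = genai.GenerativeModel("gemini-1.5-pro-latest")  # assumed model identifier

response = model.generate_content(
    "Summarize the key differences between Gemini 1.0 Ultra and Gemini 1.5 Pro."
)
print(response.text)
```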
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.