In previous episodes of the SuperDataScience Podcast, such as #570, I’ve discussed DALL-E 2, a model by the research outfit OpenAI that creates stunningly realistic and creative images based on whatever text input your heart desires.
For today’s Five-Minute Friday episode, it’s my pleasure to introduce you to the Imagen Video model, published just a few weeks ago by researchers from Google.
First, let’s talk about the clever name: while pronounced “imagine” to allude to the creativity of the model and of the users who provide text prompts to it, the Imagen model name is a portmanteau of the words “image” and “generation”, which is rather sensible given that the model generates images. The original Imagen model, released earlier this year, generates still images, like the better-known but perhaps not better-performing DALL-E 2. The new Imagen Video model takes this generative capacity into another dimension, the dimension of time, by generating short video clips of essentially whatever you prompt it to generate.
For example, if you prompt Imagen Video to generate a video of “an astronaut riding a horse”, it will do precisely that. If you prompt Imagen Video to generate a video of “a happy elephant wearing a birthday hat walking under the sea”, well, then it will of course do precisely that too! In the show notes, we’ve provided a link to a staggering 4x4 matrix of videos created by Imagen Video that I highly recommend you check out to get a sense of how impressive this model is.
Under the hood, Imagen Video combines three separate components (I’ll sketch how they fit together right after this list):
First, a T5 text encoder, which is a transformer-based architecture that infers the meaning of the natural-language prompt you provide as input (check out episode #559 to hear more about transformers, which have become the standard for state-of-the-art results in natural language processing and, increasingly, in machine vision too). Interestingly, this T5 encoder is frozen during training of the Imagen Video model, so the T5 model weights are left unchanged by that training; T5’s natural language processing capabilities are thus used “out of the box” for Imagen Video’s purposes.
Second, a base diffusion model, which creates the basic frames of the video. This works similarly to the popular “autoencoder” architecture in that it deconstructs an image into an abstract representation (in the case of Imagen Video, this abstract representation looks like TV static) and then learns how to reconstruct the original image from that abstract representation. Critically, the base diffusion model of Imagen Video operates on multiple video frames simultaneously and then further improves the coherence across all the frames of the video using something called “temporal attention”. These innovations result in frames that fit together better than in some previous video-generation approaches, ultimately yielding a more coherent video clip.
Finally, interleaved spatial and temporal super-resolution diffusion models work together to upsample the basic frames created by the base diffusion model to a higher resolution. Since this stage involves working with high-definition frames, and therefore much more data, memory and computational cost are particularly important considerations. This final stage therefore leverages convolutions (a relatively simple operation that has become a standard in deep learning models for machine vision over the past decade) instead of the more complex temporal attention approach used by the base diffusion model.
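To make that three-stage cascade a bit more concrete, here is a minimal, purely illustrative sketch in Python. Every function name, shape, and number in it is a stand-in I’ve invented for explanation; it is not Google’s code or API. It only shows the data flow: a frozen text encoder produces an embedding, a base diffusion model turns noise into a rough low-resolution clip conditioned on that embedding, and a cascade of super-resolution stages upsamples the result.

```python
# Hypothetical sketch of Imagen Video's three-stage cascade.
# All names and shapes are illustrative stand-ins, not the real system.
import numpy as np

def t5_encode(prompt: str) -> np.ndarray:
    # Stand-in for the frozen T5 text encoder: maps the prompt to a
    # sequence of embedding vectors. (The real T5 weights are not
    # updated while Imagen Video trains.)
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng.standard_normal((len(prompt.split()), 512))

def base_video_diffusion(text_emb: np.ndarray, frames=16, size=24) -> np.ndarray:
    # Stand-in for the base diffusion model: start from pure noise
    # ("TV static") and iteratively "denoise" a low-resolution clip,
    # conditioned on the text embedding; in the real model, temporal
    # attention keeps the frames coherent with one another.
    video = np.random.standard_normal((frames, size, size, 3))
    for _ in range(10):        # placeholder denoising steps
        video = 0.9 * video    # each step nudges the clip toward the prompt
    return video

def super_resolve(video: np.ndarray, factor=2) -> np.ndarray:
    # Stand-in for the spatial/temporal super-resolution models: here we
    # upsample each frame by simple pixel repetition, where the real
    # models use convolutions to keep memory and compute manageable.
    return video.repeat(factor, axis=1).repeat(factor, axis=2)

text_emb = t5_encode("an astronaut riding a horse")
clip = base_video_diffusion(text_emb)   # low-resolution draft clip
for _ in range(3):                      # cascade of super-resolution stages
    clip = super_resolve(clip)
print(clip.shape)                       # e.g. (16, 192, 192, 3)
```

The real components obviously do far more work at each step, but the hand-off is the same: text embedding in, low-resolution draft clip out of the base model, and progressively higher-resolution video out of the super-resolution cascade.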
Now that you know how Imagen Video works, you might be dying to try it out yourself. Regrettably, Google hasn’t released the model or source code publicly due to concerns about explicit, violent, and harmful content that could be generated with it. Because of the sheer scale of natural language scraped from the Internet and then used to train T5 and Imagen Video, it’s difficult to comprehensively filter out problematic data, including data that reinforce social biases or stereotypes against particular groups.
Even though we can’t use Imagen Video ourselves, it is nevertheless a staggering development in the fields of natural language processing and creative artificial intelligence. Hopefully, forthcoming approaches can resolve the thorniest social issues presented by these models so that we can all benefit from innovations like this.
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.