Text-to-speech models take in text as an input (e.g., a sentence that you type out) and then output an audio waveform that sounds like a human reading out the sentence you provided as an input. TTS systems like this have been around for decades, but until the past few years the quality of the audio was not compellingly human-like. Five years ago, in 2018, Google stunned attendees at its Google I/O conference with an algorithm called Google Duplex that marked a step change in the quality of TTS: Initially capable of making restaurant reservations, Duplex sounded compellingly human-like because of its capacity to “um” and “uh” and stammer the way humans do when they engage in natural, unscripted conversation.
Earlier this month, Microsoft announced its text-to-speech model called VALL-E (spelled the same way as OpenAI’s popular DALL-E series of text-to-image models; I couldn’t find an explanation of why they called it VALL-E, but perhaps it’s related to the “uncanny valley” concept, which refers to humans’ unpleasant reaction to machines that closely mimic humans). Relative to baseline TTS models, VALL-E doesn’t produce game-changing audio quality of human voices; as Duplex illustrated, we’ve had human-level TTS for five years. What VALL-E does that is game-changing is that, in addition to a text prompt, you can also provide it with just three seconds of a recording of someone’s voice, and it will generate audio that is compellingly in the style of that person’s voice. Just three seconds!
To illustrate how cool and effective this new VALL-E model is, here are some examples of VALL-E outputting this sentence from James Fenimore Cooper’s classic historical-romance novel The Last of the Mohicans: “Notwithstanding the high resolution of Hawkeye, he fully comprehended all the difficulties and danger he was about to incur.”
So that sentence is provided to VALL-E as a typed prompt for it to generate as speech. Alongside the typed prompt, you also provide VALL-E with three seconds of someone speaking. So here’s one example:
And here’s VALL-E’s imitation of that speaker’s style, but speaking the Last of the Mohicans quote:
Pretty amazing, right? Here’s a second speaker’s style:
And now here’s VALL-E’s imitation of that style, again outputting the Mohicans quote:
Ha! And third time’s the charm. Here’s one final speaker style input:
And here again is VALL-E’s output of the Mohicans quote in the third speaker’s style:
To hear examples of how VALL-E performs relative to the previous state-of-the-art baseline, you can refer to the VALL-E demo page on GitHub.
Having heard these amazingly realistic and accurate outputs based on just a three-second sample of someone speaking, your next thought is likely to be how scary that is. If a scam artist has access to VALL-E and just a three-second clip of you speaking, they could use it to send fake recordings to a loved one or colleague of yours, or perhaps even generate responses in close to real time, convincing them that it’s you they’re dealing with. If you got a voicemail from your boss telling you to buy gift cards from an electronics store and to provide her with the unique gift card code on the back, would you do it? Well, maybe you’d be suspicious and phone her because you’re up to date on the state of the art in A.I., but a lot of people out there could be had by such a scam. As another example, if you received a voicemail from a loved one saying they were in prison and you needed to wire bail money to a specific crypto wallet… you yourself might be suspicious, but a lot of folks out there could be conned.
So certainly there are ethical concerns here, but this is a world we’re going to have to get used to. Generative A.I. capabilities across images, video, and audio are becoming increasingly compelling. Thankfully, technology does offer solutions. For example, while I wouldn’t recommend that you purchase NFT art, non-fungible tokens could be used to help verify that a media file was genuinely created by a trusted source.
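To make that verification idea a bit more concrete, here’s a minimal Python sketch of the digital-signature mechanism that provenance schemes like this (NFTs included) ultimately build on. It uses the open-source cryptography library rather than any blockchain, and the file contents are just a stand-in byte string, so treat it as an illustration of the concept rather than a production recipe.

```python
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# The trusted source generates a key pair once and publishes the public key.
private_key = Ed25519PrivateKey.generate()
public_key = private_key.public_key()

# At publication time, the source signs the raw bytes of the media file.
audio_bytes = b"stand-in for the raw bytes of a real audio file"
signature = private_key.sign(audio_bytes)

# Anyone holding the public key can later confirm the file is untampered and
# really came from the key holder; verify() raises InvalidSignature otherwise.
public_key.verify(signature, audio_bytes)
print("Signature checks out: the file matches what the trusted source published.")
```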
Okay, now that you know what VALL-E is as well as its implications, for you SuperDataScience super nerds out there, here are a few key points on how Microsoft trained and architected their VALL-E model: They used a hybrid model-training approach that blended supervised learning on 960 hours of labeled speech data with unsupervised learning on a much, much larger data set (more than 60x larger) of 60,000 hours of unlabeled training data from around 7,000 different human speakers. (This kind of hybrid machine learning approach allows us to take advantage of large unlabeled training data sets such as this.)
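To give a flavor of how a hybrid approach like that can work, here’s a toy PyTorch sketch that combines a supervised loss on a small labeled set with a pseudo-labeling loss on a much larger unlabeled set. The tiny linear models and random tensors are stand-ins for illustration only; this is a generic semi-supervised pattern, not Microsoft’s actual VALL-E training pipeline.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
student = nn.Linear(20, 5)  # stand-in for the model being trained
teacher = nn.Linear(20, 5)  # stand-in for a frozen model that supplies pseudo-labels
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# A small labeled set plus a much larger unlabeled set (here just random tensors).
labeled_x, labeled_y = torch.randn(64, 20), torch.randint(0, 5, (64,))
unlabeled_x = torch.randn(640, 20)

for step in range(100):
    optimizer.zero_grad()
    # Supervised loss on the small labeled batch.
    sup_loss = loss_fn(student(labeled_x), labeled_y)
    # Pseudo-labels for the unlabeled batch come from the frozen teacher.
    with torch.no_grad():
        pseudo_y = teacher(unlabeled_x).argmax(dim=1)
    unsup_loss = loss_fn(student(unlabeled_x), pseudo_y)
    # Down-weight the noisier pseudo-label term; 0.5 is an arbitrary choice here.
    (sup_loss + 0.5 * unsup_loss).backward()
    optimizer.step()
```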
In terms of model architecture, the VALL-E creators used a transformer architecture with 12 layers, 16 attention heads, and a 1024-dimensional embedding space. To train this large language model efficiently on the huge amount of data at their disposal, they used 16 NVIDIA Tesla V100 GPUs with 32 GB of memory each. For more details, you can check out the full paper via arXiv.
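For anyone who’d like to see those hyperparameters in code, here’s a generic PyTorch sketch of a transformer stack with 12 layers, 16 attention heads, and a 1024-dimensional embedding space. The feed-forward width of 4096 (four times the embedding size) is my own conventional assumption, and none of this is Microsoft’s actual VALL-E implementation.

```python
import torch
import torch.nn as nn

# Generic transformer encoder stack with the hyperparameters quoted above.
layer = nn.TransformerEncoderLayer(
    d_model=1024,          # 1024-dimensional embedding space
    nhead=16,              # 16 attention heads
    dim_feedforward=4096,  # assumed: 4x the embedding size, a common convention
    batch_first=True,
)
model = nn.TransformerEncoder(layer, num_layers=12)  # 12 layers

# A dummy batch of 2 sequences, each 100 tokens long, already embedded.
tokens = torch.randn(2, 100, 1024)
output = model(tokens)
print(output.shape)  # torch.Size([2, 100, 1024])
```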
If you’d like to learn more about the transformer architectures and attention mechanisms that VALL-E takes advantage of, coming up on March 1st, I’ll be hosting a virtual conference on natural language processing with large language models like BERT and the GPT series architectures. It’ll be interactive, practical, and it’ll feature some of the most influential scientists and instructors in the large natural language model space as speakers. It’ll be live on the O’Reilly platform, which many employers and universities provide access to; otherwise, you can grab a free 30-day trial of O’Reilly using our special code SDSPOD23. We’ve got a link to that code ready for you in the show notes.
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.