Microsoft’s New AI VALL-E Only Needs 3 Seconds of Audio to Simulate Your Voice
Remember all of those sci-fi movies and TV shows where a character used a simple device to replicate a person’s voice? Well, that kind of technology might not remain in the realm of sci-fi for much longer. That’s because last week, Microsoft unveiled its latest text-to-speech AI model, VALL-E, which it claims can closely simulate a person’s voice from just a 3-second clip of their speech. And by simulate, they mean it can create synthesized audio of that person saying anything. It looks to be similar to the technology being employed to eternalize the voice of acting legend James Earl Jones, which we reported on last summer.
The creators of this new program believe that VALL-E could be used for what they call high-quality text-to-speech applications. Think of how recent deepfake videos make it seem as if celebrities are saying and doing things they never did. Here is how the program works, according to the paper released by Microsoft: “To synthesize personalized speech (e.g., zero-shot TTS), VALL-E generates the corresponding acoustic tokens conditioned on the acoustic tokens of the 3-second enrolled recording and the phoneme prompt, which constrain the speaker and content information respectively. Finally, the generated acoustic tokens are used to synthesize the final waveform with the corresponding neural codec decoder.”
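To make that pipeline a bit more concrete, here is a minimal structural sketch in Python. Microsoft has not released VALL-E’s code, so every function name below is a hypothetical stand-in, and the stubs return dummy data purely to trace the flow the paper describes: encode the 3-second enrollment clip into acoustic tokens, convert the target text into phonemes, generate new acoustic tokens conditioned on both, and decode them into a waveform.

```python
# A minimal structural sketch of the zero-shot TTS flow described above.
# All function names here are hypothetical stand-ins, not the real VALL-E
# API (which Microsoft has not released); each stub returns dummy data
# purely to trace the data flow.

from typing import List

def neural_codec_encode(waveform: List[float]) -> List[int]:
    """Stand-in for a neural audio codec encoder, which compresses a
    waveform into a sequence of discrete acoustic tokens."""
    return [0] * 225  # dummy tokens for a 3-second clip

def text_to_phonemes(text: str) -> List[str]:
    """Stand-in grapheme-to-phoneme conversion of the target text."""
    return list(text)  # dummy: one "phoneme" per character

def generate_acoustic_tokens(speaker_tokens: List[int],
                             phonemes: List[str]) -> List[int]:
    """Stand-in for VALL-E's language model, which generates acoustic
    tokens conditioned on the enrollment tokens (speaker identity) and
    the phoneme prompt (content)."""
    return [0] * (len(phonemes) * 8)  # dummy generated tokens

def neural_codec_decode(tokens: List[int]) -> List[float]:
    """Stand-in for the neural codec decoder, which synthesizes the
    final waveform from the generated acoustic tokens."""
    return [0.0] * (len(tokens) * 320)  # dummy audio samples

# Zero-shot pipeline: 3 seconds of enrolled audio + target text -> speech.
enrolled_clip = [0.0] * (3 * 24_000)               # 3 s of audio at 24 kHz
speaker_tokens = neural_codec_encode(enrolled_clip)
phonemes = text_to_phonemes("Any sentence you want the voice to say.")
acoustic_tokens = generate_acoustic_tokens(speaker_tokens, phonemes)
waveform = neural_codec_decode(acoustic_tokens)
```

The key idea the sketch captures is the split of responsibilities: the enrollment tokens constrain *who* is speaking, while the phoneme prompt constrains *what* is said.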
This wasn’t a solo venture by Microsoft: VALL-E’s speech-synthesis capabilities were trained on an audio library called Libri-light, assembled by fellow tech giant Meta. The library contains over 60,000 hours of English-language speech from more than 7,000 speakers, most of it pulled from the public domain audiobook site LibriVox. According to the paper, for VALL-E to achieve a close resemblance to a target voice, it needs a sample that closely matches a voice in its training data set. Though some may view this as a limitation, it simply means that the more voices VALL-E trains on, the greater its ability to mimic a target voice.
And given how much audio data humans produce in the digital age, the question isn’t whether that training pool will grow large enough, but when. What makes this program even more interesting is its ability to also imitate the acoustic environment of the sample audio. Take a phone call, for example: feed VALL-E audio from a phone call and it will reproduce the acoustic and frequency properties of a telephone in its output, making the result sound as if it came from a cellphone. It’s a fascinating piece of technology that could really reshape how data access is governed in the future.
For now, Microsoft has not released the VALL-E code for those outside the company to experiment with. In the conclusion of the paper, the team also wrote: “Since VALL-E could synthesize speech that maintains speaker identity, it may carry potential risks in misuse of the model, such as spoofing voice identification or impersonating a specific speaker. To mitigate such risks, it is possible to build a detection model to discriminate whether an audio clip was synthesized by VALL-E. We will also put Microsoft AI Principles into practice when further developing the models.”
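The paper doesn’t detail what such a detection model would look like, but here is a hypothetical sketch of the general idea: a binary classifier trained to separate genuine recordings from synthesized ones. Everything in it, from the toy features to the dummy training data, is an illustrative stand-in rather than Microsoft’s actual approach.

```python
# A hypothetical sketch of the kind of detection model the paper alludes
# to: a binary classifier that tries to separate genuine recordings from
# synthesized ones. The features, training data, and model choice below
# are illustrative stand-ins, not Microsoft's actual approach.

import numpy as np
from sklearn.linear_model import LogisticRegression

def extract_features(waveform: np.ndarray) -> np.ndarray:
    """Toy features: overall mean, variance, and eight spectral band
    energies. A real detector would use far richer representations."""
    spectrum = np.abs(np.fft.rfft(waveform))
    bands = np.array_split(spectrum, 8)
    return np.array([waveform.mean(), waveform.var(),
                     *[band.mean() for band in bands]])

# Dummy training data: label 0 = genuine recording, 1 = synthesized.
rng = np.random.default_rng(0)
real_clips = [rng.normal(0.0, 1.0, 24_000) for _ in range(50)]
fake_clips = [rng.normal(0.0, 0.8, 24_000) for _ in range(50)]
X = np.stack([extract_features(clip) for clip in real_clips + fake_clips])
y = np.array([0] * 50 + [1] * 50)

detector = LogisticRegression(max_iter=1000).fit(X, y)

# Score an unseen clip: a prediction of 1 means "likely synthesized".
unseen = rng.normal(0.0, 0.9, 24_000)
print(detector.predict(extract_features(unseen)[None, :]))
```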
Originally posted on OpenDataScience.com