A few months ago, OpenAI developed a voice-cloning AI that was too accurate to release to the public. Now, Microsoft has taken a similar step. History shows us that it's only a matter of time before someone with malicious intent exploits this technology.

In a paper posted to arXiv, Microsoft researchers claim that VALL-E 2 can generate "precise and natural speech with the exact voice of the original speaker, comparable to human performance." In other words, its creators say the new AI voice generator is convincing enough to be mistaken for a real person.

"VALL-E 2 is the latest advancement in neural codec language models that marks a milestone in zero-shot text-to-speech (TTS) synthesis, achieving human parity for the first time," the researchers write in the paper. "Additionally, VALL-E 2 consistently synthesizes high-quality speech, even for sentences that are traditionally challenging due to their complexity or repetitive phrases."

Microsoft evaluated VALL-E 2 on the LibriSpeech and VCTK datasets, where it scored highly. When the company claims the AI tool achieves human parity, it means that VALL-E 2 matched or outperformed real recordings in terms of robustness, speaker similarity, and naturalness. In other words, the tool can produce natural speech virtually identical to the original speaker.

Microsoft has shared dozens of VALL-E 2 samples on the project's summary page. The samples are strikingly realistic and hard to tell apart from a human speaker. VALL-E 2 even handles subtleties like placing emphasis on the correct word in a sentence, just as people do when speaking.

How VALL-E 2 Works

Building on the original VALL-E, Microsoft's new AI voice tool adds two significant improvements. The first, grouped code modeling, partitions the codec codes into groups so that the model works with a shorter sequence. This speeds up inference and eases the problems associated with modeling long sequences.
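To make the grouping idea concrete, here is a minimal sketch in Python. It is not Microsoft's implementation: the function name `group_codes`, the `pad_id` value, and the choice to pad the last group are all assumptions made for illustration. It only shows how partitioning a flat code sequence into fixed-size groups shortens the sequence the model has to step through.

```python
from typing import List


def group_codes(codes: List[int], group_size: int, pad_id: int = 0) -> List[List[int]]:
    """Partition a flat sequence of codec codes into consecutive groups.

    Hypothetical helper: padding the final group with pad_id is an
    assumption of this sketch, not something the article specifies.
    """
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    remainder = len(codes) % group_size
    if remainder:
        # Pad so that every group has exactly group_size codes.
        codes = codes + [pad_id] * (group_size - remainder)
    return [codes[i:i + group_size] for i in range(0, len(codes), group_size)]


# Seven made-up codec codes become four groups of two,
# so a model would take four steps instead of seven.
codes = [11, 42, 42, 7, 93, 5, 18]
groups = group_codes(codes, group_size=2)
print(len(codes), "codes ->", len(groups), "groups")
```

With a group size of 2, the sequence length is roughly halved, which is where the inference speed-up would come from.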

The second, repetition-aware sampling, refines the original nucleus sampling process by accounting for how often a token has already repeated during decoding. This helps stabilize decoding and avoids the infinite loop issue present in the original VALL-E.
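The idea can be sketched roughly as follows. This is an illustrative reading, not the researchers' code: it assumes that a token is first drawn by nucleus sampling, and that if the token already appears too often in a recent window of the decoding history, it is redrawn from the full distribution instead. The function names, the window size, and the threshold values are hypothetical.

```python
import random
from typing import List, Optional, Sequence


def nucleus_sample(probs: Sequence[float], top_p: float, rng: random.Random) -> int:
    """Sample from the smallest set of top tokens whose total probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept: List[int] = []
    total = 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total >= top_p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


def repetition_aware_sample(
    probs: Sequence[float],
    history: Sequence[int],
    top_p: float = 0.8,      # assumed value
    window: int = 10,        # assumed value
    threshold: float = 0.1,  # assumed value
    rng: Optional[random.Random] = None,
) -> int:
    """Nucleus sampling with a fallback when the chosen token repeats too often."""
    rng = rng or random.Random()
    token = nucleus_sample(probs, top_p, rng)
    recent = list(history[-window:])
    if recent:
        ratio = recent.count(token) / len(recent)
        if ratio > threshold:
            # Too repetitive: redraw from the full distribution so that
            # low-probability tokens get a chance to break the loop.
            token = rng.choices(range(len(probs)), weights=list(probs), k=1)[0]
    return token


# Toy example: token 0 dominates and the history is already stuck on it.
probs = [0.9, 0.05, 0.05]
stuck_history = [0] * 10
print(repetition_aware_sample(probs, stuck_history, top_p=0.5, rng=random.Random(0)))
```

In the toy example, plain nucleus sampling with `top_p=0.5` could only ever return token 0. The fallback is what lets the decoder occasionally pick another token and escape the loop.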

According to Microsoft, VALL-E 2 is a research project with no immediate plans to incorporate the technology into a consumer product or make it available to the public. The company also acknowledges the potential risk of misuse, such as impersonating a specific person or falsifying voice identification.

Despite these risks, Microsoft believes that VALL-E 2 could have applications in education, translation, accessibility, and chatbots, among others. The benefits of this technology could support valuable initiatives, such as generating speech for individuals with aphasia or people with amyotrophic lateral sclerosis (ALS).