Microsoft's new AI tool VALL-E is a text-to-speech model that can generate high-quality speech in a speaker's voice from just a three-second audio sample.
Using Meta's EnCodec audio compression technology, VALL-E can preserve a speaker's emotional tone and acoustic environment. It has been trained on roughly 60,000 hours of English-language speech, which its researchers say lets it imitate a voice it has never heard before.
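To give a rough sense of how little data a three-second sample amounts to once it has been compressed, here is a minimal back-of-the-envelope sketch. The figures are assumptions that do not appear in this article: they reflect the commonly cited settings of EnCodec's 24 kHz model at 6 kbps (75 frames per second, 8 codebooks of 1,024 entries each), and the function names are purely illustrative.

```python
# Assumed EnCodec settings (not stated in the article): the 24 kHz model
# emits 75 frames per second, and at 6 kbps each frame carries 8 codebook
# indices drawn from 1024-entry codebooks.
FRAMES_PER_SECOND = 75
CODEBOOKS = 8
CODEBOOK_SIZE = 1024


def prompt_token_count(seconds: float) -> int:
    """Number of discrete codec tokens representing `seconds` of audio."""
    return round(seconds * FRAMES_PER_SECOND) * CODEBOOKS


def bitrate_bps() -> int:
    """Bits per second implied by the assumed settings."""
    # Bits needed to address one codebook entry (10 bits for 1024 entries).
    bits_per_index = (CODEBOOK_SIZE - 1).bit_length()
    return FRAMES_PER_SECOND * CODEBOOKS * bits_per_index


print(prompt_token_count(3))  # 1800
print(bitrate_bps())          # 6000
```

Under these assumptions, the three-second voice sample is reduced to about 1,800 discrete tokens, which is the kind of compact representation a language-model-style system can work with.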
With VALL-E, users can synthesise personalised speech, edit recordings, and create audio content in combination with other generative AI models.
The model raises privacy concerns and carries risks of misuse, such as spoofing voice-identification systems or impersonating a specific speaker. Microsoft has said it will work to contain these risks in future iterations.
The research paper on VALL-E, available on arXiv, the preprint server hosted by Cornell University, states, "Experiment results show that VALL-E significantly outperforms the state-of-the-art zero-shot TTS system [AI that recreates voices it's never heard] in terms of speech naturalness and speaker similarity. In addition, we find VALL-E could preserve the speaker's emotion and acoustic environment of the acoustic prompt in synthesis."
Another example in the world of TTS is Tacotron 2, an end-to-end system proposed by Google Brain that produces natural-sounding speech from a script. It substantially narrowed the gap with human speech in naturalness and more closely approximates the human voice in prosody and intonation.
Recently, Microsoft has also reportedly offered a $10 billion investment in OpenAI, the company that introduced GPT-3, a language model capable of generating highly human-like text.
Copyright©2023 Living Media India Limited. For reprint rights: Syndications Today