VALL-E can preserve the original speaker's emotional tone and even simulate their acoustic environment.
Microsoft engineers have developed ‘VALL-E’, a new artificial intelligence (AI) tool that can simulate a person's voice after listening to it for only 3 seconds. The application is based on an audio compression technology called ‘EnCodec’, which was developed by Meta (classified in Russia as an extremist organization), its authors reported in a publication pending peer review.
Microsoft used EnCodec technology to make text-to-speech (TTS) synthesis sound realistic even when based on a very limited source sample. To train the AI, the researchers used 60,000 hours of English speech, which is hundreds of times more data than existing systems use.
According to its creators, VALL-E displays in-context learning capabilities and can be used to synthesize a high-quality custom voice from just a 3-second recording. The results of the experiment show that VALL-E significantly outperforms state-of-the-art zero-shot TTS systems (those not trained on the voice they simulate) in terms of speech naturalness and speaker similarity. Furthermore, they argue that VALL-E can preserve the speaker’s emotion and acoustic environment in the speech synthesized from the text.
Despite its notable achievements, Microsoft researchers drew attention to some problems with the tool. In particular, they noted that some words may be unclear, lost or duplicated in the synthesized speech. They also pointed out that it still cannot cover everyone’s voice, especially that of accented speakers. Finally, they argued that the diversity of speaking styles is insufficient, since LibriLight (the dataset they used for training) is an audiobook dataset in which most utterances are in a reading style.
Microsoft engineers warned that because VALL-E can synthesize speech that maintains the identity of the speaker, the model carries potential risks of misuse. Examples include spoofing voice identification or impersonating a specific speaker to produce a ‘deepfake’.
‘Deepfakes’ are video, image or voice files created with an artificial intelligence program to very realistically replace the likeness of the people appearing in the content with that of other people.