
    Talking artificial intelligence from Microsoft is capable of imitating a voice by listening to it for only 3 seconds

    VALL-E can preserve the emotional tone of the original speaker and even simulate the speaker's acoustic environment.

    Microsoft engineers have developed ‘VALL-E’, a new artificial intelligence (AI) tool that can simulate a person's voice after listening to it for only 3 seconds. The application is based on an audio compression technology called ‘EnCodec’, developed by Meta (classified in Russia as an extremist organization), its authors reported in a publication pending peer review.


    Microsoft used EnCodec technology as a way to make text-to-speech (TTS) synthesis sound realistic from a very limited source sample. During the AI's training stage, they used 60,000 hours of English speech, which is hundreds of times more than existing systems.



    According to its creators, VALL-E displays in-context learning capabilities and can synthesize a high-quality personalized voice from just a 3-second recording. The results of the experiment show that VALL-E significantly outperforms state-of-the-art zero-shot TTS systems (those not trained on the voice they simulate) in terms of naturalness of speech and speaker similarity. Furthermore, they argue that VALL-E can preserve the speaker’s emotion and the acoustic environment in the speech synthesized from text.



    Despite its notable achievements, Microsoft researchers drew attention to some problems with the tool. In particular, they noted that some words may come out unclear, omitted or duplicated in the synthesized speech. Another shortcoming is that it still cannot cover everyone’s voice, especially that of accented speakers. They also acknowledged that the diversity of speaking styles is limited, since LibriLight (the database used for training) is an audiobook dataset in which most utterances are in a reading style.




    Microsoft engineers warned that because VALL-E can synthesize speech that preserves the speaker's identity, the model carries potential risks of misuse. Examples include spoofing voice identification systems or impersonating a specific speaker to produce a ‘deepfake’.

    ‘Deepfakes’ are video, image or voice files created with an artificial intelligence program to very realistically replace the likeness or voice of the people in the content with those of other people.

    Source: RT

    This post is published by Awutar staff members. Awutar is a global multimedia website. Our Email: [email protected]

