Microsoft’s new AI can simulate anyone’s voice with 3 seconds of audio

My Voice is no longer my password —

Text-to-speech model can preserve speaker’s emotional tone and acoustic environment.

Benj Edwards – Jan 9, 2023 10:15 pm UTC

An AI-generated image of a person’s silhouette.
Ars Technica

On Thursday, Microsoft researchers announced a new text-to-speech AI model called VALL-E that can closely simulate a person’s voice when given a three-second audio sample. Once it learns a specific voice, VALL-E can synthesize audio of that person saying anything, and do it in a way that attempts to preserve the speaker’s emotional tone.

Its creators speculate that VALL-E could be used for high-quality text-to-speech applications, speech editing where a recording of a person could be edited and changed from a text transcript (making them say something they originally didn’t), and audio content creation when combined with other generative AI models like GPT-3.

Microsoft calls VALL-E a “neural codec language model,” and it builds on a technology called EnCodec, which Meta announced in October 2022. Unlike other text-to-speech methods that typically synthesize speech by manipulating waveforms, VALL-E generates discrete audio codec codes from text and acoustic prompts. It basically analyzes how a person sounds, breaks that information into discrete components (called “tokens”) thanks to EnCodec, and uses training data to match what it “knows” about how that voice would sound if it spoke other phrases outside of the three-second sample. Or, as Microsoft puts it in the VALL-E paper:

To synthesize personalized speech (e.g., zero-shot TTS), VALL-E generates the corresponding acoustic tokens conditioned on the acoustic tokens of the 3-second enrolled recording and the phoneme prompt, which constrain the speaker and content information respectively. Finally, the generated acoustic tokens are used to synthesize the final waveform with the corresponding neural codec decoder.
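
To make the idea of “tokens” concrete, here is a minimal Python sketch. It is not EnCodec and not Microsoft’s code: EnCodec learns its codebooks with a neural network, whereas this toy assumes a fixed, hypothetical codebook of 16 amplitude levels. The names `CODEBOOK`, `encode`, `decode`, and `build_prompt` are invented for illustration. It only shows the shape of the process the paper describes: audio becomes discrete codes, those codes sit alongside a phoneme prompt as conditioning, and a decoder turns codes back into a waveform.

```python
import math

# Hypothetical toy codebook: 16 evenly spaced amplitude levels in [-1, 1].
# EnCodec learns its codebooks; this fixed table is only a stand-in.
CODEBOOK = [-1.0 + 2.0 * i / 15 for i in range(16)]


def encode(samples):
    """Map each sample to the index of its nearest codebook entry (a "token")."""
    return [
        min(range(len(CODEBOOK)), key=lambda i: abs(CODEBOOK[i] - s))
        for s in samples
    ]


def decode(tokens):
    """Turn tokens back into approximate sample values."""
    return [CODEBOOK[t] for t in tokens]


def build_prompt(phoneme_ids, acoustic_prompt_tokens):
    """Pair the content (phonemes) with the speaker prompt (acoustic tokens)."""
    return {"content": list(phoneme_ids), "speaker": list(acoustic_prompt_tokens)}


# A short sine wave stands in for the three-second enrolled recording.
wave = [math.sin(2 * math.pi * n / 8) for n in range(8)]
tokens = encode(wave)
restored = decode(tokens)

# Quantization loses some detail; the error is at most half a codebook step.
max_error = max(abs(a - b) for a, b in zip(wave, restored))

# The phoneme IDs here are made up; a real system derives them from the text.
prompt = build_prompt([3, 7, 1], tokens)
```

In this toy, the step that matters most is missing: the language model that reads `prompt` and predicts new acoustic tokens in the same voice. That part is what VALL-E’s training on speech data provides.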

Microsoft trained VALL-E’s speech-synthesis capabilities on an audio library, assembled by Meta, called LibriLight. It contains 60,000 hours of English language speech from more than 7,000 speakers, mostly pulled from LibriVox public domain audiobooks. For VALL-E to generate a good result, the voice in the three-second sample must closely match a voice in the training data.

On the VALL-E example website, Microsoft provides dozens of audio examples of the AI model in action. Among the samples, the “Speaker Prompt” is the three-second audio provided to VALL-E that it must imitate. The “Ground Truth” is a pre-existing recording of that same speaker saying a particular phrase for comparison purposes (sort of like the “control” in the experiment). The “Baseline” is an example of synthesis provided by a conventional text-to-speech synthesis method, and the “VALL-E” sample is the output from the VALL-E model.

A block diagram of VALL-E provided by Microsoft researchers.
Microsoft

While using VALL-E to generate those results, the researchers only fed the three-second “Speaker Prompt” sample and a text string (what they wanted the voice to say) into VALL-E. So compare the “Ground Truth” sample to the “VALL-E” sample. In some cases, the two samples are very close. Some VALL-E results seem computer-generated, but others could potentially be mistaken for a human’s speech, which is the goal of the model.

In addition to preserving a speaker’s vocal timbre and emotional tone, VALL-E can also imitate the “acoustic environment” of the sample audio. For example, if the sample came from a telephone call, the audio output will simulate the acoustic and frequency properties of a telephone call in its synthesized output (that’s a fancy way of saying it will sound like a telephone call, too). And Microsoft’s samples (in the “Synthesis of Diversity” section) demonstrate that VALL-E can generate variations in voice tone by changing the random seed used in the generation process.
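
Why would a random seed change how a voice sounds? One plausible reading, sketched below in Python, is that the model picks each output token by sampling from a probability distribution rather than always taking the single most likely one. This is an assumption for illustration, not a description of Microsoft’s implementation; the distribution `TOKEN_PROBS` and the function `sample_tokens` are hypothetical.

```python
import random

# Hypothetical next-token distribution over four acoustic tokens.
# A real model would produce a new distribution at every step.
TOKEN_PROBS = {0: 0.4, 1: 0.3, 2: 0.2, 3: 0.1}


def sample_tokens(seed, length=20):
    """Draw a token sequence using a random generator fixed by `seed`."""
    rng = random.Random(seed)
    tokens = list(TOKEN_PROBS)
    weights = list(TOKEN_PROBS.values())
    return rng.choices(tokens, weights=weights, k=length)


take_a = sample_tokens(seed=1)
take_b = sample_tokens(seed=1)  # same seed: identical "take"
take_c = sample_tokens(seed=2)  # different seed: a different "take"
```

The same seed reproduces the same sequence, while a new seed yields a different one drawn from the same distribution, which is roughly analogous to getting a different reading of the same line.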

Perhaps owing to VALL-E’s ability to potentially fuel mischief and deception, Microsoft has not provided VALL-E code for others to experiment with, so we could not test VALL-E’s capabilities. The researchers seem aware of the potential social harm that this technology could bring. For the paper’s conclusion, they write:

“Since VALL-E could synthesize speech that maintains speaker identity, it may carry potential risks in misuse of the model, such as spoofing voice identification or impersonating a specific speaker. To mitigate such risks, it is possible to build a detection model to discriminate whether an audio clip was synthesized by VALL-E. We will also put Microsoft AI Principles into practice when further developing the models.”
