Microsoft’s new AI can simulate anyone’s voice with 3 seconds of audio

My Voice is no longer my password

Text-to-speech model can preserve speaker’s emotional tone and acoustic environment.

Benj Edwards – Jan 9, 2023 10:15 pm UTC

An AI-generated image of a person’s silhouette.
Ars Technica

On Thursday, Microsoft researchers announced a new text-to-speech AI model called VALL-E that can closely simulate a person’s voice when given a three-second audio sample. Once it learns a specific voice, VALL-E can synthesize audio of that person saying anything, and do it in a way that attempts to preserve the speaker’s emotional tone.

Its creators speculate that VALL-E could be used for high-quality text-to-speech applications, speech editing where a recording of a person could be edited and changed from a text transcript (making them say something they originally didn’t), and audio content creation when combined with other generative AI models like GPT-3.

Microsoft calls VALL-E a “neural codec language model,” and it builds on a technology called EnCodec, which Meta announced in October 2022. Unlike other text-to-speech methods that typically synthesize speech by manipulating waveforms, VALL-E generates discrete audio codec codes from text and acoustic prompts. It basically analyzes how a person sounds, breaks that information into discrete components (called “tokens”) thanks to EnCodec, and uses training data to match what it “knows” about how that voice would sound if it spoke other phrases outside of the three-second sample. Or, as Microsoft puts it in the VALL-E paper:

To synthesize personalized speech (e.g., zero-shot TTS), VALL-E generates the corresponding acoustic tokens conditioned on the acoustic tokens of the 3-second enrolled recording and the phoneme prompt, which constrain the speaker and content information respectively. Finally, the generated acoustic tokens are used to synthesize the final waveform with the corresponding neural codec decoder.
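To make the idea of “acoustic tokens” more concrete, here is a minimal toy sketch in Python. It is not VALL-E or EnCodec code: the four-entry `CODEBOOK`, the `encode` and `decode` helpers, and the sample values are all hypothetical, and a real neural codec learns far larger codebooks over short frames of audio rather than individual samples. The sketch only shows the general principle of mapping continuous values to discrete indices and back.

```python
# Toy illustration only: one tiny lookup table stands in for a neural codec.
# All names and values here are hypothetical, not taken from VALL-E or EnCodec.

CODEBOOK = [-0.75, -0.25, 0.25, 0.75]  # hypothetical "learned" code values


def encode(samples):
    """Map each sample to the index (token) of its nearest codebook entry."""
    return [
        min(range(len(CODEBOOK)), key=lambda i: abs(CODEBOOK[i] - s))
        for s in samples
    ]


def decode(tokens):
    """Map tokens back to approximate sample values (the decoder step)."""
    return [CODEBOOK[t] for t in tokens]


prompt_audio = [0.8, 0.3, -0.2, -0.9, 0.1]   # stand-in for an audio prompt
prompt_tokens = encode(prompt_audio)          # discrete tokens
reconstructed = decode(prompt_tokens)         # approximate waveform values

print(prompt_tokens)
print(reconstructed)
```

In this simplified picture, a model like the one described would predict new token sequences conditioned on the prompt tokens and the text, and a decoder would then turn those tokens back into audio.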

Microsoft trained VALL-E’s speech-synthesis capabilities on an audio library, assembled by Meta, called LibriLight. It contains 60,000 hours of English language speech from more than 7,000 speakers, mostly pulled from LibriVox public domain audiobooks. For VALL-E to generate a good result, the voice in the three-second sample must closely match a voice in the training data.

On the VALL-E example website, Microsoft provides dozens of audio examples of the AI model in action. Among the samples, the “Speaker Prompt” is the three-second audio provided to VALL-E that it must imitate. The “Ground Truth” is a pre-existing recording of that same speaker saying a particular phrase for comparison purposes (sort of like the “control” in the experiment). The “Baseline” is an example of synthesis provided by a conventional text-to-speech synthesis method, and the “VALL-E” sample is the output from the VALL-E model.

A block diagram of VALL-E provided by Microsoft researchers.
Microsoft

While using VALL-E to generate those results, the researchers only fed the three-second “Speaker Prompt” sample and a text string (what they wanted the voice to say) into VALL-E. So compare the “Ground Truth” sample to the “VALL-E” sample. In some cases, the two samples are very close. Some VALL-E results seem computer-generated, but others could potentially be mistaken for a human’s speech, which is the goal of the model.

In addition to preserving a speaker’s vocal timbre and emotional tone, VALL-E can also imitate the “acoustic environment” of the sample audio. For example, if the sample came from a telephone call, the audio output will simulate the acoustic and frequency properties of a telephone call in its synthesized output (that’s a fancy way of saying it will sound like a telephone call, too). And Microsoft’s samples (in the “Synthesis of Diversity” section) demonstrate that VALL-E can generate variations in voice tone by changing the random seed used in the generation process.
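The role of the random seed can be illustrated with a small Python sketch, assuming the model picks each token by sampling from a probability distribution. The `TOKEN_PROBS` table and the `sample_tokens` helper are hypothetical stand-ins; a real model would compute fresh probabilities at every step from the text and the speaker prompt.

```python
import random

# Hypothetical next-token probabilities; a real model would compute these
# at every step instead of reusing one fixed table.
TOKEN_PROBS = {0: 0.1, 1: 0.4, 2: 0.3, 3: 0.2}


def sample_tokens(seed, length=8):
    """Draw a token sequence from TOKEN_PROBS using a seeded generator."""
    rng = random.Random(seed)
    tokens = list(TOKEN_PROBS)
    weights = list(TOKEN_PROBS.values())
    return rng.choices(tokens, weights=weights, k=length)


take_a = sample_tokens(seed=1)
take_a_again = sample_tokens(seed=1)
take_b = sample_tokens(seed=2)

print(take_a)
print(take_a_again)
print(take_b)
```

The same seed reproduces the same sequence, while a different seed will generally produce a different one. That is one plausible way a single prompt and text string could yield several distinct “takes.”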

Perhaps owing to VALL-E’s ability to potentially fuel mischief and deception, Microsoft has not provided VALL-E code for others to experiment with, so we could not test VALL-E’s capabilities. The researchers seem aware of the potential social harm that this technology could bring. For the paper’s conclusion, they write:

“Since VALL-E could synthesize speech that maintains speaker identity, it may carry potential risks in misuse of the model, such as spoofing voice identification or impersonating a specific speaker. To mitigate such risks, it is possible to build a detection model to discriminate whether an audio clip was synthesized by VALL-E. We will also put Microsoft AI Principles into practice when further developing the models.”
