NaturalSpeech 3:

Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models

[Paper]

Zeqian Ju¹,²*, Yuancheng Wang³*, Kai Shen⁴,¹*, Xu Tan¹*, Detai Xin¹,⁵, Dongchao Yang¹, Yanqing Liu¹, Yichong Leng¹, Kaitao Song¹, Siliang Tang⁴, Zhizheng Wu³, Tao Qin¹, Xiang-Yang Li², Wei Ye⁶, Shikun Zhang⁶, Jiang Bian¹, Lei He¹, Jinyu Li¹, Sheng Zhao¹

¹Microsoft Research Asia & Microsoft Azure Speech

²University of Science and Technology of China

³The Chinese University of Hong Kong, Shenzhen

⁴Zhejiang University

⁵The University of Tokyo

⁶Peking University

Abstract. While recent large-scale text-to-speech (TTS) models have achieved significant progress, they still fall short in speech quality, similarity, and prosody. Because speech intricately encompasses various attributes (e.g., content, prosody, timbre, and acoustic details) that pose significant challenges for generation, a natural idea is to factorize speech into individual subspaces representing different attributes and generate them individually. Motivated by this, we propose NaturalSpeech 3, a TTS system with novel factorized diffusion models to generate natural speech in a zero-shot way. Specifically, 1) we design a neural codec with factorized vector quantization (FVQ) to disentangle the speech waveform into subspaces of content, prosody, timbre, and acoustic details; 2) we propose a factorized diffusion model to generate attributes in each subspace following its corresponding prompt. With this factorization design, NaturalSpeech 3 can effectively and efficiently model intricate speech with disentangled subspaces in a divide-and-conquer way. Experiments show that NaturalSpeech 3 outperforms state-of-the-art TTS systems on quality, similarity, prosody, and intelligibility. Furthermore, we achieve better performance by scaling to 1B parameters and 200K hours of training data.
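The factorization idea in the abstract can be illustrated with a toy sketch: one latent sequence is projected into several attribute subspaces, and each subspace is quantized against its own codebook. This is only an illustration under stated assumptions, not the FACodec implementation. The random projections, the dimensions, the function names, and the treatment of timbre as a single utterance-level vector are all hypothetical choices made for the example.

```python
import numpy as np

def quantize(frames, codebook):
    """Map each frame (row) to its nearest codebook entry (squared Euclidean distance)."""
    # dists[t, k] = ||frames[t] - codebook[k]||^2
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=1)
    return idx, codebook[idx]

def factorized_vq(latent, projections, codebooks):
    """Quantize one latent sequence separately in each attribute subspace."""
    codes = {}
    recon = np.zeros_like(latent)
    for name, proj in projections.items():
        sub = latent @ proj                      # project into the attribute subspace
        idx, q = quantize(sub, codebooks[name])  # discrete tokens for this attribute
        codes[name] = idx
        recon += q @ proj.T                      # map back and sum the attribute parts
    # Assumption of this sketch: timbre is one global vector per utterance.
    timbre = latent.mean(axis=0)
    return codes, timbre, recon

rng = np.random.default_rng(0)
# Frames, latent dim, subspace dim, codebook size: all hypothetical values.
T, D, d, K = 50, 24, 8, 16
names = ["prosody", "content", "acoustic_details"]
projections = {n: rng.standard_normal((D, d)) / np.sqrt(D) for n in names}
codebooks = {n: rng.standard_normal((K, d)) for n in names}
latent = rng.standard_normal((T, D))

codes, timbre, recon = factorized_vq(latent, projections, codebooks)
for n in names:
    print(n, codes[n][:5])
```

Each attribute ends up with its own token sequence, which is what lets a downstream generator condition each subspace on a different prompt.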

Update: We have released the code and pre-trained checkpoint of FACodec.

This research is done in alignment with Microsoft's responsible AI principles.

This page is for research demonstration purposes only.

Overview

(a) Overview of NaturalSpeech 3, with a neural speech codec for speech attribute factorization and a factorized diffusion model. (b) Data and model scaling of NaturalSpeech 3 on an internal test set.

Zero-Shot TTS Samples (Emotion)

The text "Why fades the lotus of the water" is sampled from LibriSpeech.

Prompt Emotion	Prompt	NaturalSpeech 3
neutral
happy
calm
sad
angry
fearful
disgust
surprised

Zero-Shot TTS Samples

Text	Prompt	NaturalSpeech 3
They think you're proud because you've been away to school or something.
There was an average cost per lamp for meter operation of twenty two cents a year, and each meter took care of an average of seventeen lamps.
For the past ten years, Conseil had gone with me wherever science beckoned.
There were only four stationers of any consequences in the town, and at each Holmes produced his pencil chips, and bid high for a duplicate.

Attribute Manipulation

Duration manipulation.

Duration Prompt	Other Prompts	NaturalSpeech 3
Original prompt.
0.7x speed.
1.3x speed.
New prompt with a fast speech rate.

Prosody manipulation.

Prosody Prompt	Other Prompts	NaturalSpeech 3
Original prompt.
New prompt in sad emotion.

Timbre manipulation.

Timbre Prompt	Other Prompts	NaturalSpeech 3
Original prompt.
New prompt from a different speaker.

Comparison Results on LibriSpeech Benchmark

(P) denotes results reported in the original paper. (R) denotes reproduced results.

System	Sim-O↑	Sim-R↑	WER↓	CMOS↑	SMOS↑
Ground Truth	0.68	-	1.94	+0.08	3.85
NaturalSpeech 2	0.55	0.62	1.94	-0.18	3.65
Voicebox	0.64	0.67	2.03	-0.23	3.69
Voicebox (R)	0.48	0.50	2.14	-0.32	3.52
VALL-E (P)	-	0.58	5.90	-	-
VALL-E (R)	0.47	0.51	6.11	-0.60	3.46
Mega-TTS 2	0.53	-	2.32	-0.20	3.63
UniAudio	0.57	0.68	2.49	-0.25	3.71
StyleTTS 2	0.38	-	2.49	-0.21	3.07
HierSpeech++	0.51	-	6.33	-0.41	3.50
NaturalSpeech 3	0.67	0.76	1.81	0.00	4.01
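
Two of the objective columns above can be sketched in code. The page does not define them, so the following rests on the usual conventions: WER is the word-level edit distance between the transcript of the synthesized speech and the reference text, divided by the reference length and reported in percent; Sim-O and Sim-R are cosine similarities between speaker embeddings. The function names are hypothetical, and the speaker-embedding model and the speech recognizer are outside the sketch, which takes plain text and plain vectors as input.

```python
import numpy as np

def word_error_rate(reference, hypothesis):
    """Word error rate in percent: word-level edit distance / reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return 100.0 * prev[-1] / len(ref)

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# One substituted word out of ten reference words.
ref = "for the past ten years conseil had gone with me"
hyp = "for the past ten years council had gone with me"
print(word_error_rate(ref, hyp))

# Hypothetical 3-dimensional "speaker embeddings" for illustration only.
print(cosine_similarity([1.0, 0.0, 1.0], [1.0, 0.0, 1.0]))
```

Higher similarity and lower WER are better, matching the arrows in the table header.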

(R) denotes reproduced results.

Text	NaturalSpeech 3	NaturalSpeech 2	Voicebox	Voicebox (R)	VALL-E (R)	Mega-TTS 2	UniAudio	StyleTTS 2	HierSpeech++
It is this that is of interest to theory of knowledge.	Sim-O: 0.73	Sim-O: 0.65	Sim-O: 0.71	Sim-O: 0.49	Sim-O: 0.55	Sim-O: 0.55	Sim-O: 0.62	Sim-O: 0.39	Sim-O: 0.59
For, like as not, they must have thought him a prince when they saw his fine cap.	Sim-O: 0.73	Sim-O: 0.43	Sim-O: 0.62	Sim-O: 0.54	Sim-O: 0.45	Sim-O: 0.42	Sim-O: 0.47	Sim-O: 0.40	Sim-O: 0.46
What you had best do, my child, is to keep it and pray to it that since it was a witness to your undoing, it will deign to vindicate your cause by its righteous judgment.	Sim-O: 0.69	Sim-O: 0.66	Sim-O: 0.59	Sim-O: 0.59	Sim-O: 0.61	Sim-O: 0.46	Sim-O: 0.53	Sim-O: 0.52	Sim-O: 0.49
The strong position held by the Edison system under the strenuous competition that was already springing up was enormously improved by the introduction of the three wire system and it gave an immediate impetus to incandescent lighting.	Sim-O: 0.81	Sim-O: 0.64	Sim-O: 0.67	Sim-O: 0.65	Sim-O: 0.65	Sim-O: 0.60	Sim-O: 0.45	Sim-O: 0.44	Sim-O: 0.55

Comparison Results on Ravdess Benchmark

(R) denotes reproduced results.

System	Average MCD↓	MCD-Acc↑	CMOS↑	SMOS↑
Ground Truth	0.00	1.00	+0.17	4.42
NaturalSpeech 2	4.56	0.25	-0.22	4.04
Voicebox (R)	4.88	0.34	-0.34	3.92
VALL-E (R)	5.03	0.34	-0.55	3.80
Mega-TTS 2	4.44	0.39	-0.20	4.51
StyleTTS 2	4.50	0.40	-0.25	3.98
HierSpeech++	6.08	0.30	-0.37	3.87
NaturalSpeech 3	4.28	0.52	0.00	4.72

Ravdess has only two texts: "Dogs are sitting by the door." for the prompt text, and "Kids are talking by the door." for the synthesis text. (R) denotes reproduced results.

Prompt Emotion	Prompt	Ground Truth	NaturalSpeech 3	NaturalSpeech 2	Voicebox (R)	VALL-E (R)	Mega-TTS 2	StyleTTS 2	HierSpeech++
neutral
happy
calm
sad
angry
fearful
disgust
surprised

FACodec: Reconstruction Samples

(R) denotes reproduced results.

Ground Truth	FACodec 4.8 kbps	SoundStream (R) 4.8 kbps	SoundStream (R) 9.6 kbps	Encodec 6.0 kbps	Encodec (R) 5.0 kbps	HiFi-Codec 2.0 kbps	DAC 4.5 kbps
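
The bitrates in the header above follow from simple arithmetic for any codec that emits a fixed number of discrete codes per frame: frames per second times codes per frame times bits per code. The sketch below shows that calculation. The configuration used in the example (16 kHz audio, hop size 200, 6 codebooks of 1024 entries) is an assumption that is not stated on this page; it is chosen only because it yields 4.8 kbps.

```python
import math

def bitrate_kbps(sample_rate_hz, hop_size, num_codebooks, codebook_size):
    """Bitrate of a codec that emits num_codebooks codes per frame."""
    frames_per_second = sample_rate_hz / hop_size
    bits_per_code = math.log2(codebook_size)
    return frames_per_second * num_codebooks * bits_per_code / 1000.0

# Assumed configuration, for illustration only (not taken from this page).
print(bitrate_kbps(sample_rate_hz=16000, hop_size=200,
                   num_codebooks=6, codebook_size=1024))
```

Halving the number of codebooks or doubling the hop size halves the bitrate, which is why codecs in the comparison can be run at several operating points.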

FACodec: Voice Conversion Samples

Prompt	Source	FACodec

Ethics Statement

Since our model can synthesize speech with high speaker similarity, it carries potential risks of misuse, such as spoofing voice identification or impersonating a specific speaker. We conducted the experiments under the assumption that the user agrees to be the target speaker in speech synthesis. To prevent misuse, it is crucial to develop a robust synthesized speech detection model and establish a system for individuals to report any suspected misuse.

Other Related Works

NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers

NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality

PromptTTS 2: Describing and Generating Voices with Text Prompt

Speech research conducted at Microsoft Research Asia
