AdaSpeech 2: Adaptive Text to Speech with Untranscribed Data

Authors

Yuzi Yan (EE, Tsinghua University) yan-yz17@mails.tsinghua.edu.cn
Xu Tan (Microsoft Research Asia) xuta@microsoft.com
Bohan Li (Microsoft Azure Speech) bohan.li@microsoft.com
Tao Qin (Microsoft Research Asia) taoqin@microsoft.com
Sheng Zhao (Microsoft Azure Speech) szhao@microsoft.com
Yuan Shen (EE, Tsinghua University) shenyuan_ee@tsinghua.edu.cn
Tie-Yan Liu (Microsoft Research Asia) tie-yan.liu@microsoft.com

Audio Samples

All audio samples use MelGAN as the vocoder.

Audio Quality

When a man looks for something beyond his reach, his friends say he is looking for the pot of gold at the end of the rainbow.

VCTK Female

GT	GT Mel+Vocoder	Joint-training

PPG-based	AdaSpeech	Ours (AdaSpeech 2)

VCTK Male

GT	GT Mel+Vocoder	Joint-training

PPG-based	AdaSpeech	Ours (AdaSpeech 2)

Some have accepted it as a miracle without physical explanation.

VCTK Female

GT	GT Mel+Vocoder	Joint-training

PPG-based	AdaSpeech	Ours (AdaSpeech 2)

VCTK Male

GT	GT Mel+Vocoder	Joint-training

PPG-based	AdaSpeech	Ours (AdaSpeech 2)

The invention of movable metal letters in the middle of the fifteenth century may justly be considered as the invention of the art of printing.

LJSpeech

GT	GT Mel+Vocoder	Joint-training

PPG-based	AdaSpeech	Ours (AdaSpeech 2)

Analyses on adaptation strategy

When a man looks for something beyond his reach, his friends say he is looking for the pot of gold at the end of the rainbow.

Origin	Without L2 loss constraint	Fine-tune mel encoder & decoder

Varying Adaptation Data

Please call Stella.

1 sample	2 samples	5 samples	10 samples

20 samples	50 samples	100 samples
