NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality
arXiv: arXiv:2205.04421
Authors
Xu Tan, Jiawei Chen, Haohe Liu, Jian Cong, Chen Zhang, Yanqing Liu, Xi Wang, Yichong Leng, Yuanhao Yi, Lei He, Frank Soong, Tao Qin, Sheng Zhao, Tie-Yan Liu
Microsoft Research Asia & Microsoft Azure Speech
Abstract
Text to speech (TTS) has made rapid progress in both academia and industry in recent years. This naturally raises several questions: whether a TTS system can achieve human-level quality, how to define and judge that quality, and how to achieve it. In this paper, we answer these questions by first defining human-level quality based on the statistical significance of a subjective measure and introducing appropriate guidelines to judge it, and then developing a TTS system called NaturalSpeech that achieves human-level quality on a benchmark dataset. Specifically, we leverage a variational autoencoder (VAE) for end-to-end text-to-waveform generation, with several key modules to enhance the capacity of the prior from text and reduce the complexity of the posterior from speech, including phoneme pre-training, differentiable duration modeling, bidirectional prior/posterior modeling, and a memory mechanism in the VAE. Experimental evaluations on the popular LJSpeech dataset show that NaturalSpeech achieves -0.01 CMOS (comparative mean opinion score) relative to human recordings at the sentence level, with a Wilcoxon signed-rank test at p-level p >> 0.05, which demonstrates no statistically significant difference from human recordings for the first time on this dataset.
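To make the evaluation criterion concrete, the following is a minimal sketch of how a CMOS value and a two-sided Wilcoxon signed-rank p-value could be computed from paired listening-test scores. It is not the authors' evaluation script. The function names `cmos` and `wilcoxon_signed_rank` are hypothetical, the ratings are invented for illustration, and the scores are assumed to be integers on a -3 to +3 scale where positive means the synthesized sample was preferred. The test uses the normal approximation with a tie correction and no continuity correction, so it is only reasonable for moderately large samples.

```python
import math
from statistics import mean


def cmos(scores):
    """Comparative MOS: the mean of per-comparison preference scores."""
    return mean(scores)


def wilcoxon_signed_rank(scores):
    """Two-sided Wilcoxon signed-rank test via the normal approximation.

    Zero scores are dropped, tied magnitudes receive average ranks, and the
    variance is corrected for ties. Returns (W_plus, p_value).
    """
    nonzero = [s for s in scores if s != 0]
    n = len(nonzero)
    if n == 0:
        return 0.0, 1.0  # no evidence of a difference in either direction

    # Rank the absolute values, assigning average ranks to ties.
    order = sorted(range(n), key=lambda i: abs(nonzero[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(nonzero[order[j + 1]]) == abs(nonzero[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        t = j - i + 1
        tie_term += t**3 - t
        i = j + 1

    # Sum of ranks for positive scores, compared against its null distribution.
    w_plus = sum(r for r, s in zip(ranks, nonzero) if s > 0)
    mean_w = n * (n + 1) / 4
    var_w = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    if var_w == 0:
        return w_plus, 1.0
    z = (w_plus - mean_w) / math.sqrt(var_w)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return w_plus, p_value


if __name__ == "__main__":
    # Invented ratings (system minus recording); NOT data from the paper.
    ratings = [0, 1, -1, 0, 2, -2, 1, -1, 0, 0, -1, 1]
    w_plus, p = wilcoxon_signed_rank(ratings)
    print(f"CMOS = {cmos(ratings):+.2f}, W+ = {w_plus}, p = {p:.3f}")
```

Under this reading, a CMOS close to zero together with a p-value well above 0.05 means the listeners' preferences give no statistically significant evidence that the synthesized speech differs from the recordings, which is the sense of "human-level quality" used in the abstract.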
Audio Samples
Comparison with Recording
The lax discipline maintained in Newgate was still further deteriorated by the presence of two other classes of prisoners who ought never to have been inmates of such a jail.
NaturalSpeech | Recording |
---|---|
Maltby and Co. would issue warrants on them deliverable to the importer, and the goods were then passed to be stored in neighboring warehouses.
NaturalSpeech | Recording |
---|---|
as effectually to rebuke and abash the profane spirit of the more insolent and daring of the criminals.
NaturalSpeech | Recording |
---|---|
it is not possible to state with scientific certainty that a particular small group of fibers come from a certain piece of clothing.
NaturalSpeech | Recording |
---|---|
who had borne the Queen's commission, first as cornet, and then lieutenant, in the 10th Hussars.
NaturalSpeech | Recording |
---|---|
Our Related Works
Some speech research conducted at Microsoft Research Asia
DelightfulTTS: The Microsoft Speech Synthesis System for Blizzard Challenge 2021
FastSpeech: Fast, Robust and Controllable Text to Speech
FastSpeech 2: Fast and High-Quality End-to-End Text-to-Speech
AdaSpeech: Adaptive Text to Speech for Custom Voice
AdaSpeech 2: Adaptive Text to Speech with Untranscribed Data
AdaSpeech 3: Adaptive Text to Speech for Spontaneous Style
AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios
PriorGrad: Improving Conditional Denoising Diffusion Models with Data-Dependent Adaptive Prior