Audio Sample from "FastSpeech: Fast, Robust and Controllable Text to Speech"
Reddit Discussions: Click me
- Yi Ren* (Zhejiang University) email@example.com
- Yangjun Ruan* (Zhejiang University) firstname.lastname@example.org
- Xu Tan (Microsoft Research) email@example.com
- Tao Qin (Microsoft Research) firstname.lastname@example.org
- Sheng Zhao (Microsoft STC Asia) Sheng.Zhao@microsoft.com
- Zhou Zhao (Zhejiang University) email@example.com
- Tie-Yan Liu (Microsoft Research) firstname.lastname@example.org
* Equal contribution.
Neural network based end-to-end text to speech (TTS) has significantly improved the quality of synthesized speech. Prominent methods (e.g., Tacotron 2) usually first generate mel-spectrogram from text, and then synthesize speech from mel-spectrogram using vocoder such as WaveNet. Compared with traditional concatenative and statistical parametric approaches, neural network based end-to-end models suffer from slow inference speed, and the synthesized speech is usually not robust (i.e., some words are skipped or repeated) and lack of controllability (voice speed or prosody control). In this work, we propose a novel feed-forward network based on Transformer to generate mel-spectrogram in parallel for TTS. Specifically, we extract attention alignments from an encoder-decoder based teacher model for phoneme duration prediction, which is used by a length regulator to expand the source phoneme sequence to match the length of target mel-spectrogram sequence for parallel mel-spectrogram generation. Experiments on the LJSpeech dataset show that our parallel model matches autoregressive models in terms of speech quality, nearly eliminates the problem of word skipping and repeating in particularly hard cases, and can adjust voice speed smoothly. Most importantly, compared with autoregressive Transformer TTS, our model speeds up the mel-spectrogram generation by 270x and the end-to-end speech synthesis by 38x. Therefore, we call our model FastSpeech. We will release the code on Github once the paper is published.
All of the audio samples use WaveGlow as vocoder.
I will quote an extract from the reverend gentleman’s own journal.
He is quite content to die
The result of the recommendation of the committee of 1862 was the Prison Act of 1865
The felons’ side has a similar accommodation, and this mode of introducing the beverage is adopted because no publican as such
He had prospered in early life, was a slop-seller on a large scale at Bury St. Edmunds, and a sugar-baker in the metropolis
You can call me directly at four two five seven zero three seven three four four or my cell four two five four four four seven four seven four or send me a meeting request with all the appropriate information.
To deliver interfaces that are significantly better suited to create and process RFC eight twenty one , RFC eight twenty two , RFC nine seventy seven , and MIME content.
Http0XX , Http1XX , Http2XX , Http3XX
was executed on a gibbet in front of his victim’s house.
For a while the preacher addresses himself to the congregation at large, who listen attentively
he put them also to the sword.
The nature of the protective assignment.
Breaking between words
as yet | no rules or regulations had been printed or prepared
that he appeared to feel deeply the force of the reverend gentleman’s observations especially | when the chaplain spoke of
after dinner | he went into hiding for a day or two
the man bousfield however | whose execution was so sadly bungled
here again | probably it was partly the love of notoriety which was the incentive
The evaluation experiments are conducted on the server with 12 Intel Xeon CPU, 256GB memory and 1 NVIDIA V100 GPU.
Compared with autoregressive Transformer TTS, our model speeds up the mel-spectrogram generation by 270x and the end-to-end speech synthesis by 38x.
|Inference Time vs. Mel Length (FastSpeech)||Inference Time vs. Mel Length (Transformer TTS)|
We also visualize the relationship between the inference latency and the length of the predicted mel-spectrogram sequence in the test set. It can be found from Figure that the inference latency barely increases with the length of the predicted mel-spectrogram for FastSpeech, while increases largely in Transformer TTS. This indicates that the inference speed of our method is not sensitive to the length of generated audio due to parallel generation.
Code will be released after the paper is accepted by the conference.