Speech-T: Transducer for Text to Speech and Beyond
Authors
- Jiawei Chen (South China University of Technology) csjiaweichen@mail.scut.edu.cn
- Xu Tan (Microsoft Research Asia) xuta@microsoft.com
- Yichong Leng (University of Science and Technology of China) lyc123go@mail.ustc.edu.cn
- Jin Xu (Tsinghua University) j-xu18@mails.tsinghua.edu.cn
- Guihua Wen (South China University of Technology) crghwen@scut.edu.cn
- Tao Qin (Microsoft Research Asia) taoqin@microsoft.com
- Tie-Yan Liu (Microsoft Research Asia) tyliu@microsoft.com
Abstract
Neural Transducer (e.g., RNN-T) has been widely used in automatic speech recognition (ASR) due to its capabilities of efficiently modeling monotonic alignments between input and output sequences and naturally supporting streaming inputs. Considering that monotonic alignments are also critical to text to speech (TTS) synthesis and streaming TTS is also an important application scenario, in this work, we explore the possibility of applying Transducer to TTS. However, it is challenging because: 1) it is difficult to trade off the emission (continuous mel-spectrogram prediction) probability and transition (the ASR Transducer predicts a blank token to indicate transition to the next input) probability when calculating the output probability lattice in Transducer; 2) it is not easy to learn the alignments between text and speech through the output probability lattice. We propose SpeechTransducer (Speech-T for short), a Transformer-based Transducer model that 1) addresses the first challenge with a new forward algorithm which separates the transition prediction from the continuous mel-spectrogram prediction when deriving the loss function of Transducer, and 2) addresses the second challenge with a diagonal constraint in the probability lattice to help the alignment learning. Experiments on the LJSpeech dataset demonstrate that Speech-T is more robust than the attention-based autoregressive TTS model due to its inherent monotonic alignments between text and speech, and that it can naturally support streaming TTS with good voice quality. Furthermore, we extend Speech-T to support both TTS and ASR together for the first time, which enjoys several advantages including fewer parameters as well as streaming synthesis and recognition in a single model.
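The two ideas in the abstract — a forward recursion that keeps the transition probability separate from the continuous mel-spectrogram prediction, and a diagonal constraint on the probability lattice — can be sketched as follows. This is a minimal illustration under assumed forms, not the paper's implementation: the function name `speech_t_forward`, the array shapes, the band-shaped pruning rule, and the idea of precomputed per-node log-probabilities (e.g. a Gaussian log-likelihood for the mel frame and a Bernoulli transition probability) are all assumptions; Speech-T's exact loss may differ.

```python
import numpy as np

def speech_t_forward(log_emit, log_trans, band=None):
    """Forward algorithm over a (T+1) x (U+1) alignment lattice.

    log_emit[t, u]  : log-prob of emitting mel frame t while on text token u
                      (continuous prediction; assumed shape (T, U+1))
    log_trans[t, u] : log-prob of transitioning from token u to token u+1 at
                      frame boundary t (assumed shape (T+1, U))
    band            : optional width of a diagonal constraint; lattice nodes
                      far from the t/T = u/U diagonal are pruned.

    Hypothetical names and shapes -- a sketch of a separated forward
    recursion, not the paper's exact formulation.
    """
    T, U = log_emit.shape[0], log_trans.shape[1]
    alpha = np.full((T + 1, U + 1), -np.inf)
    alpha[0, 0] = 0.0
    for t in range(T + 1):
        for u in range(U + 1):
            if t == 0 and u == 0:
                continue
            # Diagonal constraint: skip nodes outside a band around t/T = u/U.
            if band is not None and abs(t * U - u * T) > band * max(T, U):
                continue
            # Emission and transition scores enter the recursion separately,
            # rather than competing inside a single softmax.
            stay = alpha[t - 1, u] + log_emit[t - 1, u] if t > 0 else -np.inf
            move = alpha[t, u - 1] + log_trans[t, u - 1] if u > 0 else -np.inf
            alpha[t, u] = np.logaddexp(stay, move)
    # Total log-probability summed over all monotonic alignments.
    return alpha[T, U]
```

Keeping the transition term out of the continuous emission density avoids having to trade one off against the other inside a single output distribution, and the band restricts the summation to near-diagonal alignments, which is where text-to-speech alignments are expected to lie.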
Audio Samples
All of the audio samples use Parallel WaveGAN (PWG) as the vocoder.
Even the Caslon type when enlarged shows great shortcomings in this respect.
| GT+Vocoder | TransformerTTS | Speech-T |
| --- | --- | --- |
| *(audio)* | *(audio)* | *(audio)* |
There is a grossness in the upper finishings of letters like the c, the a, and so on.
| GT+Vocoder | TransformerTTS | Speech-T |
| --- | --- | --- |
| *(audio)* | *(audio)* | *(audio)* |
In short, it happens to this craft, as to others, that the utilitarian practice, though it professes to avoid ornament.
| GT+Vocoder | TransformerTTS | Speech-T |
| --- | --- | --- |
| *(audio)* | *(audio)* | *(audio)* |
in reading the modern figures the eyes must be strained before the reader can have any reasonable assurance.
| GT+Vocoder | TransformerTTS | Speech-T |
| --- | --- | --- |
| *(audio)* | *(audio)* | *(audio)* |
One of the differences between the fine type and the utilitarian must probably be put down to a misapprehension of a commercial necessity.
| GT+Vocoder | TransformerTTS | Speech-T |
| --- | --- | --- |
| *(audio)* | *(audio)* | *(audio)* |
Robustness Test
You can call me directly at four two five seven zero three seven three four four or my cell four two five four four four seven four seven four or send me a meeting request with all the appropriate information .
| TransformerTTS | Speech-T |
| --- | --- |
| *(audio)* | *(audio)* |
C colon backslash o one two f c p a r t y backslash d e v one two backslash oasys backslash legacy backslash web backslash HELP
| TransformerTTS | Speech-T |
| --- | --- |
| *(audio)* | *(audio)* |
src backslash mapi backslash t n e f d e c dot c dot o l d backslash backslash m o z a r t f one backslash e x five
| TransformerTTS | Speech-T |
| --- | --- |
| *(audio)* | *(audio)* |
copy backslash backslash j o h n f a n four backslash scratch backslash M i c r o s o f t dot S h a r e P o i n t dot
| TransformerTTS | Speech-T |
| --- | --- |
| *(audio)* | *(audio)* |
zero zero zero zero zero zero zero zero two seven nine eight F three forty zero zero zero zero zero six four two eight zero one eight
| TransformerTTS | Speech-T |
| --- | --- |
| *(audio)* | *(audio)* |