Mixed-Phoneme BERT: Improving BERT with Mixed Phoneme and Sup-Phoneme Representations for Text to Speech


ArXiv: arXiv:2203.17190

Authors

  • Guangyan Zhang (The Chinese University of Hong Kong) gyzhang@link.cuhk.edu.hk
  • Kaitao Song (Microsoft Research Asia)
  • Xu Tan (Microsoft Research Asia) xuta@microsoft.com
  • Daxin Tan (The Chinese University of Hong Kong)
  • Yuzi Yan (Tsinghua University)
  • Yanqing Liu (Microsoft Azure Speech)
  • Gang Wang (Microsoft Azure Speech)
  • Wei Zhou (Zhejiang University)
  • Tao Qin (Microsoft Research Asia)
  • Tan Lee (The Chinese University of Hong Kong)
  • Sheng Zhao (Microsoft Azure Speech)

Abstract

Recently, leveraging BERT pre-training to improve the phoneme encoder in text to speech (TTS) has drawn increasing attention. However, previous works apply pre-training with character-based units to enhance the TTS phoneme encoder, which is inconsistent with TTS fine-tuning that takes phonemes as input. Pre-training with only phonemes as input can alleviate the input mismatch but lacks the ability to model rich representations and semantic information due to the limited phoneme vocabulary. In this paper, we propose Mixed-Phoneme BERT, a novel variant of the BERT model that uses mixed phoneme and sup-phoneme representations to enhance the learning capability. Specifically, we merge adjacent phonemes into sup-phonemes and combine the phoneme sequence and the merged sup-phoneme sequence as the model input, which enhances the model's capacity to learn rich contextual representations. Experimental results demonstrate that Mixed-Phoneme BERT significantly improves TTS performance, with a 0.30 CMOS gain over the FastSpeech 2 baseline. Mixed-Phoneme BERT also achieves a $3\times$ inference speedup over the previous TTS pre-trained model PnG BERT, with similar voice quality.
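
The input construction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the merge inventory `SUP_PHONEMES`, the greedy longest-match rule, and the choice to align the two sequences by repeating each sup-phoneme over the phonemes it covers are all assumptions made for illustration. The abstract only states that adjacent phonemes are merged and that both sequences are combined as model input.

```python
from typing import List, Tuple

# Hypothetical inventory of sup-phonemes (assumption, not from the paper):
# each entry is a run of adjacent phonemes treated as one merged unit.
SUP_PHONEMES = {
    ("HH", "AH"),
    ("L", "OW"),
    ("W", "ER", "L", "D"),
}
MAX_SPAN = max(len(unit) for unit in SUP_PHONEMES)


def merge_to_sup_phonemes(phonemes: List[str]) -> List[Tuple[str, ...]]:
    """Greedily merge adjacent phonemes, preferring the longest known unit.

    A phoneme that starts no known unit becomes a sup-phoneme of length one.
    """
    merged = []
    i = 0
    while i < len(phonemes):
        longest = min(MAX_SPAN, len(phonemes) - i)
        for span in range(longest, 0, -1):
            candidate = tuple(phonemes[i:i + span])
            if span == 1 or candidate in SUP_PHONEMES:
                merged.append(candidate)
                i += span
                break
    return merged


def mixed_input(phonemes: List[str]) -> List[Tuple[str, str]]:
    """Pair every phoneme with the sup-phoneme that covers it.

    The sup-phoneme label is repeated once per covered phoneme, so the
    result has the same length as the phoneme sequence (an assumed way of
    combining the two sequences).
    """
    pairs = []
    for unit in merge_to_sup_phonemes(phonemes):
        label = "_".join(unit)
        for phoneme in unit:
            pairs.append((phoneme, label))
    return pairs


if __name__ == "__main__":
    for pair in mixed_input(["HH", "AH", "L", "OW", "W", "ER", "L", "D"]):
        print(pair)
```

In a model, each element of a pair would typically be looked up in its own embedding table and the two embeddings combined per position; that step is omitted here because the abstract does not specify it.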

Part 1. Audio Samples

These examples are randomly sampled from the MOS and CMOS evaluation set for Table 1 in the paper.

Audio Samples

1. The due relation of letter to pictures and other ornament was thoroughly understood by the old printers.
Recording | PWG copy synthesis | FastSpeech 2
FastSpeech 2 with unpre-trained Mixed-Phoneme BERT | FastSpeech 2 with pre-trained Mixed-Phoneme BERT
2. Printing, then, for our purpose, may be considered as the art of making books by means of movable types.
Recording | PWG copy synthesis | FastSpeech 2
FastSpeech 2 with unpre-trained Mixed-Phoneme BERT | FastSpeech 2 with pre-trained Mixed-Phoneme BERT
3. Scientific studies have evaluated surgical masks, but relatively few have looked at whether cloth masks can stop virus transmission.
FastSpeech 2 | FastSpeech 2 with unpre-trained Mixed-Phoneme BERT | FastSpeech 2 with pre-trained Mixed-Phoneme BERT
4. It has carried out three orbital corrections as well as equipment tests and remains in good condition.
FastSpeech 2 | FastSpeech 2 with unpre-trained Mixed-Phoneme BERT | FastSpeech 2 with pre-trained Mixed-Phoneme BERT
5. The dialogue between Europe and China has strengthened and become more balanced these past few years.
FastSpeech 2 | FastSpeech 2 with unpre-trained Mixed-Phoneme BERT | FastSpeech 2 with pre-trained Mixed-Phoneme BERT

Compared with the PnG BERT

These examples are sampled from the CMOS evaluation set for Table 2 in the paper.
1. Sometimes the notion of going out to a movie theater seems like a dream from previous life.
FastSpeech 2 with pre-trained PnG BERT | FastSpeech 2 with pre-trained Mixed-Phoneme BERT
2. The fire or light, when kept in by the eyelids, equalizes the inward motions, and there is rest accompanied by few dreams.
FastSpeech 2 with pre-trained PnG BERT | FastSpeech 2 with pre-trained Mixed-Phoneme BERT
3. For although the Chinese took impressions from wood blocks engraved in relief for centuries before the wood-cutters of the Netherlands.
FastSpeech 2 with pre-trained PnG BERT | FastSpeech 2 with pre-trained Mixed-Phoneme BERT
4. Very much of squalor and discomfort will be endured before the last trinket or the last pretense of pecuniary decency is put away.
FastSpeech 2 with pre-trained PnG BERT | FastSpeech 2 with pre-trained Mixed-Phoneme BERT
5. Printing, then, for our purpose, may be considered as the art of making books by means of movable types.
FastSpeech 2 with pre-trained PnG BERT | FastSpeech 2 with pre-trained Mixed-Phoneme BERT

Part 2. Ablation Studies

The Effectiveness of Mixed Phoneme and Sup-Phoneme Representations

These examples are sampled from the CMOS evaluation set for Table 3 in the paper.
1. In the meantime, renters who have lost their jobs or are sick because of the virus should contact their landlords.
FastSpeech 2 with pre-trained Phoneme BERT | FastSpeech 2 with pre-trained Mixed-Phoneme BERT
2. Sometimes the notion of going out to a movie theater seems like a dream from previous life.
FastSpeech 2 with pre-trained Phoneme BERT | FastSpeech 2 with pre-trained Mixed-Phoneme BERT
3. All this comes at a moment when MLB faces a labor crisis that threatens the future of the entire industry.
FastSpeech 2 with pre-trained Phoneme BERT | FastSpeech 2 with pre-trained Mixed-Phoneme BERT
4. The fire or light, when kept in by the eyelids, equalizes the inward motions, and there is rest accompanied by few dreams.
FastSpeech 2 with pre-trained Phoneme BERT | FastSpeech 2 with pre-trained Mixed-Phoneme BERT
5. I told them I knew no remedy but leaving me behind next time, and could have told them that my looks were suitable to my fortune, though not to a feast.
FastSpeech 2 with pre-trained Phoneme BERT | FastSpeech 2 with pre-trained Mixed-Phoneme BERT

The Effectiveness of Fine-Tuning Strategy

These examples are sampled from the CMOS evaluation set for Table 4 in the paper.
1. As demand is expected to be huge, an expert panel will help screen the proposals for the most promising candidates.
Only phoneme | Only sup-phoneme | FastSpeech 2 with pre-trained Mixed-Phoneme BERT
2. His hosts, on the other hand, wore an uneasy manner that might have been the hallmark of conscious depravity.
Only phoneme | Only sup-phoneme | FastSpeech 2 with pre-trained Mixed-Phoneme BERT
3. Whether much or little consideration had been directed to the result, Miss Chancellor certainly would not have incurred this reproach.
Only phoneme | Only sup-phoneme | FastSpeech 2 with pre-trained Mixed-Phoneme BERT
4. Most of Caxton's own types are of an earlier character, though they also much resemble Flemish or Cologne letter.
Only phoneme | Only sup-phoneme | FastSpeech 2 with pre-trained Mixed-Phoneme BERT
5. I have a remedy against thirst, quite contrary to that which is good against the biting of a mad dog.
Only phoneme | Only sup-phoneme | FastSpeech 2 with pre-trained Mixed-Phoneme BERT