Recently, leveraging BERT pre-training to improve the phoneme encoder in text-to-speech (TTS) has drawn increasing attention. However, existing works apply pre-training with character-based units to enhance the TTS phoneme encoder, which is inconsistent with TTS fine-tuning, where phonemes are the input. Pre-training with only phonemes as input can alleviate this input mismatch, but it lacks the ability to model rich representations and semantic information because the phoneme vocabulary is limited. In this paper, we propose Mixed-Phoneme BERT, a novel variant of the BERT model that uses mixed phoneme and sup-phoneme representations to enhance the learning capability. Specifically, we merge adjacent phonemes into sup-phonemes and combine the phoneme sequence and the merged sup-phoneme sequence as the model input, which enhances the model's capacity to learn rich contextual representations. Experimental results demonstrate that our proposed Mixed-Phoneme BERT significantly improves TTS performance, with a 0.30 CMOS gain over the FastSpeech 2 baseline. Mixed-Phoneme BERT also achieves a $3\times$ inference speedup and similar voice quality compared with the previous TTS pre-trained model PnG BERT.
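The merging step described above can be illustrated with a small sketch. It is not the authors' implementation: the sup-phoneme vocabulary below is hand-written and hypothetical, the greedy longest-match rule is an assumption made for illustration, and the names `merge_to_sup_phonemes` and `mixed_input` are invented here. The abstract does not specify how merges are learned or how the two sequences are combined inside the model, so the sketch only pairs each phoneme with the sup-phoneme that covers it.

```python
from typing import List, Set, Tuple

Phonemes = Tuple[str, ...]


def merge_to_sup_phonemes(
    phonemes: List[str], sup_vocab: Set[Phonemes], max_len: int = 4
) -> List[Phonemes]:
    """Greedily merge adjacent phonemes into sup-phonemes (assumed rule).

    At each position, take the longest run of phonemes found in sup_vocab;
    a single phoneme is always accepted as a fallback.
    """
    spans: List[Phonemes] = []
    i = 0
    while i < len(phonemes):
        # Try the longest candidate first, down to a single phoneme.
        for n in range(min(max_len, len(phonemes) - i), 0, -1):
            cand = tuple(phonemes[i:i + n])
            if n == 1 or cand in sup_vocab:
                spans.append(cand)
                i += n
                break
    return spans


def mixed_input(
    phonemes: List[str], sup_vocab: Set[Phonemes]
) -> List[Tuple[str, str]]:
    """Pair every phoneme with the sup-phoneme covering it.

    The result has the same length as the phoneme sequence, so the two
    views stay aligned position by position.
    """
    pairs: List[Tuple[str, str]] = []
    for span in merge_to_sup_phonemes(phonemes, sup_vocab):
        sup = "_".join(span)
        pairs.extend((p, sup) for p in span)
    return pairs


# Hypothetical vocabulary and input, for illustration only.
SUP_VOCAB = {("HH", "AH"), ("L", "OW"), ("W", "ER", "L", "D")}
PHONEMES = ["HH", "AH", "L", "OW", "W", "ER", "L", "D"]

if __name__ == "__main__":
    for phoneme, sup in mixed_input(PHONEMES, SUP_VOCAB):
        print(phoneme, sup)
```

Because every phoneme keeps its own position, the mixed sequence has the same length as the phoneme sequence, which matches the phoneme-level input that the TTS encoder takes during fine-tuning.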
Part 1. Audio Samples
These examples are randomly sampled from the MOS and CMOS evaluation set for Table 1 in the paper.
Audio Samples
1. The due relation of letter to pictures and other ornament was thoroughly understood by the old printers.
Recording
PWG copy synthesis
FastSpeech 2
FastSpeech 2 with unpre-trained Mixed-Phoneme BERT
FastSpeech 2 with pre-trained Mixed-Phoneme BERT
2. Printing, then, for our purpose, may be considered as the art of making books by means of movable types.
Recording
PWG copy synthesis
FastSpeech 2
FastSpeech 2 with unpre-trained Mixed-Phoneme BERT
FastSpeech 2 with pre-trained Mixed-Phoneme BERT
3. Scientific studies have evaluated surgical masks, but relatively few have looked at whether cloth masks can stop virus transmission.
FastSpeech 2
FastSpeech 2 with unpre-trained Mixed-Phoneme BERT
FastSpeech 2 with pre-trained Mixed-Phoneme BERT
4. It has carried out three orbital corrections as well as equipment tests and remains in good condition.
FastSpeech 2
FastSpeech 2 with unpre-trained Mixed-Phoneme BERT
FastSpeech 2 with pre-trained Mixed-Phoneme BERT
5. The dialogue between Europe and China has strengthened and become more balanced these past few years.
FastSpeech 2
FastSpeech 2 with unpre-trained Mixed-Phoneme BERT
FastSpeech 2 with pre-trained Mixed-Phoneme BERT
Compared with the PnG BERT
These examples are sampled from the CMOS evaluation set for Table 2 in the paper.
1. Sometimes the notion of going out to a movie theater seems like a dream from previous life.
FastSpeech 2 with pre-trained PnG BERT
FastSpeech 2 with pre-trained Mixed-Phoneme BERT
2. The fire or light, when kept in by the eyelids, equalizes the inward motions, and there is rest accompanied by few dreams.
FastSpeech 2 with pre-trained PnG BERT
FastSpeech 2 with pre-trained Mixed-Phoneme BERT
3. For although the Chinese took impressions from wood blocks engraved in relief for centuries before the wood-cutters of the Netherlands.
FastSpeech 2 with pre-trained PnG BERT
FastSpeech 2 with pre-trained Mixed-Phoneme BERT
4. Very much of squalor and discomfort will be endured before the last trinket or the last pretense of pecuniary decency is put away.
FastSpeech 2 with pre-trained PnG BERT
FastSpeech 2 with pre-trained Mixed-Phoneme BERT
5. Printing, then, for our purpose, may be considered as the art of making books by means of movable types.
FastSpeech 2 with pre-trained PnG BERT
FastSpeech 2 with pre-trained Mixed-Phoneme BERT
Part 2. Ablation Studies
The Effectiveness of Mixed Phoneme and Sup-Phoneme Representations
These examples are sampled from the CMOS evaluation set for Table 3 in the paper.
1. In the meantime, renters who have lost their jobs or are sick because of the virus should contact their landlords.
FastSpeech 2 with pre-trained Phoneme BERT
FastSpeech 2 with pre-trained Mixed-Phoneme BERT
2. Sometimes the notion of going out to a movie theater seems like a dream from previous life.
FastSpeech 2 with pre-trained Phoneme BERT
FastSpeech 2 with pre-trained Mixed-Phoneme BERT
3. All this comes at a moment when MLB faces a labor crisis that threatens the future of the entire industry.
FastSpeech 2 with pre-trained Phoneme BERT
FastSpeech 2 with pre-trained Mixed-Phoneme BERT
4. The fire or light, when kept in by the eyelids, equalizes the inward motions, and there is rest accompanied by few dreams.
FastSpeech 2 with pre-trained Phoneme BERT
FastSpeech 2 with pre-trained Mixed-Phoneme BERT
5. I told them I knew no remedy but leaving me behind next time, and could have told them that my looks were suitable to my fortune, though not to a feast.
FastSpeech 2 with pre-trained Phoneme BERT
FastSpeech 2 with pre-trained Mixed-Phoneme BERT
The Effectiveness of Fine-Tuning Strategy
These examples are sampled from the CMOS evaluation set for Table 4 in the paper.
1. As demand is expected to be huge, an expert panel will help screen the proposals for the most promising candidates.
only phoneme
only sup-phoneme
FastSpeech 2 with pre-trained Mixed-Phoneme BERT
2. His hosts, on the other hand, wore an uneasy manner that might have been the hallmark of conscious depravity.
only phoneme
only sup-phoneme
FastSpeech 2 with pre-trained Mixed-Phoneme BERT
3. Whether much or little consideration had been directed to the result, Miss Chancellor certainly would not have incurred this reproach.
only phoneme
only sup-phoneme
FastSpeech 2 with pre-trained Mixed-Phoneme BERT
4. Most of Caxton's own types are of an earlier character, though they also much resemble Flemish or Cologne letter.
only phoneme
only sup-phoneme
FastSpeech 2 with pre-trained Mixed-Phoneme BERT
5. I have a remedy against thirst, quite contrary to that which is good against the biting of a mad dog.