DenoiSpeech: Denoising Text to Speech with Frame-Level Noise Modeling


arXiv:2012.09547 (accepted at ICASSP 2021)

Authors

Abstract

While neural text-to-speech (TTS) models can synthesize natural and intelligible speech, they usually require high-quality speech data, which is costly to collect. In many scenarios, only noisy speech from a target speaker is available, which makes it hard to train a TTS model for that speaker. Previous works usually address this challenge in one of two ways: 1) training the TTS model on speech that has been denoised with an enhancement model; 2) taking a single noise embedding as input when training on noisy speech. However, these approaches usually cannot handle complicated noise, such as noise that varies strongly over time. In this paper, we develop DenoiSpeech, a TTS system that can synthesize clean speech for a speaker for whom only noisy speech is available. We carefully design a noise condition module that is trained jointly with the TTS model and captures fine-grained, frame-level noise information. Experimental results show that DenoiSpeech can generate clean speech in complicated noisy conditions.

Audio Samples

All audio samples use Parallel WaveGAN (PWG) as the vocoder.

Artificial Noisy Speech Corpus

We use the VCTK corpus as the source of clean speech and the Nonspeech100 corpus as the source of background noise.
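
An artificial noisy corpus of this kind is typically built by adding a noise clip to a clean utterance at a chosen signal-to-noise ratio (SNR). The sketch below illustrates that general technique only. This page does not state the mixing procedure or the SNR values used, so the function names, the 5 dB target, and the toy signals are all assumptions made for illustration.

```python
import math
import random


def mean_power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(s * s for s in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaled so the mixture has the requested SNR in dB."""
    # Loop the noise clip if it is shorter than the speech, then trim to length.
    repeats = math.ceil(len(speech) / len(noise))
    noise = (list(noise) * repeats)[: len(speech)]

    speech_power = mean_power(speech)
    noise_power = mean_power(noise)
    if noise_power == 0.0:
        raise ValueError("noise clip is silent")

    # SNR_dB = 10 * log10(P_speech / P_scaled_noise), solved for the noise gain.
    target_noise_power = speech_power / (10.0 ** (snr_db / 10.0))
    gain = math.sqrt(target_noise_power / noise_power)
    return [s + gain * n for s, n in zip(speech, noise)]


def measured_snr_db(speech, noisy):
    """SNR of `noisy` relative to the clean `speech` it was built from."""
    residual = [y - s for s, y in zip(speech, noisy)]
    return 10.0 * math.log10(mean_power(speech) / mean_power(residual))


# Toy stand-ins for a clean utterance and a noise clip (assumed, not real data).
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
noise = [rng.uniform(-1.0, 1.0) for _ in range(4000)]
noisy = mix_at_snr(speech, noise, snr_db=5.0)
```

Because the clean signal is kept alongside the mixture, the same pair can serve as the Clean GT and Noisy GT references in the comparisons that follow.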

Comparison with Other Models

Did he trip?

Clean GT, Noisy GT, Enhancement-Based, Augment-Adversarial, DenoiSpeech

Celtic were the losers then.

Clean GT, Noisy GT, Enhancement-Based, Augment-Adversarial, DenoiSpeech

I saw the story, but that amount wouldn’t even pay my commission.

Clean GT, Noisy GT, Enhancement-Based, Augment-Adversarial, DenoiSpeech

Basically we lost the game because we were outplayed.

Clean GT, Noisy GT, Enhancement-Based, Augment-Adversarial, DenoiSpeech

Ablation Study

Did he trip?

DenoiSpeech, Without Noise Condition, Utterance-Level Noise Condition, Fixed Noise Extractor, Without Adversarial CTC Module

Basically we lost the game because we were outplayed.

DenoiSpeech, Without Noise Condition, Utterance-Level Noise Condition, Fixed Noise Extractor, Without Adversarial CTC Module

Short-Time Noise Case

Basically we lost the game because we were outplayed.

Clean GT, Noisy GT, Enhancement-Based, Augment-Adversarial, DenoiSpeech

Real-World Noisy Speech Corpus

I want to sit down first.

Noisy GT, DenoiSpeech, Augment-Adversarial, Enhancement-Based

Customers swarmed into the store.

Noisy GT, DenoiSpeech, Augment-Adversarial, Enhancement-Based

Arizona already had a law in place that prohibited state-funded abortions.

Noisy GT, DenoiSpeech, Augment-Adversarial, Enhancement-Based

Clinton poured cold water on such action.

Noisy GT, DenoiSpeech, Augment-Adversarial, Enhancement-Based

Visualization of Extracted Noise

Here are some comparisons between the ground-truth noise and the noise extracted by our noise extractor.

[Four figure pairs, each showing GT Noise alongside Extracted Noise]

SMOS (Similarity MOS)

We randomly selected 12 utterances from the test set, all spoken by a single speaker, and distributed them to 10 participants (native speakers). Each participant received the same 12 utterances in a different order and gave each one an SMOS score for its similarity to the Clean GT.
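
The protocol above can be sketched as follows: each participant gets the same utterances in an independently shuffled order, and the collected ratings are averaged. This is a minimal illustration rather than the evaluation script used for the paper. The 1 to 5 rating scale, the normal-approximation 95% confidence interval, the helper names, and the placeholder ratings are all assumptions.

```python
import math
import random
import statistics


def presentation_orders(utterance_ids, num_participants, seed=0):
    """Give every participant the same utterances in an independently shuffled order."""
    rng = random.Random(seed)
    orders = []
    for _ in range(num_participants):
        order = list(utterance_ids)
        rng.shuffle(order)
        orders.append(order)
    return orders


def mean_with_ci95(scores):
    """Mean score and half-width of a normal-approximation 95% confidence interval."""
    mean = statistics.mean(scores)
    half_width = 1.96 * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, half_width


utterances = [f"utt_{i:02d}" for i in range(12)]  # 12 test utterances
orders = presentation_orders(utterances, num_participants=10)

# Placeholder ratings on an assumed 1-5 scale, generated only to exercise the
# code; they are NOT results from the paper.
rng = random.Random(1)
ratings = [rng.choice([3, 4, 5]) for order in orders for _ in order]
smos, ci = mean_with_ci95(ratings)
```

With 10 participants and 12 utterances, each system's SMOS is an average over 120 ratings.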
