PriorGrad: Improving Conditional Denoising Diffusion Models with Data-Dependent Adaptive Prior

arXiv: 2106.06406

Abstract

Denoising diffusion probabilistic models have recently been proposed to generate high-quality samples by estimating the gradient of the data density. The framework assumes the prior noise to be a standard Gaussian distribution, whereas the corresponding data distribution may be far more complex, which potentially introduces inefficiency in denoising the prior noise into a data sample because of the discrepancy between the data and the prior. In this paper, we propose PriorGrad to improve the efficiency of the conditional diffusion model (for example, a vocoder using a mel-spectrogram as the condition) by applying an adaptive prior derived from data statistics based on the conditional information. We formulate the training and sampling procedures of PriorGrad and demonstrate the advantages of an adaptive prior through a theoretical analysis. Focusing on the audio domain, we consider recently proposed diffusion-based audio generative models in both the spectral and time domains and show that PriorGrad achieves faster convergence, leading to data and parameter efficiency and improved quality, thereby demonstrating the efficiency of a data-dependent adaptive prior.
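To make the adaptive-prior idea concrete, below is a minimal NumPy sketch of how such a prior could be constructed for the vocoder case, where the diagonal prior variance is derived from the frame-level energy of the conditioning mel-spectrogram. The energy normalization, the variance floor, and all function names here are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal sketch (illustrative assumptions, not the paper's exact recipe):
# build a data-dependent diagonal Gaussian prior for a diffusion vocoder
# from the conditioning mel-spectrogram.

import numpy as np

def adaptive_prior_std(mel, hop_length=256, floor=0.1):
    """Map a log-mel-spectrogram [n_mels, n_frames] to a per-sample std vector."""
    frame_energy = np.exp(mel).mean(axis=0)         # per-frame energy, [n_frames]
    frame_energy = frame_energy / frame_energy.max()
    frame_energy = np.maximum(frame_energy, floor)  # keep the variance bounded away from 0
    # Upsample frame-level std to waveform resolution (hop_length samples per frame).
    return np.repeat(np.sqrt(frame_energy), hop_length)

def sample_adaptive_prior(mel, hop_length=256, seed=0):
    """Draw x_T ~ N(0, diag(sigma^2)) instead of the standard N(0, I)."""
    sigma = adaptive_prior_std(mel, hop_length)
    return sigma * np.random.default_rng(seed).standard_normal(sigma.shape)

# Training then weights the usual noise-prediction L2 loss by Sigma^{-1}
# (i.e., || eps - eps_theta ||^2 measured under the adaptive covariance),
# and reverse diffusion starts from sample_adaptive_prior(mel) rather than N(0, I).
```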

Contents

PriorGrad Vocoder Samples
1. Main samples
2. Samples under different training steps
3. Ablation studies
4. Low-data experiments
PriorGrad Acoustic Model Samples
1. Main samples

Audio Samples in the Paper

PriorGrad Vocoder Samples

Main samples

a device which deceives nobody, and makes a book very unpleasant to read.

Audio: GT | GT + WaveGrad (1M) | GT + PriorGrad (300K) | FastSpeech 2 + WaveGrad (1M) | FastSpeech 2 + PriorGrad (300K)

a measure which must greatly tend to discourage attempts to escape.

Audio: GT | GT + WaveGrad (1M) | GT + PriorGrad (300K) | FastSpeech 2 + WaveGrad (1M) | FastSpeech 2 + PriorGrad (300K)

They are both drawn at once by a windlass, and the unhappy culprits remain suspended.

Audio: GT | GT + WaveGrad (1M) | GT + PriorGrad (300K) | FastSpeech 2 + WaveGrad (1M) | FastSpeech 2 + PriorGrad (300K)

the crowd began to congregate in and about the Old Bailey.

Audio: GT | GT + WaveGrad (1M) | GT + PriorGrad (300K) | FastSpeech 2 + WaveGrad (1M) | FastSpeech 2 + PriorGrad (300K)

Samples under different training steps

Audio: WaveGrad (100K) | WaveGrad (500K) | WaveGrad (1M)
Audio: PriorGrad (100K) | PriorGrad (500K) | PriorGrad (1M)

Ablation studies

Audio: PriorGrad (1M) | PriorGrad - cycle (1M) | WaveGrad + cycle (1M)

Low-data experiments

Audio: WaveGrad (1M) | PriorGrad (1M)

PriorGrad Acoustic Model Samples

Main samples

in being comparatively modern.

Audio: GT (PWG) | Baseline (Small, 300K) | PriorGrad (Small, 60K) | Baseline (Large, 300K) | PriorGrad (Large, 60K)

the invention of movable metal letters in the middle of the fifteenth century may justly be considered as the invention of the art of printing.

Audio: GT (PWG) | Baseline (Small, 300K) | PriorGrad (Small, 60K) | Baseline (Large, 300K) | PriorGrad (Large, 60K)

And it is worth mention in passing that, as an example of fine typography,

Audio: GT (PWG) | Baseline (Small, 300K) | PriorGrad (Small, 60K) | Baseline (Large, 300K) | PriorGrad (Large, 60K)

Printing, then, for our purpose, may be considered as the art of making books by means of movable types.

Audio: GT (PWG) | Baseline (Small, 300K) | PriorGrad (Small, 60K) | Baseline (Large, 300K) | PriorGrad (Large, 60K)
