PriorGrad: Improving Conditional Denoising Diffusion Models with Data-Dependent Adaptive Prior

arXiv: 2106.06406 (accepted at ICLR 2022)

Authors


Abstract

Denoising diffusion probabilistic models have recently been proposed to generate high-quality samples by estimating the gradient of the data density. The framework defines the prior noise as a standard Gaussian distribution, whereas the corresponding data distribution may be far more complicated; this discrepancy between the data and the prior potentially makes denoising the prior noise into a data sample inefficient. In this paper, we propose PriorGrad to improve the efficiency of conditional diffusion models for speech synthesis (for example, a vocoder using a mel-spectrogram as the condition) by applying an adaptive prior derived from data statistics based on the conditional information. We formulate the training and sampling procedures of PriorGrad and demonstrate the advantages of an adaptive prior through a theoretical analysis. Focusing on the speech synthesis domain, we consider recently proposed diffusion-based speech generative models in both the spectral and time domains, and show that PriorGrad achieves faster convergence and inference with superior performance, leading to improved perceptual quality and robustness to a smaller network capacity, thereby demonstrating the efficiency of a data-dependent adaptive prior.
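To make the core idea concrete, here is a minimal numpy sketch of diffusion with a data-dependent Gaussian prior, in the spirit of the abstract above. The helper names (`adaptive_prior_std`, `priorgrad_loss`) and the exact mapping from mel-spectrogram energy to the prior standard deviation are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def adaptive_prior_std(mel, floor=0.1):
    # Hypothetical mapping from conditional information (mel-spectrogram)
    # to a per-frame prior standard deviation. PriorGrad derives the prior
    # covariance from data statistics; the normalization here is a stand-in.
    energy = np.exp(mel).mean(axis=0)        # per-frame energy, shape (frames,)
    std = energy / energy.max()              # scale to (0, 1]
    return np.maximum(std, floor)            # floor avoids a degenerate prior

def diffuse(x0, t, betas, sigma, rng):
    # Forward diffusion q(x_t | x_0), but with noise drawn from the adaptive
    # prior N(0, diag(sigma^2)) instead of the standard N(0, I).
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)[t]
    eps = rng.standard_normal(x0.shape) * sigma
    xt = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    return xt, eps

def priorgrad_loss(eps_pred, eps, sigma):
    # Noise-prediction loss weighted by the inverse prior covariance
    # (a diagonal Mahalanobis distance), matching the modified objective.
    return np.mean(((eps_pred - eps) / sigma) ** 2)
```

Because the prior already matches the coarse structure of the data (e.g. silent frames get low-variance noise), the network starts denoising from a distribution closer to the target, which is the source of the faster convergence the abstract describes.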

Contents

PriorGrad Vocoder Samples
  1. Main samples
  2. Parameter efficiency experiments
PriorGrad Acoustic Model Samples
  1. Main samples
  2. Comparison to trainable diffusion prior
Comparison to state-of-the-art: vocoder
Comparison to state-of-the-art: acoustic model

Audio Samples in the Paper

PriorGrad Vocoder Samples

Main samples

a device which deceives nobody, and makes a book very unpleasant to read.

GT | DiffWave (500k) | PriorGrad (500k) | DiffWave (1M) | PriorGrad (1M)

a measure which must greatly tend to discourage attempts to escape.

GT | DiffWave (500k) | PriorGrad (500k) | DiffWave (1M) | PriorGrad (1M)

They are both drawn at once by a windlass, and the unhappy culprits remain suspended.

GT | DiffWave (500k) | PriorGrad (500k) | DiffWave (1M) | PriorGrad (1M)

the crowd began to congregate in and about the Old Bailey.

GT | DiffWave (500k) | PriorGrad (500k) | DiffWave (1M) | PriorGrad (1M)

Parameter efficiency experiments

DiffWave (1M) | PriorGrad (1M) | DiffWave_small (1M) | PriorGrad_small (1M)

PriorGrad Acoustic Model Samples

Note: the models are trained to be compatible with the pretrained Parallel WaveGAN (PWG) for this test.

Main samples

in being comparatively modern.

GT (PWG) | Baseline (10M, 60K) | PriorGrad (10M, 60K) | Baseline (10M, 300K) | PriorGrad (10M, 300K) | Baseline (3M, 60K) | PriorGrad (3M, 60K) | Baseline (3M, 300K) | PriorGrad (3M, 300K)

the invention of movable metal letters in the middle of the fifteenth century may justly be considered as the invention of the art of printing.

GT (PWG) | Baseline (10M, 60K) | PriorGrad (10M, 60K) | Baseline (10M, 300K) | PriorGrad (10M, 300K) | Baseline (3M, 60K) | PriorGrad (3M, 60K) | Baseline (3M, 300K) | PriorGrad (3M, 300K)

And it is worth mention in passing that, as an example of fine typography,

GT (PWG) | Baseline (10M, 60K) | PriorGrad (10M, 60K) | Baseline (10M, 300K) | PriorGrad (10M, 300K) | Baseline (3M, 60K) | PriorGrad (3M, 60K) | Baseline (3M, 300K) | PriorGrad (3M, 300K)

Printing, then, for our purpose, may be considered as the art of making books by means of movable types.

GT (PWG) | Baseline (10M, 60K) | PriorGrad (10M, 60K) | Baseline (10M, 300K) | PriorGrad (10M, 300K) | Baseline (3M, 60K) | PriorGrad (3M, 60K) | Baseline (3M, 300K) | PriorGrad (3M, 300K)

Comparison to trainable diffusion prior

in being comparatively modern.

Baseline (10M, 300K) | PriorGrad (10M, 300K) | TrainablePrior (10M, 300K)

the invention of movable metal letters in the middle of the fifteenth century may justly be considered as the invention of the art of printing.

Baseline (10M, 300K) | PriorGrad (10M, 300K) | TrainablePrior (10M, 300K)

And it is worth mention in passing that, as an example of fine typography,

Baseline (10M, 300K) | PriorGrad (10M, 300K) | TrainablePrior (10M, 300K)

Printing, then, for our purpose, may be considered as the art of making books by means of movable types.

Baseline (10M, 300K) | PriorGrad (10M, 300K) | TrainablePrior (10M, 300K)

Comparison to state-of-the-art: vocoder

WaveGlow | WaveFlow | HiFi-GAN | DiffWave (T=6) | PriorGrad (T=6) | DiffWave (T=12) | PriorGrad (T=12) | DiffWave (T=50) | PriorGrad (T=50)

Comparison to state-of-the-art: acoustic model

Note: the models are trained to be compatible with the pretrained HiFi-GAN for this test.

GT (HiFi-GAN) | FastSpeech 2 | Glow-TTS | Grad-TTS (T=2) | Grad-TTS (T=10) | Baseline (T=2) | PriorGrad (T=2) | Baseline (T=6) | PriorGrad (T=6) | Baseline (T=12) | PriorGrad (T=12)
