ArXiv: arXiv:2106.06406
(Accepted by ICLR 2022)
Authors
^ Corresponding authors.
Abstract
Denoising diffusion probabilistic models have been recently proposed to generate high-quality samples
by estimating the gradient of the data density. The framework defines the prior noise as a standard
Gaussian distribution, whereas the corresponding data distribution may be more complicated than the
standard Gaussian distribution, which potentially introduces inefficiency in denoising the prior
noise into the data sample because of the discrepancy between the data and the prior. In this paper,
we propose PriorGrad to improve the efficiency of the conditional diffusion model for speech
synthesis (for example, a vocoder using a mel-spectrogram as the condition) by applying an adaptive
prior derived from the data statistics based on the conditional information. We formulate the
training and sampling procedures of PriorGrad and demonstrate the advantages of an adaptive prior
through a theoretical analysis. Focusing on the speech synthesis domain, we consider the recently
proposed diffusion-based speech generative models based on both the spectral and time domains and
show that PriorGrad achieves faster convergence and inference with superior performance, leading to
an improved perceptual quality and robustness to a smaller network capacity, and thereby
demonstrating the efficiency of a data-dependent adaptive prior.
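As a rough illustration of what a data-dependent prior can look like for a mel-conditioned vocoder, the sketch below derives a per-sample standard deviation from the conditioning mel-spectrogram and uses it in place of the standard Gaussian. This page does not state the exact statistic, so the frame-energy statistic, the min-max normalisation, the floor value of 0.1, the hop length, and all function names here are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def energy_prior_std(mel, hop_length, floor=0.1):
    """Per-sample prior std from a (n_mels, n_frames) log-mel (assumed statistic)."""
    # Frame-level energy proxy: root of the summed linear-scale mel values per frame.
    energy = np.sqrt(np.exp(mel).sum(axis=0))
    # Min-max normalise to [0, 1]; the small constant avoids division by zero.
    span = energy.max() - energy.min()
    norm = (energy - energy.min()) / (span + 1e-8)
    # Clip from below so near-silent frames still receive some noise (assumed floor).
    std_frames = np.clip(norm, floor, 1.0)
    # Repeat each frame value hop_length times to reach waveform resolution.
    return np.repeat(std_frames, hop_length)

def sample_prior(std, rng):
    """Draw x_T ~ N(0, diag(std**2)) instead of N(0, I)."""
    return std * rng.standard_normal(std.shape)

def scaled_noise_loss(eps, eps_hat, std):
    """Squared error weighted by the inverse prior variance, averaged over samples."""
    return float(np.mean(((eps - eps_hat) / std) ** 2))

rng = np.random.default_rng(0)
mel = rng.normal(size=(80, 10))              # toy log-mel: 80 bins, 10 frames
std = energy_prior_std(mel, hop_length=256)  # one std value per waveform sample
x_T = sample_prior(std, rng)                 # starting point for reverse diffusion
```

With a prior of this kind, low-energy regions start the reverse process closer to silence, which is one intuition for why fewer denoising steps might suffice than with a standard Gaussian prior.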
Contents
PriorGrad Vocoder Samples
1. Main samples
2. Parameter efficiency experiments
PriorGrad Acoustic Model Samples
1. Main samples
2. Comparison to trainable diffusion prior
Comparison to state-of-the-art: vocoder
Comparison to state-of-the-art: acoustic model
Audio Samples in the Paper
PriorGrad Vocoder Samples
Main samples
a device which deceives nobody, and makes a book very unpleasant to read.
GT
DiffWave (500K)
PriorGrad (500K)
DiffWave (1M)
PriorGrad (1M)
a measure which must greatly tend to discourage attempts to escape.
GT
DiffWave (500K)
PriorGrad (500K)
DiffWave (1M)
PriorGrad (1M)
They are both drawn at once by a windlass, and the unhappy culprits remain suspended.
GT
DiffWave (500K)
PriorGrad (500K)
DiffWave (1M)
PriorGrad (1M)
the crowd began to congregate in and about the Old Bailey.
GT
DiffWave (500K)
PriorGrad (500K)
DiffWave (1M)
PriorGrad (1M)
Parameter efficiency experiments
DiffWave (1M)
PriorGrad (1M)
DiffWave_small (1M)
PriorGrad_small (1M)
PriorGrad Acoustic Model Samples
Note: the models are trained to be compatible with the pretrained Parallel WaveGAN (PWG) for this test.
Main samples
in being comparatively modern.
GT (PWG)
Baseline (10M, 60K)
PriorGrad (10M, 60K)
Baseline (10M, 300K)
PriorGrad (10M, 300K)
Baseline (3M, 60K)
PriorGrad (3M, 60K)
Baseline (3M, 300K)
PriorGrad (3M, 300K)
the invention of movable metal letters in the middle of the fifteenth century may justly be
considered as the invention of the art of printing.
GT (PWG)
Baseline (10M, 60K)
PriorGrad (10M, 60K)
Baseline (10M, 300K)
PriorGrad (10M, 300K)
Baseline (3M, 60K)
PriorGrad (3M, 60K)
Baseline (3M, 300K)
PriorGrad (3M, 300K)
And it is worth mention in passing that, as an example of fine typography,
GT (PWG)
Baseline (10M, 60K)
PriorGrad (10M, 60K)
Baseline (10M, 300K)
PriorGrad (10M, 300K)
Baseline (3M, 60K)
PriorGrad (3M, 60K)
Baseline (3M, 300K)
PriorGrad (3M, 300K)
Printing, then, for our purpose, may be considered as the art of making books by means of movable
types.
GT (PWG)
Baseline (10M, 60K)
PriorGrad (10M, 60K)
Baseline (10M, 300K)
PriorGrad (10M, 300K)
Baseline (3M, 60K)
PriorGrad (3M, 60K)
Baseline (3M, 300K)
PriorGrad (3M, 300K)
Comparison to trainable diffusion prior
in being comparatively modern.
Baseline (10M, 300K)
PriorGrad (10M, 300K)
TrainablePrior (10M, 300K)
the invention of movable metal letters in the middle of the fifteenth century may justly be
considered as the invention of the art of printing.
Baseline (10M, 300K)
PriorGrad (10M, 300K)
TrainablePrior (10M, 300K)
And it is worth mention in passing that, as an example of fine typography,
Baseline (10M, 300K)
PriorGrad (10M, 300K)
TrainablePrior (10M, 300K)
Printing, then, for our purpose, may be considered as the art of making books by means of movable
types.
Baseline (10M, 300K)
PriorGrad (10M, 300K)
TrainablePrior (10M, 300K)
Comparison to state-of-the-art: vocoder
WaveGlow
WaveFlow
HiFi-GAN
DiffWave (T=6)
PriorGrad (T=6)
DiffWave (T=12)
PriorGrad (T=12)
DiffWave (T=50)
PriorGrad (T=50)
Comparison to state-of-the-art: acoustic model
Note: the models are trained to be compatible with the pretrained HiFi-GAN for this test.
GT (HiFi-GAN)
FastSpeech 2
Glow-TTS
Grad-TTS (T=2)
Grad-TTS (T=10)
Baseline (T=2)
PriorGrad (T=2)
Baseline (T=6)
PriorGrad (T=6)
Baseline (T=12)
PriorGrad (T=12)