Diffusion probabilistic models - Applications to waveforms
Author : Philippe Esling (esling@ircam.fr)
This third notebook continues the exploration of diffusion probabilistic models [1] in our four-notebook series.
- Score matching and Langevin dynamics
- Diffusion probabilistic models and denoising
- Applications to waveforms with WaveGrad
- Implicit models to accelerate inference
Here, we quickly recall the basics of score matching [3], Langevin dynamics, and diffusion probabilistic models [1] seen in the previous notebooks, in order to discuss applications to waveform data through the WaveGrad model [5].
Theoretical bases - quick recap
In this section, we provide a quick recap on score matching and diffusion probabilistic models, still based on the Swiss roll dataset.
Score matching
Score matching aims to learn the gradients (termed score) of $\log p(\mathbf{x})$ with respect to $\mathbf{x}$, instead of $p(\mathbf{x})$ directly. Therefore, we seek a model $s_\theta(\mathbf{x})$ to approximate
$$s_\theta(\mathbf{x}) \approx \nabla_{\mathbf{x}} \log p(\mathbf{x})$$
Given this approximation, we can use Langevin dynamics to produce true samples from a density $p(\mathbf{x})$, by relying only on $\nabla_{\mathbf{x}} \log p(\mathbf{x})$
$$\mathbf{x}_t = \mathbf{x}_{t-1} + \frac{\epsilon}{2} \nabla_{\mathbf{x}} \log p(\mathbf{x}_{t-1}) + \sqrt{\epsilon}\, \mathbf{z}_t$$
where $\mathbf{z}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and, under $\epsilon \rightarrow 0$ and $t \rightarrow \infty$, $\mathbf{x}_t$ converges to an exact sample from $p(\mathbf{x})$. Therefore, the key idea behind this score-based generative modeling approach is that we can easily train a generative model, without defining any functional form on the distribution (no need to rely on an approximation, or to complexify a simple distribution).
Implementation
We have seen that optimizing $s_\theta(\mathbf{x})$ with an MSE objective was equivalent to optimizing
$$\mathbb{E}_{p(\mathbf{x})}\left[ \mathrm{tr}\left(\nabla_{\mathbf{x}} s_\theta(\mathbf{x})\right) + \frac{1}{2} \left\| s_\theta(\mathbf{x}) \right\|_2^2 \right]$$
where $\nabla_{\mathbf{x}} s_\theta(\mathbf{x})$ denotes the Jacobian of $s_\theta(\mathbf{x})$ with respect to $\mathbf{x}$. The complexity of computing this Jacobian can be removed with the _denoising score matching_ objective [3], by corrupting the inputs with a noise distribution $q_\sigma(\tilde{\mathbf{x}} \mid \mathbf{x})$ and minimizing the following objective
$$\mathbb{E}_{q_\sigma(\tilde{\mathbf{x}} \mid \mathbf{x}) p(\mathbf{x})}\left[ \frac{1}{2} \left\| s_\theta(\tilde{\mathbf{x}}) - \nabla_{\tilde{\mathbf{x}}} \log q_\sigma(\tilde{\mathbf{x}} \mid \mathbf{x}) \right\|_2^2 \right]$$
As shown in [3], [8], if we choose the noise distribution to be $q_\sigma(\tilde{\mathbf{x}} \mid \mathbf{x}) = \mathcal{N}(\tilde{\mathbf{x}}; \mathbf{x}, \sigma^2 \mathbf{I})$, then we have $\nabla_{\tilde{\mathbf{x}}} \log q_\sigma(\tilde{\mathbf{x}} \mid \mathbf{x}) = \frac{\mathbf{x} - \tilde{\mathbf{x}}}{\sigma^2}$. Therefore, the denoising score matching loss simply becomes
$$\mathbb{E}_{q_\sigma(\tilde{\mathbf{x}} \mid \mathbf{x}) p(\mathbf{x})}\left[ \frac{1}{2} \left\| s_\theta(\tilde{\mathbf{x}}) - \frac{\mathbf{x} - \tilde{\mathbf{x}}}{\sigma^2} \right\|_2^2 \right]$$
We can implement the denoising score matching loss as follows
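A minimal sketch of this loss in PyTorch could look as follows; `model` stands for any score network $s_\theta$, and `sigma` is the corruption standard deviation $\sigma$ (both names and defaults are illustrative).

```python
import torch

def denoising_score_matching_loss(model, x, sigma=0.1):
    # Corrupt the clean samples with Gaussian noise of standard deviation sigma
    noise = torch.randn_like(x) * sigma
    x_tilde = x + noise
    # Score of the Gaussian corruption kernel: (x - x_tilde) / sigma^2 = -noise / sigma^2
    target = -noise / (sigma ** 2)
    # Predicted score at the corrupted points
    score = model(x_tilde)
    # Quadratic difference between predicted and target scores
    return 0.5 * ((score - target) ** 2).sum(dim=1).mean()
```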
Regarding optimization, we can perform a very simple implementation of this process, by defining $s_\theta$ as any type of neural network. We can perform a minimalistic implementation and observe that our model has learned to represent $\nabla_{\mathbf{x}} \log p(\mathbf{x})$ by plotting the output value across the input space.
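A minimalistic sketch of such an optimization is given below, assuming a `swiss_roll` tensor of shape (N, 2) holding the dataset samples (the name is a placeholder, and the plotting of the learned score field is omitted here).

```python
import torch
import torch.nn as nn

# A small MLP mapping 2-D points of the Swiss roll to 2-D score vectors
score_model = nn.Sequential(
    nn.Linear(2, 128), nn.Softplus(),
    nn.Linear(128, 128), nn.Softplus(),
    nn.Linear(128, 2))

optimizer = torch.optim.Adam(score_model.parameters(), lr=1e-3)
for it in range(5000):
    # Random mini-batch from the (assumed) swiss_roll tensor
    x = swiss_roll[torch.randint(0, swiss_roll.shape[0], (128,))]
    loss = denoising_score_matching_loss(score_model, x)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```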
Sampling
Based on our trained model $s_\theta(\mathbf{x})$, and using Langevin dynamics, we can produce true samples from a density $p(\mathbf{x})$, by relying only on $\nabla_{\mathbf{x}} \log p(\mathbf{x})$, as shown in the following snippet.
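A possible sketch of this sampling procedure is given below; the step size `eps`, the number of steps, and the standard normal initialization are illustrative choices.

```python
import torch

@torch.no_grad()
def langevin_sampling(score_model, n_samples=1000, n_steps=500, eps=1e-3):
    # Start from a broad initial distribution and follow the estimated score,
    # adding Gaussian noise scaled by sqrt(eps) at each step
    x = torch.randn(n_samples, 2)
    for _ in range(n_steps):
        z = torch.randn_like(x)
        x = x + (eps / 2) * score_model(x) + (eps ** 0.5) * z
    return x
```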
Diffusion probabilistic models
Diffusion probabilistic models [1] are based on a series of latent variables $\mathbf{x}_1, \dots, \mathbf{x}_T$ that have the same dimensionality as a given input data, which is labeled as $\mathbf{x}_0$. The model relies on two reciprocal processes, defined as Markov chains over these random variables. One (fixed) process gradually adds noise to the input data (the diffusion or forward process), destroying the signal up to full noise. Conversely, the reverse (parametric) process must learn how to denoise local perturbations in order to invert this diffusion process (transforming random noise into a high-quality waveform). Hence, learning involves estimating a large number of small perturbations, which is more tractable than trying to directly estimate the full distribution with a single potential function, as exemplified in the following figure.
Therefore, we need to define the behavior of these two processes, and train the reverse process (modeled as conditional Gaussians) using variational inference, which allows for neural network parameterization and tractable estimation.
Forward process
In the forward process, the data distribution $q(\mathbf{x}_0)$ is gradually converted into an analytically tractable distribution $\pi(\mathbf{x}_T)$, by repeated application of a diffusion kernel, so that the complete distribution (called the diffusion process) is defined as
$$q(\mathbf{x}_{1:T} \mid \mathbf{x}_0) = \prod_{t=1}^{T} q(\mathbf{x}_t \mid \mathbf{x}_{t-1})$$
This diffusion kernel is set to gradually inject Gaussian noise, with a fixed variance schedule $\beta_1, \dots, \beta_T$, such that
$$q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}\left(\mathbf{x}_t; \sqrt{1 - \beta_t}\, \mathbf{x}_{t-1}, \beta_t \mathbf{I}\right)$$
One important property of this forward process, noted by Ho et al. [1], is that we can perform sampling at any arbitrary timestep $t$ in closed form, such that
$$q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}\left(\mathbf{x}_t; \sqrt{\bar{\alpha}_t}\, \mathbf{x}_0, (1 - \bar{\alpha}_t) \mathbf{I}\right)$$
where $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$. Therefore, we can create a very efficient diffusion sampling function. Note that this depends on the given variance schedule of $\beta_t$ that we compute prior to the function.
Implementing the forward process
This allows for a very efficient implementation of the forward process, where we can directly sample at any given timestep, as shown in the following code.
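A minimal sketch is shown below; the linear schedule, the number of steps `T`, and the tensor names (`betas`, `alphas`, `alphas_bar`, `q_sample`) are illustrative choices reused in the following cells.

```python
import torch

# Linear variance schedule (assumed), with T diffusion steps
T = 100
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alphas_bar = torch.cumprod(alphas, dim=0)

def q_sample(x_0, t, noise=None):
    # Directly sample x_t ~ q(x_t | x_0) for a batch of (integer) timesteps t
    if noise is None:
        noise = torch.randn_like(x_0)
    a_bar = alphas_bar[t].unsqueeze(-1)
    return torch.sqrt(a_bar) * x_0 + torch.sqrt(1.0 - a_bar) * noise
```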
Reverse process
The generative distribution $p_\theta$ that we aim to learn will be trained to perform the reverse trajectory, starting from Gaussian noise and gradually removing local perturbations. Therefore, the reverse process starts with our given tractable distribution $p(\mathbf{x}_T) = \mathcal{N}(\mathbf{x}_T; \mathbf{0}, \mathbf{I})$ and is described as
$$p_\theta(\mathbf{x}_{0:T}) = p(\mathbf{x}_T) \prod_{t=1}^{T} p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$$
Each of the transitions in this process can simply be defined as conditional Gaussians (note: this is reminiscent of the definition of VAEs). Therefore, during learning, only the mean and covariance of a Gaussian diffusion kernel need to be trained
$$p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}\left(\mathbf{x}_{t-1}; \mu_\theta(\mathbf{x}_t, t), \Sigma_\theta(\mathbf{x}_t, t)\right)$$
The two functions defining the mean and covariance can be parametrized by deep neural networks. Note also that these functions are parametrized by the timestep $t$, which means that a single model can be used for all time steps.
Here, we show a naive implementation of this process, where we have a given model to infer the mean and the (log) variance. Note that this model is shared across all time steps, but conditioned on the given time step.
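A naive sketch of such a model is given below, assuming a timestep embedding as conditioning; the class name `ReverseModel` and the architecture are illustrative.

```python
import torch
import torch.nn as nn

class ReverseModel(nn.Module):
    # Naive reverse model: predicts the mean and log-variance of p(x_{t-1} | x_t),
    # conditioned on the (integer) timestep through a learned embedding
    def __init__(self, dim=2, hidden=128, n_steps=100):
        super().__init__()
        self.t_embed = nn.Embedding(n_steps, hidden)
        self.net = nn.Sequential(
            nn.Linear(dim + hidden, hidden), nn.Softplus(),
            nn.Linear(hidden, hidden), nn.Softplus())
        self.mean_head = nn.Linear(hidden, dim)
        self.logvar_head = nn.Linear(hidden, dim)

    def forward(self, x_t, t):
        h = self.net(torch.cat([x_t, self.t_embed(t)], dim=-1))
        return self.mean_head(h), self.logvar_head(h)
```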
As we can see, the reverse process consists in inferring the values of the mean and log variance for a given timestep, by providing both the sample $\mathbf{x}_t$ and the timestep $t$ that we use to condition the models for $\mu_\theta$ and $\Sigma_\theta$. Finally, obtaining samples from the model is done by running through the whole Markov chain in reverse, starting from a normal distribution to obtain samples from the target distribution. Note that this process can be very slow if we have a large number of steps, as we need to wait for a given $\mathbf{x}_t$ to infer the following $\mathbf{x}_{t-1}$.
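A sketch of this reverse sampling loop, assuming the `ReverseModel` interface above (returning a mean and a log-variance), could look as follows.

```python
import torch

@torch.no_grad()
def p_sample_loop(model, shape, n_steps=100):
    # Run the whole reverse Markov chain, from pure noise down to x_0
    x = torch.randn(shape)
    for t in reversed(range(n_steps)):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        mean, logvar = model(x, t_batch)
        # No noise is added at the final step (t = 0)
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.exp(0.5 * logvar) * noise
    return x
```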
Training
By using Jensen's inequality on the previous expression, we can see that training may be performed by optimizing the variational bound on the negative log-likelihood
$$\mathbb{E}\left[-\log p_\theta(\mathbf{x}_0)\right] \leq \mathbb{E}_q\left[ -\log \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} \mid \mathbf{x}_0)} \right] = L$$
Therefore, efficient training is allowed by optimizing random terms of $L$ with gradient descent.
Further improvements, proposed by Ho et al. [1], come from variance reduction by rewriting $L$ as a sum of KL divergences
$$L = \mathbb{E}_q\Big[ \underbrace{D_{KL}\big(q(\mathbf{x}_T \mid \mathbf{x}_0) \,\|\, p(\mathbf{x}_T)\big)}_{L_T} + \sum_{t>1} \underbrace{D_{KL}\big(q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) \,\|\, p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)\big)}_{L_{t-1}} \underbrace{- \log p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1)}_{L_0} \Big]$$
All the KL divergences defined in this equation compare Gaussians, which means that they have a closed-form solution.
Implementing the loss - random time steps
To optimize this loss, we will need several computational tools, notably the KL divergence between two Gaussians, and the entropy of a Gaussian.
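These two utilities have closed forms for diagonal Gaussians parameterized by mean and log-variance; a small sketch is given below (function names are illustrative).

```python
import numpy as np
import torch

def normal_kl(mean1, logvar1, mean2, logvar2):
    # KL( N(mean1, exp(logvar1)) || N(mean2, exp(logvar2)) ), element-wise
    return 0.5 * (logvar2 - logvar1 + torch.exp(logvar1 - logvar2)
                  + (mean1 - mean2) ** 2 * torch.exp(-logvar2) - 1.0)

def gaussian_entropy(logvar):
    # Entropy of a diagonal Gaussian with the given log-variance
    return 0.5 * (logvar + np.log(2.0 * np.pi) + 1.0)
```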
The way the model is trained is slightly counterintuitive, since we select a timestep at random for each input of the batch. The implementation taken from the DDIM repository provides a form of antithetic sampling, which ensures that symmetrical points in the different chains are trained jointly. Therefore, the final procedure consists in first running the forward process on each input at a given (random) timestep (performing diffusion). Then, we run the reverse process on this sample and compute the loss.
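A sketch of this procedure is given below, assuming the `q_sample`, `normal_kl`, `ReverseModel`, and schedule tensors defined above; the antithetic pairing follows the spirit of the DDIM repository rather than copying it, and the reconstruction term at $t=0$ is left out for simplicity.

```python
import torch

def sample_timesteps(batch_size, n_steps):
    # Antithetic sampling: each drawn t is paired with (n_steps - t), so that
    # symmetrical points of the chain are trained jointly (t = 0 is skipped,
    # since it corresponds to the reconstruction term of the bound)
    t = torch.randint(1, n_steps, ((batch_size + 1) // 2,))
    return torch.cat([t, n_steps - t], dim=0)[:batch_size]

def diffusion_loss(model, x_0, n_steps=100):
    # 1) Forward process: diffuse each input to its (random) timestep t
    t = sample_timesteps(x_0.shape[0], n_steps)
    noise = torch.randn_like(x_0)
    x_t = q_sample(x_0, t, noise)
    # 2) Closed-form posterior q(x_{t-1} | x_t, x_0), the target Gaussian of L_{t-1}
    a_bar_t = alphas_bar[t].unsqueeze(-1)
    a_bar_prev = alphas_bar[t - 1].unsqueeze(-1)
    beta_t, alpha_t = betas[t].unsqueeze(-1), alphas[t].unsqueeze(-1)
    post_mean = (torch.sqrt(a_bar_prev) * beta_t * x_0
                 + torch.sqrt(alpha_t) * (1.0 - a_bar_prev) * x_t) / (1.0 - a_bar_t)
    post_logvar = torch.log(beta_t * (1.0 - a_bar_prev) / (1.0 - a_bar_t))
    # 3) Reverse process on the diffused sample, scored with the Gaussian KL
    mean, logvar = model(x_t, t)
    return normal_kl(post_mean, post_logvar, mean, logvar).mean()
```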
Denoising diffusion probabilistic models (DDPM)
Ho et al. [1] largely enhanced this diffusion model idea, by first proposing to rely on a new parameterization for the mean function
$$\mu_\theta(\mathbf{x}_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( \mathbf{x}_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(\mathbf{x}_t, t) \right)$$
Note that now, the model $\epsilon_\theta$ is trained to directly estimate a form of conditional noise function, which is used in the training and sampling processes. Furthermore, the authors suggest using a fixed variance function
$$\Sigma_\theta(\mathbf{x}_t, t) = \sigma_t^2 \mathbf{I}, \quad \text{with } \sigma_t^2 = \beta_t$$
This leads to a new sampling procedure for the reverse process as follows (we also quickly redefine the model to output the correct dimensionality).
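A sketch of this sampling procedure is shown below, where `eps_model` stands for any noise-prediction network $\epsilon_\theta(\mathbf{x}_t, t)$ and the schedule tensors are those defined earlier.

```python
import torch

@torch.no_grad()
def ddpm_sample(eps_model, shape, n_steps=100):
    # Reverse process with the epsilon-parameterization of the mean and the
    # fixed variance sigma_t^2 = beta_t
    x = torch.randn(shape)
    for t in reversed(range(n_steps)):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps = eps_model(x, t_batch)
        mean = (x - betas[t] / torch.sqrt(1.0 - alphas_bar[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            x = mean + torch.sqrt(betas[t]) * torch.randn_like(x)
        else:
            x = mean  # no noise is added at the final step
    return x
```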
Simplifying loss to denoising score matching
Based on this new parameterization [1] for the mean of the reverse process, the training objective can be simplified to
$$L_{t-1} = \mathbb{E}_{\mathbf{x}_0, \epsilon}\left[ \frac{\beta_t^2}{2 \sigma_t^2 \alpha_t (1 - \bar{\alpha}_t)} \left\| \epsilon - \epsilon_\theta\left(\sqrt{\bar{\alpha}_t}\, \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, t\right) \right\|^2 \right]$$
which resembles denoising score matching over multiple noise scales indexed by $t$.
Simplified training objective
The authors propose to largely simplify the objective, by dropping the initial weighting factor
$$L_{\text{simple}} = \mathbb{E}_{t, \mathbf{x}_0, \epsilon}\left[ \left\| \epsilon - \epsilon_\theta\left(\sqrt{\bar{\alpha}_t}\, \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, t\right) \right\|^2 \right]$$
They state that this variant is beneficial to the sample quality.
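A minimal sketch of this simplified objective, reusing the `q_sample` function defined earlier, could be

```python
import torch

def ddpm_simple_loss(eps_model, x_0, n_steps=100):
    # L_simple: predict the injected noise at a uniformly sampled timestep
    t = torch.randint(0, n_steps, (x_0.shape[0],))
    noise = torch.randn_like(x_0)
    x_t = q_sample(x_0, t, noise)
    return ((noise - eps_model(x_t, t)) ** 2).mean()
```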
Stabilizing training with Exponential Moving Average (EMA)
This idea is found in most implementations, and allows to implement a form of model momentum. Instead of directly updating the weights of the model, we keep a copy of the previous values of the weights, and then update a weighted mean between the previous and new versions of the weights. Here, we reuse the implementation proposed in the DDIM repository.
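A condensed sketch of such an EMA helper (not a verbatim copy of the DDIM code) is shown below; the decay `mu` is an illustrative default.

```python
import torch

class EMA:
    # Exponential moving average of model parameters (a form of model momentum)
    def __init__(self, model, mu=0.999):
        self.mu = mu
        self.shadow = {name: p.clone().detach() for name, p in model.named_parameters()}

    def update(self, model):
        # Weighted mean between the previous (shadow) and new weights
        for name, p in model.named_parameters():
            self.shadow[name].mul_(self.mu).add_(p.detach(), alpha=1.0 - self.mu)

    def copy_to(self, model):
        # Overwrite the live model with the averaged weights (e.g. for evaluation)
        with torch.no_grad():
            for name, p in model.named_parameters():
                p.copy_(self.shadow[name])
```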
The training loop is finally obtained with the following code
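A sketch of such a loop is given below, assuming `eps_model` is a noise-prediction network with the same (input, timestep) signature as above, and `dataset` an (N, 2) tensor of training samples (both names are placeholders).

```python
import torch

optimizer = torch.optim.Adam(eps_model.parameters(), lr=1e-3)
ema = EMA(eps_model, mu=0.999)

for step in range(10000):
    # Random mini-batch, simplified DDPM loss, gradient step, then EMA update
    x_0 = dataset[torch.randint(0, dataset.shape[0], (128,))]
    loss = ddpm_simple_loss(eps_model, x_0)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ema.update(eps_model)
    if step % 1000 == 0:
        print(f"step {step:6d} - loss {loss.item():.4f}")
```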
WaveGrad - Applications to waveform generation
Recently, these ideas from score matching and diffusion models were applied to waveform generation, through a model called WaveGrad [5]. This model proposes a conditional architecture to perform estimation of the gradients of the data log-density. The paper follows almost exactly the training setup of Ho et al. [1], but provides some tricks to ensure the quality of waveform generation.
Toy waveform dataset
In order to study this model, we will work with a toy dataset of simple waveforms. To do so, we generate a set of random additive sounds as defined in the following function.
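A possible sketch of such a generator is shown below; the frequency ranges, number of partials, and function name are illustrative choices for a toy additive dataset.

```python
import math
import torch

def generate_additive_waveforms(n_samples=1000, n_points=1024, n_partials=4, sr=16000.0):
    # Random additive sounds: each example is a sum of a few harmonic partials
    # with a random fundamental frequency and random (normalized) amplitudes
    t = torch.arange(n_points) / sr
    f0 = torch.rand(n_samples, 1) * 400.0 + 100.0        # fundamental in [100, 500] Hz
    amps = torch.rand(n_samples, n_partials)
    amps = amps / amps.sum(dim=1, keepdim=True)
    waves = torch.zeros(n_samples, n_points)
    for k in range(1, n_partials + 1):
        waves += amps[:, k - 1:k] * torch.sin(2 * math.pi * k * f0 * t)
    return waves

dataset = generate_additive_waveforms()
```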
WaveGrad model
The WaveGrad model is highly inspired by the DDPM model, but introduces several specificities for training on audio data. First, note that the model is trained to reconstruct a waveform $\mathbf{y}_0$ by conditioning on the Mel spectrogram. Hence, this introduces an external conditioning signal $\mathbf{x}$, which makes WaveGrad an approach to learning conditional generative models of the form $p(\mathbf{y}_0 \mid \mathbf{x})$. The other major differences introduced by WaveGrad are
- Conditioning on the continuous noise level ($\sqrt{\bar{\alpha}}$), rather than the timestep index ($t$)
- Fine-tuning the noise schedule ($\beta_t$), depending on the number of steps ($T$)
- Using a hierarchical sampling scheme on the continuous noise level ($\sqrt{\bar{\alpha}}$)
- Introducing a specific architecture for waveform generation
- Using positional encoding (from Transformers)
- Relying on downsampling and upsampling blocks
- Using FiLM layers to condition on Mel spectrogram
Introducing a conditional model
Compared to the formulation of DDPM, WaveGrad is defined as a conditional model. This implies that the full model probability is defined as
$$p_\theta(\mathbf{y}_0 \mid \mathbf{x}) = \int p_\theta(\mathbf{y}_{0:T} \mid \mathbf{x})\, d\mathbf{y}_{1:T}$$
where $\mathbf{x}$ is the set of conditioning features (for instance a Mel spectrogram). The forward (diffusion) process remains unchanged, but the reverse (trained) process becomes a conditional definition with
$$p_\theta(\mathbf{y}_{0:T} \mid \mathbf{x}) = p(\mathbf{y}_T) \prod_{t=1}^{T} p_\theta(\mathbf{y}_{t-1} \mid \mathbf{y}_t, \mathbf{x})$$
Each of the transitions in this process is still a conditional Gaussian, but this time including an extra conditioning signal through $\mathbf{x}$. During learning, we still aim to train the mean and covariance of a Gaussian diffusion kernel
$$p_\theta(\mathbf{y}_{t-1} \mid \mathbf{y}_t, \mathbf{x}) = \mathcal{N}\left(\mathbf{y}_{t-1}; \mu_\theta(\mathbf{y}_t, \mathbf{x}, t), \Sigma_\theta(\mathbf{y}_t, \mathbf{x}, t)\right)$$
Note that in this notebook implementation, we try to avoid using conditioning features and rather aim to directly model the waveform.
Hierarchical sampling scheme - continuous noise level
Similarly to DDPM, the WaveGrad model is trained to infer the noise function, depending on the timestep in the Markov chain. However, compared to DDPM, the model is now conditioned on the continuous noise level $\sqrt{\bar{\alpha}}$ instead of the timestep $t$ (and also on the extraneous features $\mathbf{x}$). To do so, the model introduces a hierarchical sampling scheme, where we first sample the time step as
$$s \sim U\left(\{1, \dots, S\}\right)$$
Then, the noise level is sampled continuously from a uniform distribution such that
$$\sqrt{\bar{\alpha}} \sim U\left(\sqrt{\bar{\alpha}_{s-1}}, \sqrt{\bar{\alpha}_s}\right)$$
where $\bar{\alpha}_s = \prod_{i=1}^{s} (1 - \beta_i)$ and where $\beta_i$ is a predefined noise schedule (similarly to the original diffusion model definitions).
Therefore, we have to update the sampling function for the forward process, in order to account for this continuous noise sampling scheme.
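A sketch of this updated forward sampling is given below, reusing the `alphas_bar` schedule defined earlier; the function name and the padding of the noise-level boundaries are illustrative.

```python
import torch

# Square roots of the cumulative products, padded with 1 at index 0 so that
# sqrt_alphas_bar[s-1] and sqrt_alphas_bar[s] bound the sampled noise level
sqrt_alphas_bar = torch.cat([torch.ones(1), torch.sqrt(alphas_bar)])

def q_sample_continuous(y_0, noise=None):
    # Hierarchical scheme: sample a segment s, then a continuous level inside it
    s = torch.randint(1, sqrt_alphas_bar.shape[0], (y_0.shape[0],))
    low, high = sqrt_alphas_bar[s], sqrt_alphas_bar[s - 1]
    noise_level = (low + (high - low) * torch.rand(y_0.shape[0])).unsqueeze(-1)
    if noise is None:
        noise = torch.randn_like(y_0)
    y_noisy = noise_level * y_0 + torch.sqrt(1.0 - noise_level ** 2) * noise
    return y_noisy, noise_level, noise
```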
Impact on the sampling function
Although the noise is continuously sampled in the training phase, note that there are no major changes in the sampling (inference) phase. However, we have to update the sampling function, as we now need to use the model on a continuous noise level $\sqrt{\bar{\alpha}_t}$ rather than the discrete timestep $t$. Note that we also clip the output, so that the sample is contained inside $[-1, 1]$.
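A sketch of this updated sampling loop follows, where `eps_model` is assumed to take the noisy waveform and the continuous noise level as inputs, and the schedule tensors are those defined earlier.

```python
import torch

@torch.no_grad()
def wavegrad_sample(eps_model, shape, n_steps=100):
    # Same DDPM-style reverse update, but the model is fed the continuous
    # noise level sqrt(alpha_bar_t) instead of the integer timestep
    y = torch.randn(shape)
    for t in reversed(range(n_steps)):
        noise_level = torch.sqrt(alphas_bar[t]).repeat(shape[0], 1)
        eps = eps_model(y, noise_level)
        mean = (y - betas[t] / torch.sqrt(1.0 - alphas_bar[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            y = mean + torch.sqrt(betas[t]) * torch.randn_like(y)
        else:
            y = mean
        y = y.clamp(-1.0, 1.0)   # keep the samples inside [-1, 1]
    return y
```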
Loss function
Similarly to DDPM, the WaveGrad model trains on a loss very similar to the denoising score matching objective. This implies that the model tries to infer a form of noise function. However, compared to DDPM, the model is now conditioned on the continuous noise level $\sqrt{\bar{\alpha}}$ instead of the timestep $t$, and also on the extraneous features $\mathbf{x}$. The authors also found that training with the $L_1$ loss instead of the $L_2$ loss provides a higher quality of samples. This leads to the following loss
$$\mathbb{E}_{\bar{\alpha}, \epsilon}\left[ \left\| \epsilon - \epsilon_\theta\left(\sqrt{\bar{\alpha}}\, \mathbf{y}_0 + \sqrt{1 - \bar{\alpha}}\, \epsilon, \mathbf{x}, \sqrt{\bar{\alpha}}\right) \right\|_1 \right]$$
Hence, the implementation is almost identical to that of DDPM, with the continuously sampled noise and the $L_1$ loss, as shown in the following code.
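A minimal sketch of this loss, reusing the `q_sample_continuous` function above and leaving out the conditioning features (as stated earlier for this notebook), could be

```python
import torch

def wavegrad_loss(eps_model, y_0):
    # L1 loss between the injected noise and the model prediction, with the
    # model conditioned on the continuously sampled noise level
    y_noisy, noise_level, noise = q_sample_continuous(y_0)
    eps = eps_model(y_noisy, noise_level)
    return (noise - eps).abs().mean()
```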
Model architecture
Regarding model architecture, WaveGrad proposes to use several existing mechanisms from the literature in order to solve different parts of the model.
- Using positional encoding (from Transformers)
- Relying on downsampling and upsampling blocks
- Using FiLM layers to condition on Mel spectrogram
As a first approach, we only change the previous model to include the positional encoding, and also add some simple conditional 1-dimensional convolutions. This allows us to test the overall approach with a simple model more adapted to waveform data.
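A sketch of such a simplified model is given below; the sinusoidal encoding of the noise level is loosely Transformer-style, and the layer sizes, scaling constant, and class names are illustrative rather than the WaveGrad architecture itself.

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    # Sinusoidal encoding of the continuous noise level (Transformer-style)
    def __init__(self, dim=64):
        super().__init__()
        self.dim = dim

    def forward(self, noise_level):
        half = self.dim // 2
        freqs = torch.exp(-math.log(1e4) * torch.arange(half) / half)
        args = noise_level * freqs.unsqueeze(0) * 100.0
        return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

class SimpleWaveModel(nn.Module):
    # Minimal 1-D convolutional noise predictor, conditioned on the noise level
    def __init__(self, hidden=64):
        super().__init__()
        self.embed = PositionalEncoding(hidden)
        self.input_conv = nn.Conv1d(1, hidden, 3, padding=1)
        self.cond_proj = nn.Linear(hidden, hidden)
        self.blocks = nn.Sequential(
            nn.Conv1d(hidden, hidden, 3, padding=1), nn.SiLU(),
            nn.Conv1d(hidden, hidden, 3, padding=1), nn.SiLU())
        self.output_conv = nn.Conv1d(hidden, 1, 3, padding=1)

    def forward(self, y, noise_level):
        # y: (batch, n_points), noise_level: (batch, 1)
        h = self.input_conv(y.unsqueeze(1))
        cond = self.cond_proj(self.embed(noise_level)).unsqueeze(-1)
        h = self.blocks(h + cond)
        return self.output_conv(h).squeeze(1)
```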
Training loop
The training loop with our first simplified architecture is similar to that of the previous sections.
Advanced WaveGrad architecture
As discussed earlier, one of the major proposals in the WaveGrad paper concerns the architecture of the model. Notably, we have so far bypassed the specific parts relying on
- Relying on downsampling and upsampling blocks
- Using FiLM layers to condition on Mel spectrogram
Here, we show the implementation of these blocks for defining the complete WaveGrad model, based on the implementation found in this GitHub repository.
Basic blocks
First, the WaveGrad model defines some simple blocks for the convolutions and the modulation (with FiLM layers).
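A condensed sketch of the FiLM idea is shown below (a simplified take on what such blocks compute, not a copy of the reference implementation): a conditioning path produces a channel-wise scale and shift that modulate the main path.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    # Feature-wise Linear Modulation: produces a scale and shift from the
    # conditioning path, applied channel-wise to the main (waveform) path
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.scale_conv = nn.Conv1d(in_channels, out_channels, 3, padding=1)
        self.shift_conv = nn.Conv1d(in_channels, out_channels, 3, padding=1)

    def forward(self, x):
        return self.scale_conv(x), self.shift_conv(x)

def featurewise_affine(x, scale, shift):
    # Apply the FiLM modulation to the features of the main path
    return scale * x + shift
```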
Upsampling and downsampling
The other major aspect of WaveGrad is that there are two reciprocal paths: one upsampling the conditioning features, and the other downsampling the noisy waveform input.
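A simplified sketch of these two block types is given below, reusing the `featurewise_affine` helper above; the dilations, pooling, and residual wiring are illustrative and much lighter than the reference blocks.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DBlock(nn.Module):
    # Downsampling block for the noisy waveform path
    def __init__(self, in_channels, out_channels, factor):
        super().__init__()
        self.residual = nn.Conv1d(in_channels, out_channels, 1)
        self.pool = nn.AvgPool1d(factor)
        self.convs = nn.Sequential(
            nn.SiLU(), nn.Conv1d(in_channels, out_channels, 3, padding=1),
            nn.SiLU(), nn.Conv1d(out_channels, out_channels, 3, padding=2, dilation=2))

    def forward(self, x):
        return self.pool(self.convs(x)) + self.pool(self.residual(x))

class UBlock(nn.Module):
    # Upsampling block for the conditioning path, modulated by FiLM scale/shift
    def __init__(self, in_channels, out_channels, factor):
        super().__init__()
        self.factor = factor
        self.residual = nn.Conv1d(in_channels, out_channels, 1)
        self.conv_1 = nn.Conv1d(in_channels, out_channels, 3, padding=1)
        self.conv_2 = nn.Conv1d(out_channels, out_channels, 3, padding=2, dilation=2)

    def forward(self, x, scale, shift):
        x = F.interpolate(x, scale_factor=self.factor)
        res = self.residual(x)
        x = self.conv_1(F.silu(x))
        x = featurewise_affine(x, scale, shift)
        x = self.conv_2(F.silu(x))
        return x + res
```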
Final model
Bibliography
[1] Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. arXiv preprint arXiv:2006.11239.
[2] Sohl-Dickstein, J., Weiss, E. A., Maheswaranathan, N., & Ganguli, S. (2015). Deep unsupervised learning using nonequilibrium thermodynamics. arXiv preprint arXiv:1503.03585.
[3] Vincent, P. (2011). A connection between score matching and denoising autoencoders. Neural computation, 23(7), 1661-1674.
[4] Song, J., Meng, C., & Ermon, S. (2020). Denoising Diffusion Implicit Models. arXiv preprint arXiv:2010.02502.
[5] Chen, N., Zhang, Y., Zen, H., Weiss, R. J., Norouzi, M., & Chan, W. (2020). WaveGrad: Estimating gradients for waveform generation. arXiv preprint arXiv:2009.00713.
[6] Hyvärinen, A. (2005). Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(Apr), 695-709.
[7] Song, Y., Garg, S., Shi, J., & Ermon, S. (2020, August). Sliced score matching: A scalable approach to density and score estimation. In Uncertainty in Artificial Intelligence (pp. 574-584). PMLR.
[8] Song, Y., & Ermon, S. (2019). Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems (pp. 11918-11930).