Learning multi-scale local conditional probability models of images

Zahra Kadkhodaie,Florentin Guth,Stéphane Mallat,Eero P Simoncelli
2023-03-06
Abstract:Deep neural networks can learn powerful prior probability models for images, as evidenced by the high-quality generations obtained with recent score-based diffusion methods. But the means by which these networks capture complex global statistical structure, apparently without suffering from the curse of dimensionality, remain a mystery. To study this, we incorporate diffusion methods into a multi-scale decomposition, reducing dimensionality by assuming a stationary local Markov model for wavelet coefficients conditioned on coarser-scale coefficients. We instantiate this model using convolutional neural networks (CNNs) with local receptive fields, which enforce both the stationarity and Markov properties. Global structures are captured using a CNN with receptive fields covering the entire (but small) low-pass image. We test this model on a dataset of face images, which are highly non-stationary and contain large-scale geometric structures. Remarkably, denoising, super-resolution, and image synthesis results all demonstrate that these structures can be captured with significantly smaller conditioning neighborhoods than required by a Markov model implemented in the pixel domain. Our results show that score estimation for large complex images can be reduced to low-dimensional Markov conditional models across scales, alleviating the curse of dimensionality.
Computer Vision and Pattern Recognition,Machine Learning
What problem does this paper attempt to address?
The paper primarily focuses on addressing the issue of how deep neural networks can avoid the curse of dimensionality when processing high-dimensional image data and proposes a method based on a multi-scale local conditional probability model. The core issue of the paper is to understand how deep neural networks can learn complex global statistical structures without being affected by the curse of dimensionality. To investigate this issue, the authors propose a method that integrates diffusion methods into a multi-scale decomposition framework by assuming a local Markov model of wavelet coefficients conditioned on coarser scale coefficients to reduce dimensionality. This method utilizes convolutional neural networks (CNNs) and local receptive fields to enforce this Markov property while maintaining translation invariance. Specifically, the authors' work can be summarized as follows: 1. **Problem Background**: Deep neural networks have shown excellent performance in generating high-quality images, especially in recent score diffusion methods. However, the specific mechanisms by which these networks capture complex global statistical structures remain unclear. 2. **Solution Strategy**: The paper proposes a multi-scale decomposition method, where the conditional probability distribution of wavelet coefficients is assumed to have a local Markov property to reduce dimensionality. This method leverages the local receptive field properties of CNNs to enforce translation invariance and the Markov property. 3. **Model Characteristics**: The model employs multi-scale wavelet decomposition, where wavelet coefficients at each scale are conditionally dependent on coarser scale wavelet coefficients. This conditional dependency is modeled as a local Markov random field, allowing the estimation of scores through a low-dimensional Markov conditional model, thereby alleviating the curse of dimensionality. 4. **Experimental Validation**: The paper conducts experiments on a face image dataset, demonstrating that this method can effectively capture long-range dependencies even with very small receptive fields, achieving good results in denoising, super-resolution, and image synthesis tasks. In summary, this study introduces a novel multi-scale local conditional probability model that not only addresses the curse of dimensionality faced by deep learning models when processing high-dimensional images but also demonstrates how to effectively capture complex global structures in images.