Generative Modeling with Explicit Memory

Yi Tang, Peng Sun, Zhenglin Cheng, Tao Lin
2024-12-12
Abstract: Recent studies indicate that the denoising process in deep generative diffusion models implicitly learns and memorizes semantic information from the data distribution. These findings suggest that capturing more complex data distributions requires larger neural networks, leading to a substantial increase in computational demands, which in turn become the primary bottleneck in both training and inference of diffusion models. To this end, we introduce Generative Modeling with Explicit Memory (GMem), which leverages an external memory bank in both the training and sampling phases of diffusion models. This approach preserves semantic information from data distributions, reducing reliance on neural network capacity for learning and generalizing across diverse datasets. The results are significant: GMem improves training efficiency, sampling efficiency, and generation quality. For instance, on ImageNet at $256 \times 256$ resolution, GMem accelerates SiT training by over $46.7\times$, achieving the performance of a SiT model trained for $7M$ steps in fewer than $150K$ steps. Compared to the most efficient existing method, REPA, GMem still offers a $16\times$ speedup, attaining an FID score of 5.75 within $250K$ steps, whereas REPA requires over $4M$ steps. Additionally, our method achieves state-of-the-art generation quality, with an FID score of 3.56 without classifier-free guidance on ImageNet $256 \times 256$. Our code is available at https://github.com/LINs-lab/GMem.
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
The problem this paper addresses is the high computational cost of training and sampling from diffusion models. Diffusion models generate high-quality, realistic data, but their training and inference require large neural network capacity and multi-step sampling, which makes them expensive. These costs have become the main bottleneck in the development of diffusion models.

To address this, the authors propose Generative Modeling with Explicit Memory (GMem). The method stores semantic information from the data distribution in an external memory bank, which reduces the memorization burden on the neural network and improves the efficiency of training and sampling.

### Main contributions:

1. **Separation of memory and generalization**: The authors argue that the function of a diffusion model can be divided into two parts: memorizing semantic information and generalizing to the real data distribution. Relying on the neural network itself for memorization imposes a significant computational and model-capacity burden.
2. **Introduction of an external memory bank**: GMem constructs an external memory bank that stores semantic information from the data distribution. This reduces the demand on neural network capacity and significantly improves the efficiency of training and sampling.
3. **Improvement of efficiency and quality**: The experimental results show that GMem not only improves the efficiency of training and sampling but also achieves state-of-the-art generation quality on multiple benchmark datasets.

### Specific implementation:

- **External memory bank design**: The memory bank is designed as a matrix \( M\in\mathbb{R}^{n\times m} \), and each memory segment \( s \) satisfies \( \|s\| = 1 \). The bank is built by optimizing an objective function so that its segments capture the semantic information in the data.
- **Training objective**: The loss function of the diffusion model is modified so that semantic information enters training through the memory segment \( s \), which is passed to the network as an additional input. The specific formula is:
  \[ L(\theta)=\int_{0}^{T}\mathbb{E}\left\|v_{\theta}(x_{t},s,t)-\dot{\alpha}_{t}x_{0}-\dot{\sigma}_{t}\epsilon\right\|^{2}dt \]
  where \( x_{0}\sim D \), \( \epsilon\sim\mathcal{N}(0,I) \), \( x_{t}=\alpha_{t}x_{0}+\sigma_{t}\epsilon \) is the noised sample, \( \dot{\alpha}_{t}=\frac{d\alpha_{t}}{dt} \), and \( \dot{\sigma}_{t}=\frac{d\sigma_{t}}{dt} \).
- **Sampling process**: In the sampling stage, GMem converts Gaussian noise into a distribution over indices, selects the corresponding memory segments from the memory bank, and generates the final image conditioned on them.

### Experimental results:

- **CIFAR-10 dataset**: GMem achieves an FID score (FID = 1.22) comparable to that of traditional generative models with fewer training steps.
- **ImageNet 64×64 and 256×256 datasets**: GMem achieves FID = 2.10 on ImageNet 64×64 and FID = 3.56 on ImageNet 256×256, significantly outperforming existing methods.

In conclusion, this paper addresses the high computational cost of training and sampling in diffusion models by introducing an external memory bank, and it significantly improves both the efficiency and the quality of generative models.
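
The memory bank constraint and the training objective described under "Specific implementation" can be illustrated with a small NumPy sketch. This is not the authors' implementation. It makes several assumptions that the summary does not state: the bank holds \( n \) segments of dimension \( m \) as rows; the schedule is linear with \( \alpha_{t}=1-t \) and \( \sigma_{t}=t \), so \( T=1 \); each training sample is paired with one segment by index; and the time integral is estimated by drawing one uniform \( t \) per sample. The names `build_memory_bank`, `interpolate`, `gmem_style_loss`, and `toy_velocity` are hypothetical, and `toy_velocity` is only a stand-in for the network \( v_{\theta} \).

```python
import numpy as np


def build_memory_bank(features):
    """Stack n feature vectors of dimension m into a bank M with unit-norm rows."""
    features = np.asarray(features, dtype=np.float64)
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    # Guard against division by zero for all-zero feature vectors.
    return features / np.clip(norms, 1e-12, None)


def interpolate(x0, eps, t):
    """Return x_t and the regression target for a linear schedule.

    Assumed schedule: alpha_t = 1 - t and sigma_t = t, so that
    d(alpha_t)/dt = -1 and d(sigma_t)/dt = +1.
    """
    t = np.asarray(t, dtype=np.float64).reshape(-1, 1)
    x_t = (1.0 - t) * x0 + t * eps
    target = -x0 + eps  # alpha_dot * x0 + sigma_dot * eps
    return x_t, target


def gmem_style_loss(v_theta, x0, segments, rng):
    """Monte Carlo estimate of the time-integrated loss with T = 1."""
    t = rng.uniform(0.0, 1.0, size=x0.shape[0])
    eps = rng.standard_normal(x0.shape)
    x_t, target = interpolate(x0, eps, t)
    pred = v_theta(x_t, segments, t)
    # Squared L2 error per sample, averaged over the batch.
    return float(np.mean(np.sum((pred - target) ** 2, axis=1)))


def toy_velocity(x_t, segments, t):
    """Hypothetical stand-in for the network: ignores t, uses the segment as a bias."""
    return segments - x_t


rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 4))                       # toy data: batch of 8, dimension 4
bank = build_memory_bank(rng.standard_normal((8, 4)))  # n = 8 segments, m = 4
loss = gmem_style_loss(toy_velocity, x0, bank, rng)
```

In this toy setup the segment dimension equals the data dimension only so that `toy_velocity` can add the two arrays; a real network would consume the segment as a conditioning input of whatever dimension the bank uses.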