Abstract: Recent studies indicate that the denoising process in deep generative diffusion models implicitly learns and memorizes semantic information from the data distribution. These findings suggest that capturing more complex data distributions requires larger neural networks, leading to a substantial increase in computational demands, which in turn become the primary bottleneck in both training and inference of diffusion models. To this end, we introduce Generative Modeling with Explicit Memory (GMem), leveraging an external memory bank in both training and sampling phases of diffusion models. This approach preserves semantic information from data distributions, reducing reliance on neural network capacity for learning and generalizing across diverse datasets. The results are significant: GMem improves training efficiency, sampling efficiency, and generation quality. For instance, on ImageNet at $256 \times 256$ resolution, GMem accelerates SiT training by over $46.7\times$, achieving the performance of a SiT model trained for $7M$ steps in fewer than $150K$ steps. Compared to the most efficient existing method, REPA, GMem still offers a $16\times$ speedup, attaining an FID score of 5.75 within $250K$ steps, whereas REPA requires over $4M$ steps. Additionally, our method achieves state-of-the-art generation quality, with an FID score of 3.56 without classifier-free guidance on ImageNet $256\times256$. Our code is available at https://github.com/LINs-lab/GMem.

What problem does this paper attempt to address?

This paper addresses the high computational cost and inefficiency of training and sampling in diffusion models. Diffusion models generate high-quality, realistic data, but training and inference require large neural network capacity and multi-step sampling, which makes them computationally expensive. These costs have become the main bottleneck in the development of diffusion models.

To address them, the authors propose "Generative Modeling with Explicit Memory (GMem)". The method introduces an external memory bank that stores semantic information from the data distribution, which reduces the memorization burden on the neural network and improves the efficiency of training and sampling.

### Main contributions:

1. **Separation of memory and generalization**: The authors propose that the function of a diffusion model can be divided into two parts: memorizing semantic information and generalizing to the real data distribution. Relying on the neural network for memorization imposes significant computational and model-capacity burdens.
2. **Introduction of an external memory bank**: GMem constructs an external memory bank to store semantic information from the data distribution. This reduces the demand on neural network capacity and significantly improves the efficiency of training and sampling.
3. **Improvement of efficiency and quality**: The experimental results show that GMem improves the efficiency of training and sampling and also achieves state-of-the-art generation quality on multiple benchmark datasets.

### Specific implementation:

- **External memory bank design**: The memory bank is designed as a matrix $M\in\mathbb{R}^{n\times m}$, and each memory segment $s$ satisfies $\|s\| = 1$. An objective function is optimized so that the memory bank captures the semantic information in the data.
- **Training objective**: The loss function of the diffusion model is modified so that semantic information is incorporated into training. The specific formula is:
  \[ L(\theta)=\int_{0}^{T}\mathbb{E}\left\|v_{\theta}(x_{t},s,t)-\dot{\alpha}_{t}x_{0}-\dot{\sigma}_{t}\epsilon\right\|^{2}dt \]
  where $x_{0}\sim D$, $\epsilon\sim\mathcal{N}(0,I)$, $\dot{\alpha}_{t}=\frac{d\alpha_{t}}{dt}$, and $\dot{\sigma}_{t}=\frac{d\sigma_{t}}{dt}$.
- **Sampling process**: In the sampling stage, GMem generates the final image by converting Gaussian noise into an index distribution and selecting the corresponding memory segments from the memory bank.

### Experimental results:

- **CIFAR-10 dataset**: GMem reaches FID = 1.22, comparable to traditional generative models, with fewer training steps.
- **ImageNet 64×64 and 256×256 datasets**: GMem achieves FID = 2.10 on ImageNet 64×64 and FID = 3.56 on ImageNet 256×256, significantly outperforming existing methods.

In conclusion, this paper addresses the high computational burden of training and sampling in diffusion models by introducing an external memory bank, and significantly improves the efficiency and quality of generative models.
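
The implementation notes above (a unit-norm memory bank, a velocity-matching loss conditioned on a memory segment $s$, and index-based selection at sampling time) can be illustrated with a minimal NumPy sketch. It is not the authors' code, and several pieces are assumptions made only for illustration: the rows of $M$ are treated as the segments and filled with random values, the schedule is taken to be linear ($\alpha_t = 1 - t$, $\sigma_t = t$, $T = 1$), `toy_velocity` is a hypothetical linear stand-in for $v_\theta$, and `pick_segment` maps Gaussian noise to an index by maximum dot product because the summary does not specify that mapping.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_memory_bank(n, m, rng):
    """Return an (n, m) matrix whose rows (assumed to be the segments) have unit L2 norm."""
    bank = rng.standard_normal((n, m))  # random placeholder content, not learned
    return bank / np.linalg.norm(bank, axis=1, keepdims=True)

def interpolate(x0, eps, t):
    """Assumed linear schedule on t in [0, 1]: alpha_t = 1 - t, sigma_t = t."""
    alpha, sigma = 1.0 - t, t
    d_alpha, d_sigma = -1.0, 1.0  # time derivatives of alpha_t and sigma_t
    x_t = alpha[:, None] * x0 + sigma[:, None] * eps
    target = d_alpha * x0 + d_sigma * eps  # the term subtracted from v_theta in the loss
    return x_t, target

def toy_velocity(params, x_t, s, t):
    """Hypothetical stand-in for v_theta(x_t, s, t): linear in the concatenated inputs."""
    feats = np.concatenate([x_t, s, t[:, None]], axis=1)
    return feats @ params

def velocity_loss(params, x0, s, rng):
    """Monte Carlo estimate of L(theta) with T = 1 (t uniform, eps standard normal)."""
    t = rng.uniform(0.0, 1.0, size=x0.shape[0])
    eps = rng.standard_normal(x0.shape)
    x_t, target = interpolate(x0, eps, t)
    pred = toy_velocity(params, x_t, s, t)
    return float(np.mean(np.sum((pred - target) ** 2, axis=1)))

def pick_segment(bank, rng):
    """Illustrative only: draw Gaussian noise and select the row with the largest dot product."""
    z = rng.standard_normal(bank.shape[1])
    idx = int(np.argmax(bank @ z))
    return idx, bank[idx]

# Usage with small, arbitrary sizes.
n, m, d, batch = 16, 8, 4, 32
bank = make_memory_bank(n, m, rng)
x0 = rng.standard_normal((batch, d))      # placeholder "data"
s = bank[rng.integers(0, n, size=batch)]  # one segment per sample (random pairing)
params = np.zeros((d + m + 1, d))         # untrained toy parameters
loss = velocity_loss(params, x0, s, rng)
idx, segment = pick_segment(bank, rng)
print(f"loss estimate: {loss:.3f}, selected index: {idx}")
```

In an actual system the segment paired with each training image would come from the memory bank construction described in the paper rather than from a random draw, and $v_\theta$ would be a trained network such as SiT.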

Generative Modeling with Explicit Memory
