Abstract: Multimodal generative models have recently gained significant attention for their ability to learn representations across various modalities, enhancing joint and cross-generation coherence. However, most existing works use standard Gaussian or Laplacian distributions as priors, which may struggle to capture the diverse information inherent in multiple data types due to their unimodal and less informative nature. Energy-based models (EBMs), known for their expressiveness and flexibility across various tasks, have yet to be thoroughly explored in the context of multimodal generative models. In this paper, we propose a novel framework that integrates the multimodal latent generative model with the EBM. Both models can be trained jointly through a variational scheme. This approach results in a more expressive and informative prior that better captures information across multiple modalities. Our experiments validate the proposed model, demonstrating its superior generation coherence.
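
The abstract does not state the objective explicitly. As a hedged sketch, the block below writes out the exponential-tilting form that is commonly used for latent-space EBM priors, together with the variational bound it would induce. All symbols here ($z$ for the shared latent, $x_{1:M}$ for the $M$ modalities, $f_\alpha$ for the negative energy, $p_0$ for the base Gaussian, $q_\phi$ for the inference model, $p_{\beta_m}$ for the per-modality decoders) are assumed notation, and the factorization of the likelihood over modalities given $z$ is an assumption rather than something stated on this page.

```latex
\begin{align}
  % EBM prior: a learned tilt of the base Gaussian p_0 (assumed form)
  p_\alpha(z) &= \frac{1}{Z(\alpha)} \exp\!\big(f_\alpha(z)\big)\, p_0(z),
  \qquad Z(\alpha) = \int \exp\!\big(f_\alpha(z)\big)\, p_0(z)\, dz \\
  % Variational lower bound with the EBM prior in place of p_0
  \log p(x_{1:M}) &\ge
  \mathbb{E}_{q_\phi(z \mid x_{1:M})}\Big[\sum_{m=1}^{M} \log p_{\beta_m}(x_m \mid z)\Big]
  - \mathrm{KL}\big(q_\phi(z \mid x_{1:M}) \,\|\, p_\alpha(z)\big) \\
  % The KL term splits into the usual Gaussian KL plus two EBM terms
  \mathrm{KL}\big(q_\phi \,\|\, p_\alpha\big) &=
  \mathrm{KL}\big(q_\phi \,\|\, p_0\big)
  - \mathbb{E}_{q_\phi}\big[f_\alpha(z)\big] + \log Z(\alpha) \\
  % Gradient of the bound with respect to the prior parameters
  \nabla_\alpha\, \mathrm{ELBO} &=
  \mathbb{E}_{q_\phi(z \mid x_{1:M})}\big[\nabla_\alpha f_\alpha(z)\big]
  - \mathbb{E}_{p_\alpha(z)}\big[\nabla_\alpha f_\alpha(z)\big]
\end{align}
```

Under these assumptions, setting $f_\alpha \equiv 0$ gives $Z(\alpha) = 1$ and recovers the standard Gaussian-prior bound, so the EBM terms can be read as a learned correction to the usual multimodal VAE objective.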

What problem does this paper attempt to address?

The problem that this paper attempts to solve is how to better capture and represent the data complexity and shared information across multiple modalities in multimodal generative models. Specifically, existing multimodal generative models usually use standard Gaussian or Laplace distributions as priors, and these priors are limited in expressiveness and information richness, making it difficult to capture the diverse information in multimodal data. Therefore, the paper proposes a new framework that combines energy-based models (EBM) as priors with multimodal latent generative models to improve the expressiveness of the model and the consistency of generation.

### Main problems and solutions

1. **Limitations of existing methods**:
   - **Prior selection**: Most existing multimodal generative models use standard Gaussian or Laplace distributions as priors, and these priors are limited in expressiveness and information richness, making it difficult to capture the diverse information in multimodal data.
   - **Generation consistency**: Due to the limitations of the prior, the generated samples have poor consistency and semantic coherence between different modalities.
2. **Proposed solutions**:
   - **Energy-based model (EBM) prior**: The paper proposes using energy-based models (EBM) as priors. EBM has higher expressiveness and flexibility and can better capture the complex distributions and shared information in multimodal data.
   - **Variational learning scheme**: In order to efficiently perform posterior sampling and model training, the paper introduces a variational learning scheme. By introducing an inference model to approximate the posterior distribution of the generative model, the complex MCMC sampling process is avoided.

### Main contributions of the paper

1. **Proposing an energy-based prior model**: Using energy-based models (EBM) as priors for multimodal latent generative models improves the expressiveness of the model and its ability to capture the complexity of multimodal data.
2. **Developing a variational training scheme**: A variational training scheme is proposed, enabling the generative model, the inference model, and the energy-based prior to be jointly and effectively learned.
3. **Experimental verification**: Through various experiments and ablation studies, the superior performance of the proposed method on multiple benchmark datasets is demonstrated, especially in joint consistency and cross-modal generation tasks.

### Conclusion

The paper proposes a new framework that significantly improves the expressiveness of multimodal generative models and the consistency of generated samples by using energy-based models (EBM) as priors. The experimental results show that this method outperforms existing baseline models on multiple benchmark datasets, especially in joint generation and cross-modal generation tasks. Future work will further explore tighter ELBO bounds, other expressive priors, and applications on larger-scale multimodal datasets.
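
To make the contrast between a unimodal Gaussian prior and an energy-based prior concrete, here is a minimal one-dimensional sketch. It assumes the exponential-tilting form p(z) proportional to exp(f(z)) * N(z; 0, 1), which is a common choice for latent-space EBM priors and is not spelled out in the summary above. The function name `langevin_prior_samples`, the toy tilt functions, and all hyperparameters are hypothetical, and the Langevin sampler is used only to illustrate what the prior looks like, not to reproduce the paper's training procedure.

```python
import random


def langevin_prior_samples(grad_f, n_chains=2000, n_steps=200, step=0.3, seed=0):
    """Approximate samples from p(z) proportional to exp(f(z)) * N(z; 0, 1) in 1-D.

    grad_f is the derivative f'(z) of the (hypothetical) negative energy f.
    """
    rng = random.Random(seed)
    # Start every chain from the base Gaussian p_0(z) = N(0, 1).
    zs = [rng.gauss(0.0, 1.0) for _ in range(n_chains)]
    for _ in range(n_steps):
        for i, z in enumerate(zs):
            # d/dz log p(z) = f'(z) - z, since log p_0(z) = -z^2 / 2 + const.
            grad_log_p = grad_f(z) - z
            # Unadjusted Langevin update: gradient drift plus Gaussian noise.
            zs[i] = z + 0.5 * step ** 2 * grad_log_p + step * rng.gauss(0.0, 1.0)
    return zs


def mean(xs):
    return sum(xs) / len(xs)


# Sanity check: a linear tilt f(z) = a * z turns N(0, 1) into N(a, 1),
# so the sample mean should land close to a.
a = 1.5
linear = langevin_prior_samples(lambda z: a)
print("linear tilt, sample mean:", round(mean(linear), 2))

# Toy bimodal tilt f(z) = -(z^2 - 4)^2 / 4, with f'(z) = -z * (z^2 - 4).
# The tilted prior has two modes, which a single Gaussian cannot represent.
bimodal = langevin_prior_samples(lambda z: -z * (z * z - 4.0))
away_from_zero = sum(abs(z) > 0.5 for z in bimodal) / len(bimodal)
print("bimodal tilt, fraction with |z| > 0.5:", round(away_from_zero, 2))
```

In this toy setting the base Gaussian places most of its mass near zero, while the tilted prior moves mass toward two separated modes. In a learned model, f would be a neural network over the shared latent vector rather than a fixed polynomial.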

Learning Multimodal Latent Generative Models with Energy-Based Prior

Learning Multimodal Latent Space with EBM Prior and MCMC Inference

Learning Latent Space Energy-Based Prior Model

Learning Joint Latent Space EBM Prior Model for Multi-layer Generator

Multimodal Latent Language Modeling with Next-Token Diffusion

Learning Hierarchical Features with Joint Latent Space Energy-Based Prior

Learning Energy-Based Prior Model with Diffusion-Amortized MCMC

Learning Latent Space Hierarchical EBM Diffusion Models

Learning Energy-based Model via Dual-MCMC Teaching

Adaptive Multi-stage Density Ratio Estimation for Learning Latent Space Energy-based Model

Latent Diffusion Energy-Based Model for Interpretable Text Modeling

Generalizing Energy-based Generative ConvNets from Particle Evolution Perspective

Generalizing Multimodal Variational Methods to Sets

Guiding Energy-based Models via Contrastive Latent Variables

Multimodal Adversarially Learned Inference with Factorized Discriminators

Learning Probabilistic Models from Generator Latent Spaces with Hat EBM

Discriminative multimodal learning via conditional priors in generative models

Energy-Based Models with Applications to Speech and Language Processing

Learning more expressive joint distributions in multimodal variational methods

Bi-level Doubly Variational Learning for Energy-based Latent Variable Models