Abstract: Retrieval-augmented language models show promise in addressing issues like outdated information and hallucinations in language models (LMs). However, current research faces two main problems: 1) determining what information to retrieve, and 2) effectively combining retrieved information during generation. We argue that valuable retrieved information should not only be related to the current source text but also consider the future target text, given the nature of LMs that model future tokens. Moreover, we propose that aggregation using latent variables derived from a compact latent space is more efficient than utilizing explicit raw text, which is limited by context length and susceptible to noise. Therefore, we introduce RegaVAE, a retrieval-augmented language model built upon the variational auto-encoder (VAE). It encodes the text corpus into a latent space, capturing current and future information from both source and target text. Additionally, we leverage the VAE to initialize the latent space and adopt the probabilistic form of the retrieval generation paradigm by expanding the Gaussian prior distribution into a Gaussian mixture distribution. Theoretical analysis provides an optimizable upper bound for RegaVAE. Experimental results on various datasets demonstrate significant improvements in text generation quality and hallucination removal.
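
The idea of expanding a single Gaussian prior into a Gaussian mixture over retrieved latents can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `mixture_prior_sample`, the distance-based softmax weighting, and the shared isotropic `sigma` are all assumptions made for illustration, since the abstract does not specify how components are weighted or parameterized.

```python
import numpy as np

def mixture_prior_sample(query_mu, retrieved_mu, sigma=1.0, rng=None):
    """Draw one latent z from a Gaussian mixture (illustrative sketch only).

    Components are centered on the query latent and on each retrieved
    neighbor latent. All names and the weighting rule are hypothetical.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Stack the query mean and the retrieved means: shape (k + 1, d).
    centers = np.vstack([query_mu[None, :], retrieved_mu])
    # Assumed weighting: softmax over negative squared distance to the query.
    logits = -np.sum((centers - query_mu) ** 2, axis=1)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    # Pick a component, then add isotropic Gaussian noise around its center.
    component = rng.choice(len(centers), p=weights)
    z = centers[component] + sigma * rng.standard_normal(centers.shape[1])
    return z, weights
```

With this weighting, the query's own component always receives the largest weight and closer neighbors contribute more than distant ones. A decoder would then condition on `z` instead of on concatenated raw retrieved text, which is the motivation the abstract gives for aggregating in latent space.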

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper "RegaVAE: A Retrieval-Augmented Gaussian Mixture Variational Auto-Encoder for Language Modeling" aims to address several key issues in existing language models (LMs):

1. **Outdated Information and Hallucination**: Existing language models may generate outdated information or hallucinated content when producing text. These issues affect the reliability and accuracy of the models.
2. **Effective Utilization of Retrieved Information**: Current retrieval-augmented language models face two main challenges:
   - **Determining What Information to Retrieve**: How to select retrieval information that is relevant to the current source text and useful for the future target text.
   - **Effectively Integrating Retrieved Information**: How to effectively integrate the retrieved information during the generation process, especially considering the future target text.

### Solution

To address the above issues, the authors propose RegaVAE, a retrieval-augmented language model based on the Gaussian Mixture Variational Auto-Encoder (GM-VAE). The main contributions of RegaVAE include:

1. **Implicit Integration of Current and Future Information**: By introducing a compact latent space, RegaVAE can implicitly integrate current and future information, ensuring that the retrieved documents are not only relevant to the current source text but also useful for the future target text.
2. **Efficient Aggregation of Retrieved Information**: RegaVAE implicitly aggregates the retrieved information and the source text into the generation process, avoiding the length limitations and noise issues of explicit aggregation methods. Specifically, the model extends the Gaussian prior distribution to a Gaussian mixture distribution, ensuring continuity and uniformity in the latent space, thereby improving the quality and diversity of the generated text.
3. **Optimized Framework**: The authors derive an optimizable upper bound for training RegaVAE, ensuring the model performs well in terms of generation quality, diversity, and reducing hallucinations.

### Experimental Results

Experimental results show that RegaVAE significantly improves the quality of text generation and reduces hallucinations across multiple datasets. Specific metrics include Perplexity (PPL), Self-BLEU, Dist2, and Activated Units (AU). Additionally, human evaluations confirm RegaVAE's superior performance in fluency, coherence, diversity, and reducing hallucinations.

### Conclusion

RegaVAE effectively addresses the issues of outdated information and hallucinations in existing language models by introducing a compact latent space and Gaussian mixture distribution, achieving significant improvements in generation quality and diversity.
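
Two of the automatic metrics named in the experimental results, Dist2 and Activated Units (AU), follow common definitions that are simple to sketch. The code below is an illustration under stated assumptions, not the paper's evaluation script: whitespace tokenization and the 0.01 variance threshold are conventional choices assumed here, and the function names are hypothetical.

```python
import numpy as np

def distinct_n(texts, n=2):
    """Ratio of unique n-grams to total n-grams over a set of generated texts.

    Assumes whitespace tokenization; the paper's tokenizer may differ.
    """
    total = 0
    unique = set()
    for text in texts:
        tokens = text.split()
        # Build n-grams by zipping n shifted views of the token list.
        ngrams = list(zip(*(tokens[i:] for i in range(n))))
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0

def active_units(posterior_means, threshold=0.01):
    """Count latent dimensions whose posterior mean varies across the data.

    posterior_means: array of shape (num_examples, latent_dim).
    The 0.01 threshold is a commonly used value, assumed here.
    """
    variances = np.var(posterior_means, axis=0)
    return int(np.sum(variances > threshold))
```

A higher Dist2 indicates more varied bigrams in the generated text, and a higher AU count indicates that more latent dimensions carry information about the input rather than collapsing to the prior.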

RegaVAE: A Retrieval-Augmented Gaussian Mixture Variational Auto-Encoder for Language Modeling

Improving Variational Autoencoders with Density Gap-based Regularization

Dispersed EM-VAEs for Interpretable Text Generation

Fixing Gaussian Mixture VAEs for Interpretable Text Generation

Dispersed Exponential Family Mixture VAEs for Interpretable Text Generation

AdaVAE: Exploring Adaptive GPT-2s in Variational Auto-Encoders for Language Modeling

Advanced Conditional Variational Autoencoders (A-CVAE): Towards interpreting open-domain conversation generation via disentangling latent feature representation

HGMVAE: hierarchical disentanglement in Gaussian mixture variational autoencoder

VAE-Stega: Linguistic Steganography Based on Variational Auto-Encoder

Retrieval-Augmented Generation for Large Language Models: A Survey

On the Encoder-Decoder Incompatibility in Variational Text Modeling and Beyond

LlaMaVAE: Guiding Large Language Model Generation via Continuous Latent Sentence Spaces

Neural Gaussian Copula for Variational Autoencoder

Alleviating Hallucination in Large Vision-Language Models with Active Retrieval Augmentation

Variational Auto-Decoder: A Method for Neural Generative Modeling from Incomplete Data

Improve Variational Autoencoder for Text Generation with Discrete Latent Bottleneck

eVAE: Evolutionary Variational Autoencoder

Diverse and Accurate Image Description Using a Variational Auto-Encoder with an Additive Gaussian Encoding Space

Entropy-Based Decoding for Retrieval-Augmented Large Language Models

VAEGAN: A Collaborative Filtering Framework based on Adversarial Variational Autoencoders