$\text{Memory}^3$: Language Modeling with Explicit Memory

Hongkang Yang, Zehao Lin, Wenjin Wang, Hao Wu, Zhiyu Li, Bo Tang, Wenqiang Wei, Jinbo Wang, Zeyun Tang, Shichao Song, Chenyang Xi, Yu Yu, Kai Chen, Feiyu Xiong, Linpeng Tang, Weinan E
2024-07-01
Abstract: The training and inference of large language models (LLMs) are together a costly process that transports knowledge from raw data to meaningful computation. Inspired by the memory hierarchy of the human brain, we reduce this cost by equipping LLMs with explicit memory, a memory format cheaper than model parameters and text retrieval-augmented generation (RAG). Conceptually, with most of its knowledge externalized to explicit memories, the LLM can enjoy a smaller parameter size, training cost, and inference cost, all proportional to the amount of remaining "abstract knowledge". As a preliminary proof of concept, we train from scratch a 2.4B LLM, which achieves better performance than much larger LLMs as well as RAG models, and maintains higher decoding speed than RAG. The model is named $\text{Memory}^3$, since explicit memory is the third form of memory in LLMs after implicit memory (model parameters) and working memory (context key-values). We introduce a memory circuitry theory to support the externalization of knowledge, and present novel techniques including a memory sparsification mechanism that makes storage tractable and a two-stage pretraining scheme that facilitates memory formation.
Computation and Language,Artificial Intelligence,Machine Learning
What problem does this paper attempt to address?
The paper proposes a new way to store knowledge in large language models (LLMs) in order to reduce the cost of training and inference. Inspired by the memory hierarchy of the human brain, the authors introduce "explicit memory," a memory format that is cheaper than both model parameters and text retrieval-augmented generation (RAG). With most knowledge externalized to explicit memory, the parameter count, training cost, and inference cost can all shrink, since each is proportional to the amount of "abstract knowledge" that remains in the model. As a proof of concept, the paper presents $\text{Memory}^3$, a 2.4B-parameter LLM that outperforms much larger LLMs as well as RAG models and decodes faster than RAG. $\text{Memory}^3$ converts text into retrievable explicit memories and recalls them during inference. To make this practical, the authors propose a memory sparsification mechanism that keeps storage tractable and a two-stage pretraining scheme that facilitates memory formation.
The paper also presents a memory hierarchy theory that divides memory formats into implicit memory (model parameters), explicit memory, and external information (RAG). It discusses how to assign each piece of knowledge to a level of this hierarchy according to how often the knowledge is used, so that the total cost is minimized. Explicit memory lets an LLM store and retrieve knowledge more efficiently, which reduces the need for very large models and improves knowledge efficiency.
Explicit memory also helps with the problem of knowledge traversal, in which an LLM inefficiently invokes all of its parameters every time it generates a token. The authors compare an LLM without explicit memory to patients who, due to damage, cannot learn semantic knowledge directly but can still acquire skills through repeated prompting, which suggests that LLM training efficiency has room to improve.
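
The idea of placing knowledge in the memory hierarchy by usage frequency can be illustrated with a small sketch. It assumes a simple linear cost model (a one-time write cost plus a per-use read cost), and every number below is a hypothetical placeholder in arbitrary units rather than a figure from the paper. The names `MemoryFormat` and `cheapest_format` are likewise invented for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryFormat:
    """One level of the memory hierarchy under an assumed linear cost model."""
    name: str
    write_cost: float  # one-time cost of storing a piece of knowledge
    read_cost: float   # cost paid every time that knowledge is used

    def total_cost(self, uses: int) -> float:
        return self.write_cost + uses * self.read_cost


# Hypothetical costs: parameters are expensive to write but cheap to read,
# RAG is the opposite, and explicit memory sits in between.
FORMATS = (
    MemoryFormat("implicit memory (model parameters)", write_cost=100.0, read_cost=1.0),
    MemoryFormat("explicit memory", write_cost=30.0, read_cost=4.0),
    MemoryFormat("external information (RAG)", write_cost=0.0, read_cost=20.0),
)


def cheapest_format(uses: int) -> MemoryFormat:
    """Return the format with the lowest total cost for the expected number of uses."""
    if uses < 0:
        raise ValueError("uses must be non-negative")
    return min(FORMATS, key=lambda fmt: fmt.total_cost(uses))


if __name__ == "__main__":
    for uses in (1, 10, 1000):
        print(uses, "->", cheapest_format(uses).name)
```

Under these placeholder costs, knowledge that is used once is cheapest to leave as external text, knowledge used a moderate number of times is cheapest as explicit memory, and very frequently used knowledge is cheapest in the parameters. The crossover points depend entirely on the assumed numbers.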
According to the summary of results, $\text{Memory}^3$ shows improved factual accuracy and adapts more readily to specialized tasks. The applications discussed for explicit memory include handling effectively unlimited context, memory consolidation, and improved factuality and interpretability. The model is trained from scratch as a preliminary proof of concept.