MemSim: A Bayesian Simulator for Evaluating Memory of LLM-based Personal Assistants

Zeyu Zhang, Quanyu Dai, Luyu Chen, Zeren Jiang, Rui Li, Jieming Zhu, Xu Chen, Yi Xie, Zhenhua Dong, Ji-Rong Wen
2024-09-30
Abstract: LLM-based agents have been widely applied as personal assistants, capable of memorizing information from user messages and responding to personal queries. However, there is still a lack of objective and automatic evaluation of their memory capability, largely due to the challenges in constructing reliable questions and answers (QAs) according to user messages. In this paper, we propose MemSim, a Bayesian simulator designed to automatically construct reliable QAs from generated user messages while preserving their diversity and scalability. Specifically, we introduce the Bayesian Relation Network (BRNet) and a causal generation mechanism to mitigate the impact of LLM hallucinations on factual information, facilitating the automatic creation of an evaluation dataset. Based on MemSim, we generate a dataset in the daily-life scenario, named MemDaily, and conduct extensive experiments to assess the effectiveness of our approach. We also provide a benchmark for evaluating different memory mechanisms in LLM-based agents with the MemDaily dataset. To benefit the research community, we have released our project at https://github.com/nuster1128/MemSim.
Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
This paper addresses the lack of objective, automatic methods for evaluating the memory ability of personal assistants based on large language models (LLMs). Specifically, existing methods struggle to construct reliable questions and answers (QAs), and in particular to ensure the reliability, diversity, and scalability of the generated datasets.

### Specific description of the problem

1. **Reliability**: Existing methods are easily affected by LLM hallucinations when generating datasets, resulting in inaccurate factual information. For example, in complex scenarios, the accuracy of LLM generation may be below 40%.
2. **Diversity**: User profiles generated by LLMs often lack diversity; the model tends to produce the most likely profiles with little variation.
3. **Scalability**: Manually annotating real user messages and question-answer pairs requires a large amount of human labor and is difficult to scale.

### Solutions proposed in the paper

To solve the above problems, the authors propose MemSim, a Bayesian simulator that automatically constructs reliable QAs from generated user messages while maintaining their diversity and scalability. Its two components are:

- **Bayesian Relation Network (BRNet)**: Generates user profiles with a hierarchical structure, improving the diversity and scalability of the generated datasets.
- **Causal Generation Mechanism**: Uses causal relationships to generate various types of user messages and QAs, reducing the impact of LLM hallucinations on factual information and improving the reliability of the QAs.

### Experiments and evaluations

Based on MemSim, the authors created a dataset named MemDaily, which covers multiple aspects of daily-life scenarios. Extensive experiments on MemDaily assess the effectiveness of the approach, and the dataset serves as a benchmark for evaluating different memory mechanisms in LLM-based agents.

### Summary

The main contribution of this paper is an objective and automatic method, proposed for the first time, for evaluating the memory ability of LLM-based personal assistants. It addresses the shortcomings of existing methods in reliability, diversity, and scalability.
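
To make the two components described above more concrete, the following is a minimal Python sketch of the general idea, not the paper's implementation. It assumes a toy three-attribute network with made-up probabilities; the names `CPTS`, `sample_profile`, and `build_message_and_qa` are hypothetical. A profile is drawn by ancestral sampling over a small Bayesian network, and the message, question, and answer are then all derived from the same sampled fact, so the answer cannot contradict the message.

```python
import random

# Toy conditional probability tables (illustrative values, not from the paper).
# Each entry maps: attribute -> (parent attributes, {parent values: {value: prob}}).
# Attributes are listed in topological order, so parents are sampled first.
CPTS = {
    "age_group": ((), {(): {"young": 0.5, "senior": 0.5}}),
    "occupation": (("age_group",), {
        ("young",): {"student": 0.6, "engineer": 0.4},
        ("senior",): {"engineer": 0.3, "retired": 0.7},
    }),
    "hobby": (("occupation",), {
        ("student",): {"gaming": 0.5, "hiking": 0.5},
        ("engineer",): {"hiking": 0.6, "chess": 0.4},
        ("retired",): {"chess": 0.5, "gardening": 0.5},
    }),
}


def sample_profile(rng):
    """Ancestral sampling: draw each attribute conditioned on its sampled parents."""
    profile = {}
    for attr, (parents, table) in CPTS.items():
        dist = table[tuple(profile[p] for p in parents)]
        values, weights = zip(*dist.items())
        profile[attr] = rng.choices(values, weights=weights, k=1)[0]
    return profile


def build_message_and_qa(profile, attr):
    """Derive the message, question, and answer from the same sampled fact.

    In a fuller pipeline an LLM might only rephrase the message text; a fixed
    template stands in for that step here. The answer is copied from the
    structured profile, so it cannot drift from the stated fact.
    """
    label = attr.replace("_", " ")
    message = f"Just so you know, my {label} is {profile[attr]}."
    question = f"What is the user's {label}?"
    return {"message": message, "question": question, "answer": profile[attr]}


rng = random.Random(0)  # fixed seed so repeated runs give the same profile
profile = sample_profile(rng)
qa = build_message_and_qa(profile, "hobby")
print(profile)
print(qa)
```

Sampling from explicit conditional distributions, rather than asking an LLM for a whole profile, is one way to get varied profiles, and keeping the answer tied to the structured value rather than to generated text is what keeps the QA pair reliable in this sketch.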