Abstract: LLM-based agents have been widely applied as personal assistants, capable of memorizing information from user messages and responding to personal queries. However, there is still no objective and automatic evaluation of their memory capability, largely due to the challenges in constructing reliable questions and answers (QAs) from user messages. In this paper, we propose MemSim, a Bayesian simulator designed to automatically construct reliable QAs from generated user messages while preserving their diversity and scalability. Specifically, we introduce the Bayesian Relation Network (BRNet) and a causal generation mechanism to mitigate the impact of LLM hallucinations on factual information, facilitating the automatic creation of an evaluation dataset. Based on MemSim, we generate a dataset in the daily-life scenario, named MemDaily, and conduct extensive experiments to assess the effectiveness of our approach. We also provide a benchmark for evaluating different memory mechanisms in LLM-based agents with the MemDaily dataset. To benefit the research community, we have released our project at https://github.com/nuster1128/MemSim.

What problem does this paper attempt to address?

This paper addresses the lack of objective, automatic methods for evaluating the memory ability of personal assistants based on large language models (LLMs). Existing methods struggle to construct reliable questions and answers (QAs), and in particular to ensure the reliability, diversity, and scalability of the generated datasets.

### Specific description of the problem

1. **Reliability**: Existing methods are easily affected by LLM hallucinations when generating datasets, resulting in inaccurate factual information. For example, in complex scenarios, the accuracy of LLM-generated content may fall below 40%.
2. **Diversity**: User profiles generated by LLMs often lack diversity and tend toward the most likely, least varied profiles.
3. **Scalability**: Manually annotating real user messages and question-answer pairs requires a large amount of human labor and is difficult to scale.

### Solutions proposed in the paper

To address these problems, the authors propose MemSim, a Bayesian simulator that automatically constructs reliable QAs from generated user messages while maintaining their diversity and scalability. It has two main components:

- **Bayesian Relation Network (BRNet)**: Generates user profiles with a hierarchical structure, improving the diversity and scalability of the generated datasets.
- **Causal Generation Mechanism**: Uses causal relationships to generate various types of user messages and QAs, reducing the impact of LLM hallucinations on factual information and improving the reliability of the QAs.

### Experiments and evaluations

Based on MemSim, the authors created a dataset named MemDaily, which covers multiple aspects of daily-life scenarios. Extensive experiments on MemDaily evaluate how different memory mechanisms perform, and the dataset provides a benchmark for the memory abilities of LLM-based agents.

### Summary

The main contribution of this paper is that it proposes, for the first time, an objective and automatic method for evaluating the memory ability of LLM-based personal assistants, addressing the shortcomings of existing methods in reliability, diversity, and scalability.
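
The two ideas named in the summary, sampling hierarchical user profiles from a Bayesian network and deriving messages and QAs from the same structured facts, can be illustrated with a minimal sketch. This is not the authors' implementation: the attribute names, probabilities, and the `sample_profile` and `build_message_and_qa` helpers are hypothetical, and a fixed template stands in for any LLM-based text generation.

```python
import random

# Hypothetical toy network. Each attribute lists its parents and a conditional
# probability table (CPT). All names and numbers are illustrative assumptions.
NETWORK = {
    "age_group": {
        "parents": [],
        "cpt": {(): {"young": 0.5, "senior": 0.5}},
    },
    "occupation": {
        "parents": ["age_group"],
        "cpt": {
            ("young",): {"student": 0.6, "engineer": 0.4},
            ("senior",): {"teacher": 0.5, "retired": 0.5},
        },
    },
    "hobby": {
        "parents": ["occupation"],
        "cpt": {
            ("student",): {"gaming": 0.7, "hiking": 0.3},
            ("engineer",): {"chess": 0.5, "hiking": 0.5},
            ("teacher",): {"reading": 0.8, "chess": 0.2},
            ("retired",): {"gardening": 0.6, "reading": 0.4},
        },
    },
}


def sample_profile(network, rng):
    """Ancestral sampling: visit attributes parents-first and draw each value
    from the CPT row selected by the already-sampled parent values."""
    profile = {}
    for name, node in network.items():  # dict order is parents-first here
        key = tuple(profile[parent] for parent in node["parents"])
        dist = node["cpt"][key]
        values = list(dist)
        weights = [dist[v] for v in values]
        profile[name] = rng.choices(values, weights=weights, k=1)[0]
    return profile


def build_message_and_qa(profile, attribute):
    """Derive the user message and the QA pair from the same structured fact,
    so the reference answer never depends on free-form generated text."""
    value = profile[attribute]
    label = attribute.replace("_", " ")
    message = f"Just so you know, my {label} is {value}."
    qa = {"question": f"What is the user's {label}?", "answer": value}
    return message, qa


if __name__ == "__main__":
    rng = random.Random(0)
    profile = sample_profile(NETWORK, rng)
    message, qa = build_message_and_qa(profile, "hobby")
    print(profile)
    print(message)
    print(qa)
```

Because the answer is read from the sampled profile rather than extracted from generated text, a hallucination in the wording of a message cannot silently change the ground truth. That is the property the summary attributes to the causal generation mechanism.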

MemSim: A Bayesian Simulator for Evaluating Memory of LLM-based Personal Assistants

LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory

MemBench: Towards Real-world Evaluation of Memory-Augmented Dialogue Systems

MemoryBank: Enhancing Large Language Models with Long-Term Memory

A Survey on the Memory Mechanism of Large Language Model based Agents

MADial-Bench: Towards Real-world Evaluation of Memory-Augmented Dialogue Generation

Memoro: Using Large Language Models to Realize a Concise Interface for Real-Time Memory Augmentation

Memory Sharing for Large Language Model based Agents

FinMem: A Performance-Enhanced LLM Trading Agent with Layered Memory and Character Design

LDM$^2$: A Large Decision Model Imitating Human Cognition with Dynamic Memory Enhancement

"My agent understands me better": Integrating Dynamic Human-like Memory Recall and Consolidation in LLM-Based Agents

Hello Again! LLM-powered Personalized Agent for Long-term Dialogue

Empowering Working Memory for Large Language Model Agents

ChatDB: Augmenting LLMs with Databases as Their Symbolic Memory

Personalized Large Language Model Assistant with Evolving Conditional Memory

Human Simulacra: Benchmarking the Personification of Large Language Models

PerLTQA: A Personal Long-Term Memory Dataset for Memory Classification, Retrieval, and Synthesis in Question Answering

LLM-based Medical Assistant Personalization with Short- and Long-Term Memory Coordination

MaxMind: A Memory Loop Network to Enhance Software Productivity based on Large Language Models

SimulBench: Evaluating Language Models with Creative Simulation Tasks