BASES: Large-scale Web Search User Simulation with Large Language Model based Agents

Ruiyang Ren, Peng Qiu, Yingqi Qu, Jing Liu, Wayne Xin Zhao, Hua Wu, Ji-Rong Wen, Haifeng Wang
2024-02-27
Abstract: Due to the strong capabilities of large language models (LLMs), it has become feasible to develop LLM-based agents for reliable user simulation. Considering the scarcity and limitations (e.g., privacy issues) of real user data, in this paper we conduct large-scale user simulation for web search to improve the analysis and modeling of user search behavior. Specifically, we propose BASES, a novel user simulation framework with LLM-based agents, designed to facilitate comprehensive simulations of web search user behaviors. Our simulation framework can generate unique user profiles at scale, which subsequently leads to diverse search behaviors. To demonstrate the effectiveness of BASES, we conduct evaluation experiments based on two human benchmarks in both Chinese and English, demonstrating that BASES can effectively simulate large-scale human-like search behaviors. To further accommodate research on web search, we develop WARRIORS, a new large-scale dataset encompassing web search user behaviors, including both Chinese and English versions, which can greatly bolster research in the field of information retrieval. Our code and data will be publicly released soon.
Information Retrieval, Computation and Language
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper aims to address the issue of large-scale user behavior simulation in web search. Specifically, the paper proposes a novel user simulation framework called BASES, based on large language model (LLM) agents, to generate diverse user profiles and precise personalized user behaviors. In this way, BASES can effectively simulate large-scale human search behaviors, thereby improving the analysis and modeling of user search behaviors.

### Background and Motivation

1. **Scarcity and Limitations of Real User Data**:
   - The acquisition cost of real user data is high and there are privacy concerns.
   - The quality and completeness of the data may affect the accuracy of the analysis.
2. **Importance of User Behavior Simulation**:
   - By simulating user behavior, we can better understand users' search needs, thereby developing more effective search systems.
   - Simulating user behavior can reduce the reliance on real user experiments, improving the efficiency and reliability of research.

### Solution

1. **BASES Framework**:
   - **User Profile Construction**: Designed a user profile structure that includes static and dynamic attributes, ensuring each simulated user has a unique profile.
   - **LLM Agent**: Utilized large language model agents to simulate users' query and click behaviors, generating precise and personalized user behaviors.
   - **Behavior Prompt Strategies**: Designed query behavior prompts and click behavior prompt strategies to ensure the agent can effectively simulate human user behaviors.
2. **Evaluation and Validation**:
   - Evaluated on both Chinese and English benchmark datasets, proving the effectiveness of BASES.
   - By generating large-scale simulated user behavior data, constructed a new large-scale dataset WARRIORS, covering both Chinese and English versions of user search behaviors.

### Main Contributions

1. **Proposed BASES Framework**: A large-scale web search user behavior simulation framework based on LLM agents, capable of generating diverse user profiles and precise user behaviors.
2. **Performance Improvement**: In multiple information retrieval tasks, the BASES framework significantly improved model performance, especially in low-resource scenarios.
3. **Constructed WARRIORS Dataset**: Systematically collected and organized large-scale simulated user search behavior data, releasing a new dataset WARRIORS containing 100,000 user search sessions.

### Conclusion

Through the BASES framework, researchers can efficiently and accurately simulate large-scale web search user behaviors, thereby promoting research and development in the field of information retrieval. The release of the WARRIORS dataset will further facilitate the progress of related research.
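
To make the Solution section more concrete, the following is a minimal Python sketch of how a profile with static and dynamic attributes could drive a query prompt and a click prompt. It is not the authors' implementation: the summary does not give the profile fields, the prompt wording, or the agent interface, so `UserProfile`, `build_query_prompt`, `build_click_prompt`, `simulate_step`, and the stand-in `fake_llm` and `fake_search` functions are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class UserProfile:
    """Hypothetical profile: static attributes stay fixed, dynamic ones may change."""
    static: Dict[str, str]
    dynamic: Dict[str, str] = field(default_factory=dict)

    def describe(self) -> str:
        # Merge both attribute groups into one line of text for the prompt.
        merged = {**self.static, **self.dynamic}
        return "; ".join(f"{k}: {v}" for k, v in sorted(merged.items()))


def build_query_prompt(profile: UserProfile, history: List[str]) -> str:
    """Assumed wording for a query behavior prompt."""
    past = ", ".join(history) if history else "none"
    return (
        f"You are a web search user. Profile: {profile.describe()}.\n"
        f"Previous queries: {past}.\n"
        "Write the next search query this user would issue."
    )


def build_click_prompt(profile: UserProfile, query: str, results: List[str]) -> str:
    """Assumed wording for a click behavior prompt."""
    listing = "\n".join(f"{i}. {title}" for i, title in enumerate(results, start=1))
    return (
        f"You are a web search user. Profile: {profile.describe()}.\n"
        f"You searched for: {query}\nResults:\n{listing}\n"
        "Which result would this user click? Answer with its number, or 0 for none."
    )


def simulate_step(
    profile: UserProfile,
    history: List[str],
    search: Callable[[str], List[str]],
    llm: Callable[[str], str],
) -> Dict[str, object]:
    """One query-then-click round; `llm` and `search` are supplied by the caller."""
    query = llm(build_query_prompt(profile, history)).strip()
    results = search(query)
    reply = llm(build_click_prompt(profile, query, results)).strip()
    # Anything that is not a valid rank is treated as "no click".
    rank = int(reply) if reply.isdecimal() else 0
    clicked: Optional[str] = results[rank - 1] if 1 <= rank <= len(results) else None
    return {"query": query, "results": results, "clicked": clicked}


def fake_llm(prompt: str) -> str:
    """Deterministic stand-in for a real LLM call, for demonstration only."""
    return "2" if "Which result" in prompt else "budget hiking boots"


def fake_search(query: str) -> List[str]:
    """Stand-in for a search engine: returns three placeholder result titles."""
    return [f"{query} - result {i}" for i in range(1, 4)]
```

In this sketch the LLM and the search engine are passed in as plain callables, so the same loop can be exercised with the deterministic stand-ins above or with a real model and retrieval backend. Repeating `simulate_step` while appending each query to `history` would yield one simulated session per profile.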