SpeechAgents: Human-Communication Simulation with Multi-Modal Multi-Agent Systems

Dong Zhang, Zhaowei Li, Pengyu Wang, Xin Zhang, Yaqian Zhou, Xipeng Qiu
2024-01-08
Abstract: Human communication is a complex and diverse process that not only involves multiple factors such as language, commonsense, and cultural backgrounds but also requires the participation of multimodal information, such as speech. Large Language Model (LLM)-based multi-agent systems have demonstrated promising performance in simulating human society. Can we leverage LLM-based multi-agent systems to simulate human communication? However, current LLM-based multi-agent systems mainly rely on text as the primary medium. In this paper, we propose SpeechAgents, a multi-modal LLM-based multi-agent system designed for simulating human communication. SpeechAgents uses a multi-modal LLM as the control center for each individual agent and employs multi-modal signals as the medium for messages exchanged among agents. Additionally, we propose Multi-Agent Tuning to enhance the multi-agent capabilities of the LLM without compromising its general abilities. To strengthen and evaluate the effectiveness of human communication simulation, we build the Human-Communication Simulation Benchmark. Experimental results demonstrate that SpeechAgents can simulate human communication dialogues with consistent content, authentic rhythm, and rich emotions, and that it demonstrates excellent scalability even with up to 25 agents, which can apply to tasks such as drama creation and audio novel generation. Code and models will be open-sourced at https://github.com/0nutation/SpeechAgents
Computation and Language
What problem does this paper attempt to address?
The paper asks how multi-agent systems based on large language models (LLMs) can be used to simulate human multi-modal communication. Current LLM-based multi-agent systems rely mainly on text as the medium for information exchange and lack the ability to perceive and generate multi-modal signals. The paper proposes SpeechAgents, a multi-modal LLM-based multi-agent system that simulates human communication through multi-modal signals such as speech, with the aim of making the simulated communication more authentic and richer. Specifically, the paper focuses on the following aspects:

1. **Use of multi-modal signals**: Current multi-agent systems mainly rely on text, while human communication is a multi-modal process involving factors such as language, emotion, non-verbal expression, and cultural background. The paper proposes using multi-modal signals (such as speech) as the medium for information exchange between agents, so that human communication is simulated more realistically.
2. **Enhancement of multi-agent capabilities**: To improve the performance of LLMs in multi-agent environments, the paper proposes the Multi-Agent Tuning method, which enhances the multi-agent capabilities of LLMs without compromising their general capabilities.
3. **Establishment of evaluation criteria**: To evaluate the effectiveness of multi-modal human communication simulation, the paper constructs the Human-Communication Simulation Benchmark and evaluates the performance of different systems with multiple metrics.

Through these methods, the paper aims to address the deficiencies of existing LLM-based multi-agent systems in multi-modal communication simulation and to promote the development of related technologies.
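The core architectural idea, agents that exchange speech rather than text, can be illustrated with a minimal Python sketch. It is not the SpeechAgents implementation. Every name in it (`SpeechMessage`, `Agent`, `simulate`, `echo_policy`) is hypothetical, the list of integers is only a stand-in for an encoded speech signal, the `policy` callable is a stand-in for the multi-modal LLM that controls each agent, and the round-robin speaking order is an assumption made for simplicity.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class SpeechMessage:
    """A message passed between agents, carried as speech rather than text."""
    sender: str
    units: List[int]  # assumption: placeholder for an encoded speech signal


# Assumption: stand-in for the multi-modal LLM acting as an agent's control center.
# It maps (agent name, dialogue history so far) to the next speech signal.
Policy = Callable[[str, List[SpeechMessage]], List[int]]


@dataclass
class Agent:
    name: str
    policy: Policy

    def speak(self, history: List[SpeechMessage]) -> SpeechMessage:
        # The agent perceives the full speech history and produces a speech reply.
        return SpeechMessage(self.name, self.policy(self.name, history))


def simulate(agents: List[Agent], turns: int) -> List[SpeechMessage]:
    """Run a dialogue in which agents take turns in round-robin order (assumed)."""
    history: List[SpeechMessage] = []
    for t in range(turns):
        speaker = agents[t % len(agents)]
        history.append(speaker.speak(history))
    return history


def echo_policy(name: str, history: List[SpeechMessage]) -> List[int]:
    # Toy policy: emit a single unit equal to the number of messages heard so far.
    return [len(history)]


dialogue = simulate([Agent("A", echo_policy), Agent("B", echo_policy)], turns=4)
```

In a real system the toy policy would be replaced by a model that both understands and generates speech. The point of the sketch is only that the shared history contains speech signals, never text.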