Mixture-of-Agents Enhances Large Language Model Capabilities

Junlin Wang, Jue Wang, Ben Athiwaratkun, Ce Zhang, James Zou
2024-06-07
Abstract: Recent advances in large language models (LLMs) demonstrate substantial capabilities in natural language understanding and generation tasks. With the growing number of LLMs, how to harness the collective expertise of multiple LLMs is an exciting open direction. Toward this goal, we propose a new approach that leverages the collective strengths of multiple LLMs through a Mixture-of-Agents (MoA) methodology. In our approach, we construct a layered MoA architecture wherein each layer comprises multiple LLM agents. Each agent takes all the outputs from agents in the previous layer as auxiliary information in generating its response. MoA models achieve state-of-the-art performance on AlpacaEval 2.0, MT-Bench and FLASK, surpassing GPT-4 Omni. For example, our MoA using only open-source LLMs is the leader of AlpacaEval 2.0 by a substantial gap, achieving a score of 65.1% compared to 57.5% by GPT-4 Omni.
Computation and Language
What problem does this paper attempt to address?
This paper proposes Mixture-of-Agents (MoA), a method for harnessing the collective strengths of multiple large language models (LLMs). As LLMs advance in natural language understanding and generation, combining the expertise of different models has become an open challenge. The paper observes that an LLM often produces a better response when it is given the outputs of other models, even when those outputs are of lower quality; the paper refers to this as the collaborativeness of LLMs. MoA builds on this observation with a multi-layered structure in which each layer consists of multiple LLM agents, and each agent uses the outputs of all agents in the previous layer as auxiliary information when generating its own response. With this design, MoA achieves state-of-the-art performance on AlpacaEval 2.0, MT-Bench, and FLASK. For example, using only open-source LLMs, MoA scores 65.1% on AlpacaEval 2.0, compared with 57.5% for GPT-4 Omni. The paper also emphasizes choosing a diverse set of LLMs, with performance metrics and output diversity as the selection criteria, so that the models compensate for one another's individual weaknesses and the overall response quality improves. The experiments indicate that the approach is effective across these benchmarks, particularly for reasoning and language generation. Finally, the paper relates MoA to the Mixture-of-Experts (MoE) method: MoA operates at the model level rather than the activation level, so it can work through the existing interfaces of LLMs without modifying their internals.
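The layered structure described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `Agent`, `build_prompt`, `mixture_of_agents`, and `make_stub_agent` are hypothetical, the prompt wording is invented for illustration, and the use of a single final agent to produce one answer is an assumption of this sketch. Real LLM calls are replaced by stub functions so the example runs on its own.

```python
from typing import Callable, List

# Assumption: an "agent" is any function mapping a prompt string to a response string.
Agent = Callable[[str], str]


def build_prompt(user_prompt: str, previous: List[str]) -> str:
    """Attach the previous layer's responses to the user prompt as auxiliary context."""
    if not previous:
        # First layer: there are no earlier responses, so the prompt is unchanged.
        return user_prompt
    numbered = "\n".join(f"{i}. {r}" for i, r in enumerate(previous, start=1))
    # Hypothetical wording; the source does not specify the prompt text.
    return (
        "Responses from other models (they may contain errors):\n"
        f"{numbered}\n\n"
        f"Using them as auxiliary information, answer:\n{user_prompt}"
    )


def mixture_of_agents(user_prompt: str, layers: List[List[Agent]], final_agent: Agent) -> str:
    """Run the prompt through each layer, then let one final agent produce the answer."""
    previous: List[str] = []
    for layer in layers:
        # Every agent in a layer sees ALL outputs of the previous layer.
        prompt = build_prompt(user_prompt, previous)
        previous = [agent(prompt) for agent in layer]
    # Assumption of this sketch: a single agent condenses the last layer's outputs.
    return final_agent(build_prompt(user_prompt, previous))


def make_stub_agent(name: str) -> Agent:
    """Stand-in for a real LLM call; it only reports its name and the prompt length."""
    def agent(prompt: str) -> str:
        return f"{name} answered a {len(prompt)}-character prompt"
    return agent


if __name__ == "__main__":
    layers = [
        [make_stub_agent("a1"), make_stub_agent("a2")],
        [make_stub_agent("b1"), make_stub_agent("b2")],
    ]
    print(mixture_of_agents("What is MoA?", layers, make_stub_agent("final")))
```

Because each agent is just a text-in, text-out function, the sketch touches only model interfaces, which mirrors the point that MoA works at the model level rather than inside the network. Swapping a stub for a call to an actual model would not change the control flow.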