Branch-Solve-Merge Improves Large Language Model Evaluation and Generation

Swarnadeep Saha, Omer Levy, Asli Celikyilmaz, Mohit Bansal, Jason Weston, Xian Li
2024-06-08
Abstract: Large Language Models (LLMs) are frequently used for multi-faceted language generation and evaluation tasks that involve satisfying intricate user constraints or taking into account multiple aspects and criteria. However, their performance can fall short, due to the model's lack of coherence and inability to plan and decompose the problem. We propose Branch-Solve-Merge (BSM), a Large Language Model program (Schlag et al., 2023) for tackling such challenging natural language tasks. It consists of branch, solve, and merge modules that are parameterized with specific prompts to the base LLM. These three modules plan a decomposition of the task into multiple parallel sub-tasks, independently solve them, and fuse the solutions to the sub-tasks. We apply our method to the tasks of LLM response evaluation and constrained text generation and evaluate its effectiveness with multiple LLMs, including Vicuna, LLaMA-2-chat, and GPT-4. BSM improves the evaluation correctness and consistency for each LLM by enhancing human-LLM agreement by up to 26%, reducing length and pairwise position biases by up to 50%, and allowing LLaMA2-chat to match or outperform GPT-4 on most domains. On a constraint story generation task, BSM improves the coherence of stories while also improving constraint satisfaction by 12%.
Computation and Language, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper addresses several problems that large language models (LLMs) face on multifaceted language generation and evaluation tasks:

1. **Insufficient Capability in Handling Complex Tasks**:
   - Current LLMs perform poorly on tasks that require satisfying multiple constraints or weighing multiple aspects and criteria. Such tasks often involve intricate user requirements, such as generating text that meets specific criteria or evaluating generated text along several dimensions.
2. **Lack of Coherence and Planning Ability**:
   - LLMs often lack coherence and fail to plan and decompose the problem when handling these tasks. This leads to logical inconsistencies or drift from the topic in long-form answers.
3. **Evaluation Bias**:
   - LLMs used as automatic evaluators show several biases: position bias (the verdict changes with the order in which the responses are presented), length bias (longer responses are favored), and self-enhancement bias (the evaluator favors its own responses).

### Proposed Solution

To address these issues, the authors propose Branch-Solve-Merge (BSM), an LLM program built from three modules, each parameterized with its own prompt to the base LLM:

1. **Branch Module**:
   - Plans a decomposition of the task into multiple parallel sub-tasks. Each sub-task is a separate branch that covers one component of the overall problem.
2. **Solve Module**:
   - Solves each sub-task independently by prompting the base LLM, producing one solution per branch.
3. **Merge Module**:
   - Fuses the sub-task solutions into the final solution by prompting the base LLM, with the aim of keeping the result coherent and consistent.
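The three-module structure can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the prompt wording, the function names, and the `toy_llm` stand-in (used so the example runs without any model) are all assumptions made for this sketch.

```python
from typing import Callable, List

# Hypothetical interface: any function that maps a prompt string to a text reply.
LLM = Callable[[str], str]


def branch(llm: LLM, task: str) -> List[str]:
    """Ask the model for a plan and read one sub-task per non-empty line."""
    plan = llm(f"List the sub-tasks needed to solve the task, one per line.\nTask: {task}")
    return [line.strip("- ").strip() for line in plan.splitlines() if line.strip()]


def solve(llm: LLM, task: str, subtask: str) -> str:
    """Solve a single sub-task independently of the others."""
    return llm(f"Task: {task}\nSolve only this sub-task: {subtask}")


def merge(llm: LLM, task: str, solutions: List[str]) -> str:
    """Fuse the per-branch solutions into one final answer."""
    joined = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(solutions))
    return llm(f"Task: {task}\nCombine these partial solutions into one answer:\n{joined}")


def branch_solve_merge(llm: LLM, task: str) -> str:
    subtasks = branch(llm, task)
    # The branches do not depend on each other, so these calls could run in parallel.
    solutions = [solve(llm, task, s) for s in subtasks]
    return merge(llm, task, solutions)


def toy_llm(prompt: str) -> str:
    """Deterministic stand-in for a real model (assumption for this sketch only)."""
    if prompt.startswith("List the sub-tasks"):
        return "relevance\nclarity"
    if "Solve only this sub-task:" in prompt:
        return "solved " + prompt.rsplit(": ", 1)[1]
    return f"merged {prompt.count('solved ')} solutions"


if __name__ == "__main__":
    print(branch_solve_merge(toy_llm, "Compare two responses"))
```

Swapping `toy_llm` for a function that calls a real model leaves the control flow unchanged; only the three prompts would need to be adapted to the task at hand.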
### Application Cases

The authors apply BSM to two challenging tasks:

1. **LLM Output Evaluation**:
   - Evaluates the quality of responses generated by LLMs, especially in multi-turn dialogues. BSM improves the accuracy and consistency of evaluations by generating an evaluation plan, evaluating each sub-task independently, and merging the results.
2. **Constrained Text Generation**:
   - Generates stories that meet specific constraints. BSM improves the coherence and constraint satisfaction of the stories by decomposing the story into multiple sub-tasks, generating a part of the story for each sub-task independently, and merging the parts into a complete story.

### Experimental Results

Experimental results show that BSM improves task performance across multiple LLMs:

- **Improved Evaluation Accuracy and Consistency**:
  - On the MT-Bench benchmark, BSM improved agreement between LLM and human evaluations and reduced position bias, length bias, and self-enhancement bias.
- **Improved Quality of Constrained Text Generation**:
  - On the constrained story generation task, BSM-generated stories were more coherent and better met the constraints, with a 12% improvement in constraint satisfaction.

Overall, BSM provides an effective approach to complex, multifaceted language generation and evaluation tasks through task decomposition and planning.
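Position bias in pairwise evaluation can be quantified by judging each pair in both presentation orders and counting how often the verdict changes. The sketch below shows one plausible way to do this; it is not necessarily the exact metric used in the paper. The `Judge` interface, the verdict strings, and the two toy judges are assumptions introduced for illustration.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical interface: (question, first_response, second_response)
# -> "first", "second", or "tie".
Judge = Callable[[str, str, str], str]


def consistent_verdict(judge: Judge, question: str, a: str, b: str) -> Optional[str]:
    """Return "A", "B", or "tie" if both orders agree, else None."""
    forward = judge(question, a, b)   # A is shown first
    backward = judge(question, b, a)  # B is shown first
    label_forward = {"first": "A", "second": "B", "tie": "tie"}[forward]
    label_backward = {"first": "B", "second": "A", "tie": "tie"}[backward]
    return label_forward if label_forward == label_backward else None


def position_bias_rate(judge: Judge, samples: List[Tuple[str, str, str]]) -> float:
    """Fraction of samples whose verdict changes when the response order is swapped."""
    flipped = sum(consistent_verdict(judge, q, a, b) is None for q, a, b in samples)
    return flipped / len(samples)


def always_first(question: str, first: str, second: str) -> str:
    """Toy judge that always prefers whichever response it sees first."""
    return "first"


def longer_wins(question: str, first: str, second: str) -> str:
    """Toy judge that is order-independent (though length-biased)."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"


samples = [
    ("q1", "short", "a much longer answer"),
    ("q2", "xx", "y"),
]
```

With these toy judges, `always_first` changes its verdict on every sample when the order is swapped, while `longer_wins` never does, which illustrates that a judge can be free of position bias and still carry length bias.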