Abstract: Large Language Models (LLMs) are frequently used for multi-faceted language generation and evaluation tasks that involve satisfying intricate user constraints or taking into account multiple aspects and criteria. However, their performance can fall short, due to the model's lack of coherence and inability to plan and decompose the problem. We propose Branch-Solve-Merge (BSM), a Large Language Model program (Schlag et al., 2023) for tackling such challenging natural language tasks. It consists of branch, solve, and merge modules that are parameterized with specific prompts to the base LLM. These three modules plan a decomposition of the task into multiple parallel sub-tasks, independently solve them, and fuse the solutions to the sub-tasks. We apply our method to the tasks of LLM response evaluation and constrained text generation and evaluate its effectiveness with multiple LLMs, including Vicuna, LLaMA-2-chat, and GPT-4. BSM improves the evaluation correctness and consistency for each LLM by enhancing human-LLM agreement by up to 26%, reducing length and pairwise position biases by up to 50%, and allowing LLaMA2-chat to match or outperform GPT-4 on most domains. On a constraint story generation task, BSM improves the coherence of stories while also improving constraint satisfaction by 12%.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper addresses several key issues that large language models (LLMs) face when handling multifaceted language generation and evaluation tasks:

1. **Insufficient Capability in Handling Complex Tasks**:
   - Current LLMs perform poorly on tasks that require satisfying multiple constraints or considering multiple aspects and criteria. Such tasks often involve complex user requirements, such as generating text that meets specific criteria or evaluating generated text along several dimensions.
2. **Lack of Coherence and Planning Ability**:
   - LLMs often lack coherence and fail to plan and decompose problems effectively when handling these tasks. This leads to logical inconsistencies or drift from the topic when generating long-form answers.
3. **Evaluation Bias**:
   - LLMs used for automatic evaluation show several biases, such as positional bias (the verdict changes with the order in which responses are presented), length bias (favoring longer responses), and self-enhancement bias (favoring their own responses).

### Proposed Solution

To address these issues, the authors propose the Branch-Solve-Merge (BSM) method. BSM is a Large Language Model program that tackles complex natural language tasks through three modules, each parameterized with a specific prompt to the base LLM:

1. **Branch Module**:
   - Decomposes the task into multiple parallel subtasks. Each subtask corresponds to one branch and covers a different component needed to solve the overall problem.
2. **Solve Module**:
   - Solves each subtask independently, with the base LLM generating a solution for each one.
3. **Merge Module**:
   - Fuses the solutions to the subtasks into the final overall solution, with the aim of keeping that solution coherent and consistent.
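The three modules described above can be sketched as a small Python program. This is a minimal illustration, not the authors' implementation: the `LLM` callable type, the prompt wording, the line-per-subtask plan format, and the `toy_llm` stub are all assumptions made for the example. The sub-tasks are solved in a plain loop here, although they are independent and could run in parallel.

```python
from typing import Callable, List

# Hypothetical interface: any function that maps a prompt to a completion.
LLM = Callable[[str], str]


def branch(llm: LLM, task: str) -> List[str]:
    """Ask the model for a plan and read one sub-task per non-empty line."""
    plan = llm(f"List the independent sub-tasks needed to solve:\n{task}")
    cleaned = (line.strip("-* \t") for line in plan.splitlines())
    return [line for line in cleaned if line]


def solve(llm: LLM, task: str, subtask: str) -> str:
    """Solve a single sub-task without seeing the other sub-tasks."""
    return llm(f"Task: {task}\nSolve only this sub-task: {subtask}")


def merge(llm: LLM, task: str, solutions: List[str]) -> str:
    """Fuse the partial solutions into one final answer."""
    joined = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(solutions))
    return llm(f"Task: {task}\nCombine these partial solutions into one answer:\n{joined}")


def branch_solve_merge(llm: LLM, task: str) -> str:
    subtasks = branch(llm, task)
    solutions = [solve(llm, task, s) for s in subtasks]
    return merge(llm, task, solutions)


def toy_llm(prompt: str) -> str:
    """Deterministic stand-in for a real model, for illustration only."""
    if prompt.startswith("List the independent sub-tasks"):
        return "- relevance\n- accuracy\n- clarity"
    if "Solve only this sub-task:" in prompt:
        return "checked " + prompt.rsplit(": ", 1)[-1]
    # Merge prompt: count the numbered partial solutions that follow the header.
    partials = prompt.split("answer:\n", 1)[1].splitlines()
    return f"merged {len(partials)} partial solutions"


if __name__ == "__main__":
    print(branch_solve_merge(toy_llm, "Evaluate a response"))
```

Replacing `toy_llm` with a call to an actual model is the only change needed to try the same control flow on a real task; the prompts would then need to be adapted to that task.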
### Application Cases

The authors apply the BSM method to two challenging tasks:

1. **LLM Output Evaluation**:
   - Evaluates the quality of responses generated by LLMs, especially in multi-turn dialogues. BSM improves the accuracy and consistency of evaluations by generating an evaluation plan, evaluating each subtask independently, and merging the results.
2. **Constrained Text Generation**:
   - Generates stories that meet specific constraints. BSM improves the coherence and constraint satisfaction of the stories by decomposing the story into multiple subtasks, generating a part of the story for each subtask independently, and merging the parts into a complete story.

### Experimental Results

Experimental results show that BSM improves task performance across multiple LLMs:

- **Improved Evaluation Accuracy and Consistency**:
  - On the MT-Bench benchmark, BSM improved the agreement between LLM and human evaluations and reduced positional bias, length bias, and self-enhancement bias.
- **Improved Quality of Constrained Text Generation**:
  - On the constrained story generation task, BSM-generated stories were more coherent and better met the constraints, with a 12% improvement in constraint satisfaction.

Overall, BSM provides an effective approach to complex multifaceted language generation and evaluation tasks through task decomposition and planning.
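Positional bias, as described above, means the verdict changes when the order of the two responses is swapped. One simple way to quantify it is sketched below. This is an assumed metric for illustration and not necessarily the one used in the paper; the `Judge` interface (returning `"first"`, `"second"`, or `"tie"`) and the two toy judges are hypothetical.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: judge(question, shown_first, shown_second)
# returns "first", "second", or "tie".
Judge = Callable[[str, str, str], str]


def position_inconsistency(judge: Judge, samples: List[Tuple[str, str, str]]) -> float:
    """Fraction of (question, a, b) samples whose winner changes when a and b swap places."""
    if not samples:
        return 0.0
    flips = 0
    for question, a, b in samples:
        # Verdict with a shown first, expressed as a label for the response itself.
        original = {"first": "A", "second": "B"}.get(judge(question, a, b), "tie")
        # Verdict with b shown first, mapped back to the same labels.
        swapped = {"first": "B", "second": "A"}.get(judge(question, b, a), "tie")
        flips += original != swapped
    return flips / len(samples)


def first_position_judge(question: str, x: str, y: str) -> str:
    """Toy judge that always prefers whichever response is shown first."""
    return "first"


def longer_response_judge(question: str, x: str, y: str) -> str:
    """Toy judge that prefers the longer response, regardless of position."""
    if len(x) == len(y):
        return "tie"
    return "first" if len(x) > len(y) else "second"


if __name__ == "__main__":
    data = [("q1", "short", "a longer answer"), ("q2", "detailed reply", "ok")]
    print(position_inconsistency(first_position_judge, data))   # 1.0
    print(position_inconsistency(longer_response_judge, data))  # 0.0
```

The second toy judge is perfectly consistent under swapping yet fully length-biased, which shows why the two biases have to be measured separately.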

Branch-Solve-Merge Improves Large Language Model Evaluation and Generation
