Abstract: We present LLM-Blender, an ensembling framework designed to attain consistently superior performance by leveraging the diverse strengths of multiple open-source large language models (LLMs). Our framework consists of two modules: PairRanker and GenFuser, addressing the observation that optimal LLMs for different examples can significantly vary. PairRanker employs a specialized pairwise comparison method to distinguish subtle differences between candidate outputs. It jointly encodes the input text and a pair of candidates, using cross-attention encoders to determine the superior one. Our results demonstrate that PairRanker exhibits the highest correlation with ChatGPT-based ranking. Then, GenFuser aims to merge the top-ranked candidates, generating an improved output by capitalizing on their strengths and mitigating their weaknesses. To facilitate large-scale evaluation, we introduce a benchmark dataset, MixInstruct, which is a mixture of multiple instruction datasets featuring oracle pairwise comparisons. Our LLM-Blender significantly outperforms individual LLMs and baseline methods across various metrics, establishing a substantial performance gap.

What problem does this paper attempt to address?

The paper attempts to address the problem of how to achieve consistent and superior performance in instruction-following tasks by integrating multiple open-source large language models (LLMs). Specifically, the paper proposes a framework called LLM-BLENDER, which aims to leverage the diverse strengths of different LLMs to achieve better performance than using any single model alone. LLM-BLENDER consists of two modules: PAIRRANKER and GENFUSER.

1. **PAIRRANKER**: This module distinguishes subtle differences between candidate outputs through a specialized pairwise comparison method to determine which output is better. It jointly encodes the input text and a pair of candidate outputs, using a cross-attention encoder to decide which one is superior. The results of PAIRRANKER show that it has the highest correlation with ChatGPT-based rankings.
2. **GENFUSER**: The goal of this module is to merge the top-ranked candidate outputs to generate an improved final output by leveraging the strengths of these candidates and reducing their weaknesses.

To facilitate large-scale evaluation, the authors also introduce a benchmark dataset called MixInstruct, which is a mixture of various instruction datasets with authoritative pairwise comparison results.

The main contributions of the paper are:

- Proposing an effective framework that can dynamically combine the outputs of multiple LLMs to generate consistent and higher-quality responses for each input.
- Experimentally validating that LLM-BLENDER significantly outperforms individual LLMs and baseline methods on multiple evaluation metrics, particularly excelling in metrics such as GPT-Rank, BERTScore, and BARTScore.
- Highlighting the significant performance differences of different LLMs on different examples, with no single open-source LLM performing best in all cases, thus emphasizing the importance of dynamically integrating these models to improve overall performance.

In summary, the paper addresses the problem of effectively integrating multiple LLMs to enhance performance in instruction-following tasks by proposing the LLM-BLENDER framework, providing new insights and tools for future research and applications.
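
The rank-then-fuse flow described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The summary does not say how pairwise decisions are combined into a full ranking, so the sketch assumes a simple win count. The names `rank_by_pairwise_wins`, `select_top_k`, and `toy_prefer_first` are hypothetical, and the toy comparator (prefer the longer candidate) only stands in for a trained cross-attention ranker.

```python
from itertools import combinations
from typing import Callable, List

# Hypothetical comparator type: returns True when the first candidate
# is judged better than the second for the given input text.
Comparator = Callable[[str, str, str], bool]


def rank_by_pairwise_wins(
    prompt: str, candidates: List[str], prefer_first: Comparator
) -> List[int]:
    """Rank candidate indices by the number of pairwise comparisons won.

    Assumption: win counting is one simple aggregation choice; the
    summary above does not specify which aggregation is used.
    """
    wins = [0] * len(candidates)
    for i, j in combinations(range(len(candidates)), 2):
        if prefer_first(prompt, candidates[i], candidates[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    # Most wins first; ties are broken by original position.
    return sorted(range(len(candidates)), key=lambda k: (-wins[k], k))


def select_top_k(candidates: List[str], ranking: List[int], k: int) -> List[str]:
    """Return the k best candidates, i.e. what a fusion model would receive."""
    return [candidates[i] for i in ranking[:k]]


def toy_prefer_first(prompt: str, a: str, b: str) -> bool:
    """Placeholder comparator (assumption): prefers the longer candidate."""
    return len(a) >= len(b)


if __name__ == "__main__":
    outputs = ["a", "abc", "ab"]
    order = rank_by_pairwise_wins("some instruction", outputs, toy_prefer_first)
    print(order, select_top_k(outputs, order, k=2))
```

In a real pipeline, the comparator would be a learned model that jointly encodes the input and both candidates, and the selected candidates would be passed to a generative fusion model rather than returned directly.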

LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion

Ensemble Learning for Heterogeneous Large Language Models with Deep Parallel Collaboration

Enabling Ensemble Learning for Heterogeneous Large Language Models with Deep Parallel Collaboration.

ProFuser: Progressive Fusion of Large Language Models

Cool-Fusion: Fuse Large Language Models without Training

CharED: Character-wise Ensemble Decoding for Large Language Models

Bridging the Gap between Different Vocabularies for LLM Ensemble

ToBlend: Token-Level Blending With an Ensemble of LLMs to Attack AI-Generated Text Detection

Knowledge Fusion of Large Language Models

JudgeBlender: Ensembling Judgments for Automatic Relevance Assessment

LLM-TOPLA: Efficient LLM Ensemble by Maximising Diversity

SpecFuse: Ensembling Large Language Models via Next-Segment Prediction

On-the-Fly Fusion of Large Language Models and Machine Translation

Demonstrative Instruction Following in Multimodal LLMs Via Integrating Low-Rank Adaptation with Ensemble Learning

ULLME: A Unified Framework for Large Language Model Embeddings with Generation-Augmented Learning

Examining Inter-Consistency of Large Language Models Collaboration: An In-depth Analysis via Debate

Mixture-of-Agents Enhances Large Language Model Capabilities

Supervised Knowledge Makes Large Language Models Better In-context Learners

LLM Chain Ensembles for Scalable and Accurate Data Annotation

Fusion-Eval: Integrating Evaluators with LLMs

A Two-Stage Adaptation of Large Language Models for Text Ranking