Abstract: Combining large language models during training or at inference time has shown substantial performance gains over the component LLMs. This paper presents LLM-TOPLA, a diversity-optimized LLM ensemble method with three unique properties: (i) We introduce the focal diversity metric to capture the diversity-performance correlation among the component LLMs of an ensemble. (ii) We develop a diversity-optimized ensemble pruning algorithm to select the top-k sub-ensembles from a pool of $N$ base LLMs. Our pruning method recommends top-performing LLM sub-ensembles of size $S$, often much smaller than $N$. (iii) We generate new output for each prompt query with a learn-to-ensemble approach, which learns to detect and resolve output inconsistencies among all component LLMs of an ensemble. Extensive evaluation on four different benchmarks shows performance gains over the best LLM ensemble methods: (i) In constrained solution set problems, LLM-TOPLA outperforms the best-performing ensemble (Mixtral) by 2.2\% in accuracy on MMLU and the best-performing LLM ensemble (MoreAgent) on GSM8k by 2.1\%. (ii) In generative tasks, LLM-TOPLA outperforms the top-2 performers (Llama70b/Mixtral) on SearchQA by $3.9\mathrm{x}$ in F1, and on XSum by more than $38$ in ROUGE-1. Our code and dataset, which contains the outputs of 8 modern LLMs on 4 benchmarks, are available at https://github.com/git-disl/llm-topla
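
The pruning step in (ii) can be pictured with a small sketch. It enumerates every sub-ensemble of size $S$ from a pool of base models, scores each one, and keeps the top-k. The scoring function here is mean pairwise disagreement, a deliberately simple stand-in and not the paper's focal diversity metric, whose definition the abstract does not give. The model names and correctness vectors are hypothetical.

```python
from itertools import combinations

def disagreement(correct_a, correct_b):
    """Fraction of examples on which exactly one of the two models is correct."""
    return sum(a != b for a, b in zip(correct_a, correct_b)) / len(correct_a)

def ensemble_diversity(members, correctness):
    """Mean pairwise disagreement over all member pairs.

    Stand-in score only: the paper's focal diversity metric is not reproduced here.
    """
    pairs = list(combinations(members, 2))
    return sum(disagreement(correctness[a], correctness[b]) for a, b in pairs) / len(pairs)

def top_k_subensembles(correctness, size, k):
    """Score every sub-ensemble of the given size and return the k most diverse."""
    candidates = combinations(sorted(correctness), size)
    scored = [(ensemble_diversity(c, correctness), c) for c in candidates]
    # Highest diversity first; ties broken by member names for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return scored[:k]

# Hypothetical per-example correctness on a validation set (1 = correct).
correctness = {
    "model_a": [1, 1, 0, 0, 1, 1],
    "model_b": [1, 1, 0, 0, 1, 0],
    "model_c": [0, 0, 1, 1, 1, 1],
    "model_d": [1, 0, 1, 0, 1, 0],
}

top = top_k_subensembles(correctness, size=2, k=2)
for score, members in top:
    print(members, round(score, 3))
```

Exhaustive enumeration is only practical for small pools, since the number of candidates grows combinatorially with $N$. A score that also accounts for member accuracy would be needed to reflect the diversity-performance correlation the abstract mentions.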

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper addresses two key issues in ensembling large language models (LLMs):

1. **How to select the best model combination from a large number of open-source or closed-source LLMs**
   - Modern LLMs have billions of parameters, are trained on vast datasets, and perform well on many zero-shot and one-shot tasks. Selecting the best combination from so many candidates remains a challenge.
2. **How to combine potentially conflicting outputs from multiple LLMs to achieve the best generative output for the target learning task**
   - Multiple LLMs may produce different or even contradictory outputs. Detecting and resolving these inconsistencies to generate a high-quality final output is the second challenge.

To address these problems, the paper proposes LLM-TOPLA, a diversity-optimized LLM ensemble method with three distinguishing features:

1. **A focal diversity metric**
   - This metric captures the correlation between diversity and performance among the component LLMs of an ensemble.
2. **A diversity-optimized ensemble pruning algorithm**
   - This algorithm selects the top-k sub-ensembles from a pool of N base LLMs. The recommended sub-ensemble size is usually much smaller than N, yet the sub-ensemble performs comparably or better.
3. **A learn-to-ensemble method for generating new outputs**
   - This method learns to detect and resolve output inconsistencies among all component LLMs and generates LLM-TOPLA's output for each query.

### Experimental Results

The paper evaluates LLM-TOPLA on four different benchmarks:

1. **In constrained solution set problems**
   - LLM-TOPLA improves accuracy by 2.2% over the best-performing ensemble (Mixtral) on MMLU and by 2.1% over the best-performing LLM ensemble (MoreAgent) on GSM8k.
2. **In generative tasks**
   - LLM-TOPLA improves the F1 score by 3.9x over the top-2 performers (Llama70b/Mixtral) on SearchQA and improves ROUGE-1 by more than 38 on XSum.

### Summary

LLM-TOPLA addresses both selecting the best model combination from many LLMs and combining the outputs of multiple LLMs. It does so by introducing a focal diversity metric, a diversity-optimized ensemble pruning algorithm, and a learn-to-ensemble method, and it reports gains over the compared ensemble methods on all four benchmarks.
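
To make "resolving output inconsistencies" concrete for constrained-answer tasks, here is a minimal sketch of one possible aggregator. It is a simple accuracy-weighted vote, not the paper's learn-to-ensemble model, which the summary does not specify. Each model's weight is the log-odds of its validation accuracy, so a single reliable model can outvote several weaker ones. All model names, predictions, and labels are hypothetical.

```python
import math
from collections import defaultdict

def fit_weights(val_predictions, val_labels):
    """Assign each model a log-odds weight from its validation accuracy.

    Accuracy is clipped to [0.01, 0.99] so the weight stays finite.
    """
    weights = {}
    for name, preds in val_predictions.items():
        acc = sum(p == y for p, y in zip(preds, val_labels)) / len(val_labels)
        acc = min(max(acc, 0.01), 0.99)
        weights[name] = math.log(acc / (1 - acc))
    return weights

def weighted_vote(answers, weights):
    """Return the answer with the largest total weight across models."""
    scores = defaultdict(float)
    for name, answer in answers.items():
        scores[answer] += weights[name]
    return max(scores, key=scores.get)

# Hypothetical validation data for three component models.
val_labels = ["A", "B", "C", "D", "A"]
val_predictions = {
    "model_a": ["A", "B", "C", "D", "B"],  # 4 of 5 correct
    "model_b": ["A", "B", "C", "A", "B"],  # 3 of 5 correct
    "model_c": ["A", "C", "C", "D", "C"],  # 3 of 5 correct
}
weights = fit_weights(val_predictions, val_labels)

# The models disagree on a new query: a plain majority vote would pick "C".
answers = {"model_a": "B", "model_b": "C", "model_c": "C"}
final_answer = weighted_vote(answers, weights)
print(final_answer)
```

In this example the most accurate model's weight exceeds the combined weight of the other two, so the weighted vote returns "B" even though the majority answered "C". A learned aggregator could go further by conditioning on the query itself, which this sketch does not attempt.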

LLM-TOPLA: Efficient LLM Ensemble by Maximising Diversity

Ensemble Learning for Heterogeneous Large Language Models with Deep Parallel Collaboration

LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion

Enabling Ensemble Learning for Heterogeneous Large Language Models with Deep Parallel Collaboration

Dipper: Diversity in Prompts for Producing Large Language Model Ensembles in Reasoning tasks

One LLM is not Enough: Harnessing the Power of Ensemble Learning for Medical Question Answering

LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement

Modular Pluralism: Pluralistic Alignment via Multi-LLM Collaboration

Breaking the Ceiling of the LLM Community by Treating Token Generation as a Classification for Ensembling

LLM as a Complementary Optimizer to Gradient Descent: A Case Study in Prompt Tuning

Determine-Then-Ensemble: Necessity of Top-k Union for Large Language Model Ensembling

Octavius: Mitigating Task Interference in MLLMs via LoRA-MoE

Bridging the Gap between Different Vocabularies for LLM Ensemble

EnsLM: Ensemble Language Model for Data Diversity by Semantic Clustering

Capturing Bias Diversity in LLMs

Cost-Effective Online Multi-LLM Selection with Versatile Reward Models

Orchestrating LLMs with Different Personalizations

Diversity of Thought Improves Reasoning Abilities of LLMs

Unlocking Large Language Model's Planning Capabilities with Maximum Diversity Fine-tuning

DLP-LoRA: Efficient Task-Specific LoRA Fusion with a Dynamic, Lightweight Plugin for Large Language Models

LoRA ensembles for large language model fine-tuning