Abstract: Ensembling different large language models (LLMs) to unleash their complementary potential and harness their individual strengths is highly valuable. Nevertheless, vocabulary discrepancies among various LLMs have constrained previous studies to either selecting or blending completely generated outputs. This limitation hinders the dynamic correction and enhancement of outputs during the generation process, resulting in a limited capacity for effective ensemble. To address this issue, we propose a novel method to Ensemble LLMs via Vocabulary Alignment (EVA). EVA bridges the lexical gap among various LLMs, enabling meticulous ensemble at each generation step. Specifically, we first learn mappings between the vocabularies of different LLMs with the assistance of overlapping tokens. Subsequently, these mappings are employed to project output distributions of LLMs into a unified space, facilitating a fine-grained ensemble. Finally, we design a filtering strategy to exclude models that generate unfaithful tokens. Experimental results on commonsense reasoning, arithmetic reasoning, machine translation, and data-to-text generation tasks demonstrate the superiority of our approach compared with individual LLMs and previous ensemble methods conducted on complete outputs. Further analyses confirm that our approach can leverage knowledge from different language models and yield consistent improvement.

What problem does this paper attempt to address?

The paper addresses the difficulty of ensembling different large language models (LLMs) whose vocabularies differ. Because of this mismatch, existing ensemble methods can only select among or merge fully generated outputs. They cannot correct or improve an output while it is being generated, which limits how effective the ensemble can be. To solve this, the authors propose Ensemble LLMs via Vocabulary Alignment (EVA), a method that bridges the vocabulary gap between different LLMs and enables fine-grained ensembling at each generation step.

### Main Contributions

1. **Fine-Grained Integration**: A new LLM ensemble method that combines models at each generation step, unleashing the complementary potential of different LLMs.
2. **Filtering Strategy**: A filtering strategy that excludes models generating unfaithful tokens, so that poorly performing models do not mislead the overall decision.
3. **Experimental Evidence**: Experimental results show that the method significantly improves overall performance across various natural language processing tasks, outperforming individual LLMs and previous ensemble methods that operate on fully generated outputs.

### Method Overview

1. **Cross-Model Vocabulary Alignment**:
   - **Vocabulary Projection**: Using overlapping tokens as supervision labels, the word embeddings of different models are mapped to a common vector space.
   - **Noise Reduction**: Noise is reduced in three steps, retaining only relevant and concise alignment information.
2. **LLM Integration**:
   - **Output Distribution Alignment**: Using the established vocabulary relationships, the output distributions of non-pivot models are projected into the pivot model's vocabulary space.
   - **Filtering Strategy**: Keeps the generated tokens consistent by excluding models that generate unfaithful tokens.
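
The method overview above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It rests on several assumptions that the summary does not confirm: the embedding map is fitted with ordinary least squares, each non-pivot token is aligned only to its single nearest pivot token (standing in for the three-step noise reduction, which is not described here), and the filter drops a model whose projected top token falls outside the pivot model's top-k. All function names and the `top_k` parameter are hypothetical.

```python
import numpy as np


def learn_projection(src_emb, pivot_emb, overlap_pairs):
    """Fit a linear map W so that src_emb[i] @ W ~= pivot_emb[j] for each
    overlapping token pair (i, j).

    Assumption: plain least squares; the solver used in the paper may differ.
    """
    src_idx, pivot_idx = zip(*overlap_pairs)
    X = src_emb[list(src_idx)]      # (n_overlap, d_src)
    Y = pivot_emb[list(pivot_idx)]  # (n_overlap, d_pivot)
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W                        # (d_src, d_pivot)


def build_alignment(src_emb, pivot_emb, W):
    """Return a (V_src, V_pivot) one-hot matrix that sends each source token
    to its nearest pivot token by cosine similarity (top-1 only)."""
    projected = src_emb @ W
    projected = projected / (np.linalg.norm(projected, axis=1, keepdims=True) + 1e-12)
    pivot_unit = pivot_emb / (np.linalg.norm(pivot_emb, axis=1, keepdims=True) + 1e-12)
    nearest = np.argmax(projected @ pivot_unit.T, axis=1)
    align = np.zeros((src_emb.shape[0], pivot_emb.shape[0]))
    align[np.arange(src_emb.shape[0]), nearest] = 1.0
    return align


def ensemble_step(pivot_probs, other_probs, alignments, top_k=2):
    """Average next-token distributions inside the pivot vocabulary.

    Hypothetical filter: a non-pivot model is dropped for this step when its
    projected top token is not among the pivot model's top_k tokens.
    """
    pivot_top = set(np.argsort(pivot_probs)[-top_k:].tolist())
    kept = [pivot_probs]
    for probs, align in zip(other_probs, alignments):
        projected = probs @ align   # (V_pivot,), still sums to 1
        if int(np.argmax(projected)) in pivot_top:
            kept.append(projected)
    return np.mean(kept, axis=0)
```

In use, `learn_projection` and `build_alignment` would run once per non-pivot model before decoding, and `ensemble_step` would run at every generation step on the models' next-token distributions. Because each alignment row is one-hot, a projected distribution still sums to 1, so the average is itself a valid distribution over the pivot vocabulary.
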
### Experimental Setup

- **Datasets**: Machine translation, data-to-text generation, commonsense reasoning, and arithmetic reasoning tasks.
- **Candidate LLMs**: Seven open-source chat LLMs of approximately 7B parameters.
- **Baseline Methods**: Existing selection-based and fusion-based ensemble methods.

### Experimental Results

- **NLG Tasks**: In machine translation and data-to-text generation, EVA significantly outperforms individual LLMs and previous ensemble methods.
- **Reasoning Tasks**: EVA also performs strongly in commonsense and arithmetic reasoning, especially on GSM8K, where it improves by 10.61% over the best single model.

### Analysis

- **Impact of Model Filtering Intensity**: Tasks differ in how sensitive they are to filtering intensity. Arithmetic reasoning is very sensitive to it, while the other tasks are less so.
- **Impact of the Number of Integrated Models**: EVA's performance continues to improve as more models are added, indicating that different models hold unique knowledge that the ensemble can exploit.

### Related Work

- **Selection-Based Ensemble**: Selects the best output from several candidates, but is limited by the quality of the candidate models' outputs.
- **Fusion-Based Ensemble**: Bypasses the limitations of the existing complete outputs and usually generates better outputs.

Through these methods and analyses, EVA addresses the ensembling challenges caused by vocabulary differences between LLMs, achieving fine-grained integration and significant performance improvements across various natural language processing tasks.

Bridging the Gap between Different Vocabularies for LLM Ensemble
