Abstract: Ensembling different large language models (LLMs) to unleash their complementary potential and harness their individual strengths is highly valuable. Nevertheless, vocabulary discrepancies among various LLMs have constrained previous studies to either selecting or blending completely generated outputs. This limitation hinders the dynamic correction and enhancement of outputs during the generation process, resulting in a limited capacity for effective ensemble. To address this issue, we propose a novel method to Ensemble LLMs via Vocabulary Alignment (EVA). EVA bridges the lexical gap among various LLMs, enabling meticulous ensemble at each generation step. Specifically, we first learn mappings between the vocabularies of different LLMs with the assistance of overlapping tokens. Subsequently, these mappings are employed to project output distributions of LLMs into a unified space, facilitating a fine-grained ensemble. Finally, we design a filtering strategy to exclude models that generate unfaithful tokens. Experimental results on commonsense reasoning, arithmetic reasoning, machine translation, and data-to-text generation tasks demonstrate the superiority of our approach compared with individual LLMs and previous ensemble methods conducted on complete outputs. Further analyses confirm that our approach can leverage knowledge from different language models and yield consistent improvement.
What problem does this paper attempt to address?
The paper addresses the difficulty of ensembling different large language models (LLMs) whose vocabularies differ. Because token-level outputs are not directly comparable across vocabularies, existing ensemble methods can only select among or merge fully generated outputs; they cannot correct or improve an output while it is being generated, which limits how effective the ensemble can be. To solve this, the authors propose Ensemble LLMs via Vocabulary Alignment (EVA), which bridges the vocabulary gap between LLMs and enables fine-grained ensembling at each generation step.
### Main Contributions
1. **Fine-Grained Ensembling**: A new LLM ensemble method is proposed that combines models at each generation step, thereby unleashing the complementary potential of different LLMs.
2. **Filtering Strategy**: An effective filtering strategy is designed to exclude models that generate unfaithful tokens, preventing poorly performing models from misleading the overall judgment.
3. **Empirical Validation**: Experiments across several natural language processing tasks show that the method significantly improves overall performance, outperforming both individual LLMs and previous ensemble methods that operate on fully generated outputs.
### Method Overview
1. **Cross-Model Vocabulary Alignment**:
- **Vocabulary Projection**: Using overlapping tokens as supervision labels, the word embeddings of different models are mapped to a common vector space.
- **Noise Reduction**: The learned mappings are denoised in three steps so that only relevant and concise alignment information is retained.
2. **LLM Ensembling**:
- **Output Distribution Alignment**: Using the established vocabulary relationships, the output distribution of non-pivot models is aligned to the pivot model's space.
- **Filtering Strategy**: Ensures the generated tokens are consistent, excluding models that generate unfaithful tokens.
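
The two stages above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the least-squares linear map, the nearest-neighbour alignment by cosine similarity, and the top-n agreement rule used for filtering are all assumptions chosen for illustration, and the function names (`learn_projection`, `alignment_matrix`, `project`, `filter_and_average`) are hypothetical. The paper's three noise-reduction steps are not reproduced here.

```python
import numpy as np

def learn_projection(emb_src, emb_piv, overlap_pairs):
    """Fit a linear map W so that emb_src[i] @ W ~= emb_piv[j] for every
    overlapping token pair (i, j). Least squares is an assumed choice."""
    src_idx, piv_idx = zip(*overlap_pairs)
    X = emb_src[list(src_idx)]
    Y = emb_piv[list(piv_idx)]
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W

def alignment_matrix(emb_src, emb_piv, W):
    """Build a row-stochastic matrix A of shape (|V_src|, |V_piv|) that
    sends each source token to its nearest pivot token (cosine similarity)."""
    proj = emb_src @ W
    proj = proj / (np.linalg.norm(proj, axis=1, keepdims=True) + 1e-12)
    piv = emb_piv / (np.linalg.norm(emb_piv, axis=1, keepdims=True) + 1e-12)
    nearest = (proj @ piv.T).argmax(axis=1)
    A = np.zeros((emb_src.shape[0], emb_piv.shape[0]))
    A[np.arange(len(nearest)), nearest] = 1.0
    return A

def project(p_src, A):
    """Move a next-token distribution from the source vocabulary
    into the pivot vocabulary."""
    return p_src @ A

def filter_and_average(dists, top_n=2):
    """Assumed filtering rule: keep a model only if its top-1 token is in
    the top-n tokens of at least one other model, then average the kept
    distributions. If no model passes, fall back to averaging all of them."""
    dists = np.asarray(dists, dtype=float)
    top1 = dists.argmax(axis=1)
    topn = np.argsort(-dists, axis=1)[:, :top_n]
    keep = [
        any(top1[m] in topn[o] for o in range(len(dists)) if o != m)
        for m in range(len(dists))
    ]
    if not any(keep):
        keep = [True] * len(dists)
    return dists[np.array(keep)].mean(axis=0), keep
```

At each decoding step, one would project every non-pivot model's distribution with `project`, pass the pivot-space distributions to `filter_and_average`, and pick the next token from the averaged result. Here `top_n` plays the role of the filtering intensity in this sketch; how the paper parameterizes that intensity is not stated in this summary.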
### Experimental Setup
- **Datasets**: The evaluation covers machine translation, data-to-text generation, commonsense reasoning, and arithmetic reasoning.
- **Candidate LLMs**: Seven open-source chat LLMs of approximately 7B size were selected.
- **Baseline Methods**: Compared with existing selection-based and fusion-based methods.
### Experimental Results
- **NLG Tasks**: In machine translation and data-to-text generation tasks, EVA significantly outperforms individual LLMs and previous ensemble methods.
- **Reasoning Tasks**: EVA also performs strongly on commonsense and arithmetic reasoning; on GSM8K in particular, it improves on the best single model by 10.61%.
### Analysis
- **Impact of Model Filtering Intensity**: Different tasks have varying sensitivity to model filtering intensity. For example, arithmetic reasoning tasks are very sensitive to filtering intensity, while other tasks are less so.
- **Impact of the Number of Ensembled Models**: EVA's performance keeps improving as more models are added, indicating that different models hold complementary knowledge that the ensemble can exploit.
### Related Work
- **Selection-Based Ensemble**: Selects the best output from multiple outputs but is limited by the quality of the candidate models' outputs.
- **Fusion-Based Ensemble**: Generates a new output from the candidates' complete outputs instead of picking one of them, so it is not bounded by the quality of any single candidate and usually produces better outputs.
Through these methods and analyses, EVA addresses the ensembling challenges caused by vocabulary differences between LLMs, achieving fine-grained ensembling at each generation step and significant performance improvements across a range of natural language processing tasks.