Abstract: Large language models (LLMs) exhibit complementary strengths across tasks, motivating research on LLM ensembling. However, existing work focuses on training an extra reward model or fusion model to select or combine candidate answers, which makes generalization to unseen data distributions challenging. In addition, prior methods use textual responses as the communication medium, ignoring the valuable information in the models' internal representations. In this work, we propose DeePEn, a training-free ensemble framework that fuses the informative probability distributions yielded by different LLMs at each decoding step. Because heterogeneous LLMs use different vocabularies, their tokens are misaligned and the distributions cannot be averaged directly. To address this challenge, DeePEn maps the probability distribution of each model from its own probability space to a universal relative space, based on relative representation theory, and performs the aggregation there. We then devise a search-based inverse transformation that maps the aggregated result back to the probability space of one of the ensembled LLMs (the main model) in order to determine the next token. We conduct extensive experiments on ensembles of different numbers of LLMs, ensembles of LLMs with different architectures, and ensembles of an LLM and a specialist model. Experimental results show that (i) DeePEn achieves consistent improvements across six benchmarks covering subject examination, reasoning, and knowledge, (ii) a well-performing specialist model can benefit from a less effective LLM through distribution fusion, and (iii) DeePEn has complementary strengths with other ensemble methods such as voting.
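
The pipeline described in the abstract (map to a relative space, aggregate, search for an inverse) can be illustrated with a minimal sketch. The abstract does not spell out the details, so everything concrete here is an assumption made for illustration: the relative space is built from cosine similarities between token embeddings and a set of anchor tokens shared by both vocabularies, aggregation is a plain average, and the inverse search is gradient descent on logits under a squared-distance loss. All function names, the toy vocabularies, and the toy embeddings are hypothetical and are not taken from the paper.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def relative_matrix(embeddings, anchor_ids):
    # Row i: similarity of token i to each anchor token (assumed construction).
    return [[cosine(e, embeddings[a]) for a in anchor_ids] for e in embeddings]

def to_relative(probs, rel):
    # Project a vocabulary-sized distribution into anchor space: r = p @ R.
    return [sum(p * row[j] for p, row in zip(probs, rel)) for j in range(len(rel[0]))]

def softmax(z):
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [x / total for x in exps]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def inverse_search(target, rel, init_probs, steps=200, lr=0.1):
    # Search the main model's probability space for a distribution whose
    # relative image is close to `target`; keep the best candidate seen.
    logits = [math.log(max(p, 1e-12)) for p in init_probs]
    best_p = list(init_probs)
    best_loss = sq_dist(to_relative(init_probs, rel), target)
    for _ in range(steps):
        p = softmax(logits)
        r = to_relative(p, rel)
        loss = sq_dist(r, target)
        if loss < best_loss:
            best_p, best_loss = p, loss
        diff = [a - b for a, b in zip(r, target)]
        # Gradient of the loss with respect to p, then through the softmax.
        grad_p = [2 * sum(d * x for d, x in zip(diff, row)) for row in rel]
        mean_g = sum(pi * gi for pi, gi in zip(p, grad_p))
        logits = [z - lr * pi * (gi - mean_g) for z, pi, gi in zip(logits, p, grad_p)]
    return best_p

def ensemble_step(probs_per_model, rel_per_model, main=0):
    # Average the relative representations, then map back to the main model.
    rels = [to_relative(p, R) for p, R in zip(probs_per_model, rel_per_model)]
    target = [sum(col) / len(rels) for col in zip(*rels)]
    return inverse_search(target, rel_per_model[main], probs_per_model[main])

# Toy setup: two models with different vocabularies and embedding sizes.
# Model A vocabulary: cat, dog, the, a. Model B: the, a, cat, puppy, dog.
emb_a = [[1.0, 0.0], [0.9, 0.3], [0.0, 1.0], [0.2, 0.9]]
emb_b = [[0.0, 1.0, 0.1], [0.1, 0.9, 0.3], [1.0, 0.1, 0.0],
         [0.8, 0.2, 0.5], [0.9, 0.0, 0.4]]
# Anchors are the shared tokens cat, the, a, indexed per vocabulary.
R_a = relative_matrix(emb_a, [0, 2, 3])
R_b = relative_matrix(emb_b, [2, 0, 1])

p_a = [0.4, 0.3, 0.2, 0.1]
p_b = [0.05, 0.05, 0.1, 0.3, 0.5]
fused = ensemble_step([p_a, p_b], [R_a, R_b], main=0)
next_token_id = max(range(len(fused)), key=lambda i: fused[i])
```

Because both relative representations have one coordinate per anchor, they can be averaged even though the two vocabularies differ in size and ordering. The result `fused` is a distribution over model A's vocabulary, so the next token is chosen with model A's tokenizer.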

Distilling Knowledge from an Ensemble of Models for Punctuation Prediction

Incorporating External POS Tagger for Punctuation Restoration

Self-Attention Based Model For Punctuation Prediction Using Word And Speech Embeddings

Efficient Ensemble for Multimodal Punctuation Restoration using Time-Delay Neural Network

Focal Loss for Punctuation Prediction

Transfer knowledge for punctuation prediction via adversarial training

Multimodal Punctuation Prediction with Contextual Dropout

A Linguistically Inspired Statistical Model for Chinese Punctuation Generation

Adversarial Transfer Learning for Punctuation Restoration

PILE: Pairwise Iterative Logits Ensemble for Multi-Teacher Labeled Distillation

A Context-Aware Feature Fusion Framework for Punctuation Restoration

Ensemble Method to Joint Inference for Knowledge Extraction

FF2: A Feature Fusion Two-Stream Framework for Punctuation Restoration

Towards Understanding Ensemble, Knowledge Distillation and Self-Distillation in Deep Learning

Structural Knowledge Distillation: Tractably Distilling Information for Structured Predictor

Distilling the Knowledge in a Neural Network

Ensemble Learning for Heterogeneous Large Language Models with Deep Parallel Collaboration

Unified and Effective Ensemble Knowledge Distillation

Predicting Punctuation in Ancient Chinese Texts: A Multi-Layered LSTM and Attention-Based Approach

GOVERN: Gradient Orientation Vote Ensemble for Multi-Teacher Reinforced Distillation

Ensemble Knowledge Distillation of Self-Supervised Speech Models