Abstract: Fusing knowledge from multiple Large Language Models (LLMs) can combine their diverse strengths to achieve improved performance on a given task. However, current fusion approaches either rely on learning-based fusers that do not generalize to new LLMs, or do not take into account how well each LLM understands the input. In this work, we study LLM fusion at test-time, which enables leveraging knowledge from arbitrary user-specified LLMs during inference. We introduce Pack of LLMs (PackLLM), an effective method for test-time fusion that leverages each LLM's expertise, given an input prompt. PackLLM performs model fusion by solving an optimization problem for determining each LLM's importance, so that perplexity over the input prompt is minimized. First, our simple PackLLM-sim variant validates that perplexity is a good indicator for measuring each LLM's expertise. Second, our PackLLM-opt variant approximately solves the perplexity minimization problem via a greedy algorithm. The derived importance weights are used to combine the LLMs during inference. We conduct experiments with over 100 total LLMs on a diverse set of tasks. Experimental results show that (i) perplexity is a reliable measure for LLM fusion, (ii) PackLLM outperforms test-time fusion baselines by 1.89% accuracy points, and (iii) PackLLM can leverage new LLMs to improve performance over learning-based fusion approaches by 3.92-11.94% accuracy points.

What problem does this paper attempt to address?

This paper addresses how to combine the knowledge of multiple large language models (LLMs) to improve performance on a given task. Existing fusion methods either rely on learning-based fusion modules that do not generalize to new LLMs, or ignore how well each LLM understands the input. The paper therefore proposes a test-time fusion method, Pack of LLMs (PackLLM), which can use any user-specified LLMs during inference and sets each LLM's importance weight by minimizing perplexity over the input prompt.

### Main Contributions:

1. **Problem Definition**: The study frames test-time LLM fusion as a weighted ensemble problem and poses a perplexity minimization problem whose solution gives the importance weights of the LLMs.
2. **Algorithm**: Introduces PackLLM-opt, which approximately solves the perplexity minimization problem with a greedy algorithm. Also introduces PackLLM-sim, a simple perplexity-based ensemble method.
3. **Effectiveness**: Experiments validate that perplexity is a reliable indicator of model importance. PackLLM outperforms existing test-time fusion baselines on various tasks and can improve performance further by leveraging newly released LLMs.

### Experimental Results:

- Experiments conducted on over 100 LLMs show that perplexity is a reliable fusion indicator.
- PackLLM achieves an average accuracy improvement of 1.72–1.89% over existing test-time fusion methods across 25 tasks.
- By utilizing newly released LLMs, PackLLM outperforms competitive learning-based fusion methods by 3.92–11.94% in accuracy.

### Method Overview:

- **Perplexity Minimization**: PackLLM determines the importance weight of each LLM by minimizing the perplexity of the input prompt.
- **Simple Method PackLLM-sim**: Computes the weights directly from each LLM's perplexity score on the prompt.
- **Optimized Method PackLLM-opt**: Approximately solves the perplexity minimization problem with a greedy algorithm, so that the knowledge of the different LLMs is combined more effectively.

### Advantages:

- Does not require any training or labeled data, allowing for quick adaptation to newly released LLMs.
- Performs well when handling LLMs of different scales and specializations.

### Conclusion:

PackLLM is a test-time fusion method that determines the importance weight of each LLM by minimizing perplexity over the input prompt, improving accuracy over test-time fusion baselines and learning-based fusers across the evaluated tasks.
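The idea behind the simple perplexity-based variant can be illustrated with a short sketch. This is not the authors' implementation: the function names, the softmax over negative log-perplexity, the `temperature` parameter, and the assumption that all models share one vocabulary are illustrative choices, not details stated in the summary above. Each model scores the prompt, a lower perplexity yields a larger weight, and the weights are then used to mix the models' next-token logits.

```python
import math
from typing import List, Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity = exp of the mean negative log-likelihood of the prompt tokens."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def perplexity_weights(perplexities: Sequence[float],
                       temperature: float = 1.0) -> List[float]:
    """Softmax over -log(perplexity) / temperature.

    Assumed normalization for illustration: lower perplexity -> larger weight.
    """
    scores = [-math.log(p) / temperature for p in perplexities]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_logits(per_model_logits: Sequence[Sequence[float]],
                weights: Sequence[float]) -> List[float]:
    """Weighted sum of next-token logits (assumes a shared vocabulary)."""
    vocab_size = len(per_model_logits[0])
    return [
        sum(w * logits[i] for w, logits in zip(weights, per_model_logits))
        for i in range(vocab_size)
    ]


# Hypothetical per-token log-probabilities of the same prompt under two models.
prompt_log_probs = [
    [-0.2, -0.4, -0.3],  # model A: confident on this prompt
    [-1.5, -2.0, -1.0],  # model B: less confident
]
ppls = [perplexity(lp) for lp in prompt_log_probs]
weights = perplexity_weights(ppls)

# Hypothetical next-token logits over a toy two-token vocabulary.
fused = fuse_logits([[2.0, 0.0], [0.0, 1.0]], weights)
```

With `temperature=1.0` this reduces to normalized inverse perplexity, so the model that understands the prompt better (model A here) dominates the fused logits. A greedy search over the weights, as in the optimized variant, would replace `perplexity_weights` while leaving the fusion step unchanged.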

Pack of LLMs: Model Fusion at Test-Time via Perplexity Optimization