Abstract: Despite the impressive performance of large language models (LLMs), they often lag behind specialized models in various tasks. LLMs only use a fraction of the existing training data for in-context learning, while task-specific models harness the full dataset for fine-tuning. In this work, we tackle the problem of leveraging training data to improve the performance of LLMs without fine-tuning. Our approach directly targets LLM predictions without requiring access to their weights. We create a pool of candidates from the LLM through few-shot prompting and we employ a compact model, the LM-corrector (LMCor), specifically trained to merge these candidates to produce an enhanced output. Our experiments on four natural language generation tasks demonstrate that even a small LMCor model (250M) substantially improves the few-shot performance of LLMs (62B), matching and even outperforming standard fine-tuning. Furthermore, we illustrate the robustness of LMCor against different prompts, thereby minimizing the need for extensive prompt engineering. Finally, we show that LMCor can be seamlessly integrated with different LLMs at inference, serving as a plug-and-play module to improve their performance.

### What problem does this paper attempt to address?

This paper aims to improve the performance of large language models (LLMs) on specific tasks without fine-tuning them. Specifically:

1. **Enhancing performance using training data**: Although LLMs perform well across a wide range of tasks, on a given task they often fall short of smaller models fine-tuned for it. The paper proposes a way to improve LLM performance by leveraging the existing training data without fine-tuning the LLM itself.
2. **Reducing the need for prompt engineering**: Few-shot prompting often requires extensive prompt engineering, which is time-consuming and does not guarantee better performance. The proposed method reduces the reliance on careful prompt design by merging and refining candidate answers.
3. **No need to access model weights**: Unlike fine-tuning, the proposed method operates directly on the outputs generated by the LLM and never touches its weights. This makes it suitable for commercial models that can only be accessed through restricted inference APIs.

### Main Contributions

- Introduces the LM-corrector (LMCor), a small model that improves LLM performance by merging and correcting multiple candidate answers generated by the LLM, without access to its weights.
- Reports experiments on four natural language generation tasks showing that even a relatively small LMCor model (250 million parameters) substantially improves the few-shot performance of an LLM with 62 billion parameters, matching and in some cases outperforming standard fine-tuning.
- Shows that LMCor is robust to different prompts, reducing the need for precise prompt design.
- Demonstrates that LMCor can be paired with different LLMs at inference time as a plug-and-play module.
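The candidate-then-correct flow described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the separator string, the deduplication step, and the names `build_corrector_input`, `correct`, `sample_fn`, and `corrector_fn` are all assumptions made for the example, and the abstract does not specify how the source and candidates are actually formatted for LMCor.

```python
from typing import Callable, List

# Hypothetical separator; the real input format used by LMCor is not given here.
SEP = " </s> "


def build_corrector_input(source: str, candidates: List[str], sep: str = SEP) -> str:
    """Join the task input with the LLM candidates into one string.

    Dropping empty and duplicate candidates is an illustrative choice,
    not something stated in the summary above.
    """
    seen = set()
    unique = []
    for cand in candidates:
        text = cand.strip()
        if text and text not in seen:
            seen.add(text)
            unique.append(text)
    return sep.join([source.strip()] + unique)


def correct(
    source: str,
    sample_fn: Callable[[str], str],     # stand-in for one few-shot LLM API call
    corrector_fn: Callable[[str], str],  # stand-in for the small trained corrector
    k: int = 3,
) -> str:
    """Sample k candidates from the LLM, then let the corrector merge them."""
    candidates = [sample_fn(source) for _ in range(k)]
    return corrector_fn(build_corrector_input(source, candidates))


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without any model.
    fake_outputs = iter(["the cat sat", "the cat sat", "a cat sat down"])
    result = correct(
        "teh cat sat",
        sample_fn=lambda _src: next(fake_outputs),
        corrector_fn=lambda text: text.split(SEP)[1],  # picks the first candidate
    )
    print(result)
```

Because the corrector only sees text, `sample_fn` can wrap any LLM reachable through an inference API, which is the plug-and-play property the summary describes.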

Small Language Models Improve Giants by Rewriting Their Outputs

Large Language Models are Contrastive Reasoners

Small Language Model Can Self-correct

Small Language Models Need Strong Verifiers to Self-Correct Reasoning

Smaller Large Language Models Can Do Moral Self-Correction

Are Small Language Models Ready to Compete with Large Language Models for Practical Applications?

Can Small Language Models Help Large Language Models Reason Better?: LM-Guided Chain-of-Thought

Harnessing Large Language Models as Post-hoc Correctors

Large language model programs

Supervised Knowledge Makes Large Language Models Better In-context Learners

Language Models can Exploit Cross-Task In-context Learning for Data-Scarce Novel Tasks

A Little Help Goes a Long Way: Efficient LLM Training by Leveraging Small LMs

Generative Speech Recognition Error Correction with Large Language Models and Task-Activating Prompting

Learning to Reduce: Optimal Representations of Structured Data in Prompting Large Language Models

Large Language Models are Zero-Shot Reasoners

Large Language Models aren't all that you need

Boosting the Capabilities of Compact Models in Low-Data Contexts with Large Language Models and Retrieval-Augmented Generation

Large Language Models Can Self-Improve in Long-context Reasoning

Small Language Models Fine-tuned to Coordinate Larger Language Models improve Complex Reasoning

LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models

Large Language Models are Good Multi-lingual Learners : When LLMs Meet Cross-lingual Prompts