Abstract: In-context learning (ICL) facilitates large language models (LLMs) exhibiting spectacular emergent capabilities in various scenarios. Unfortunately, introducing demonstrations easily makes the prompt length explode, bringing a significant burden to hardware. In addition, random demonstrations usually achieve limited improvements in ICL, necessitating demonstration selection among accessible candidates. Previous studies introduce extra modules to perform demonstration compression or selection independently. In this paper, we propose an ICL framework UniICL, which Unifies demonstration selection and compression, and final response generation via a single frozen LLM. Specifically, UniICL first projects actual demonstrations and inference text inputs into short virtual tokens, respectively. Then, virtual tokens are applied to select suitable demonstrations by measuring semantic similarity within latent space among candidate demonstrations and inference input. Finally, inference text inputs together with selected virtual demonstrations are fed into the same frozen LLM for response generation. Notably, UniICL is a parameter-efficient framework that only contains 17M trainable parameters originating from the projection layer. We conduct experiments and analysis over in- and out-domain datasets of both generative and understanding tasks, encompassing ICL scenarios with plentiful and limited demonstration candidates. Results show that UniICL effectively unifies $12 \times$ compression, demonstration selection, and response generation, efficiently scaling up the baseline from 4-shot to 64-shot ICL in IMDb with 24 GB CUDA allocation.
What problem does this paper attempt to address?
### Problems Addressed by the Paper
This paper aims to address the issues of excessive input length caused by introducing examples in In-Context Learning (ICL) and the limited improvements typically achieved by randomly selecting examples. Specifically:
1. **Input Length Explosion**: In ICL, introducing demonstrations significantly increases input length, imposing a heavy burden on hardware and reducing inference throughput.
2. **Quality of Example Selection**: Randomly selected examples usually only bring limited performance improvements, necessitating effective selection from available candidate examples.
To tackle these issues, existing research often introduces additional modules that perform example compression or selection independently. However, these methods increase memory overhead, because the separate compressors or rankers must be loaded alongside the target large language model (LLM).
### Solution
This paper proposes a new ICL framework, UniICL, which unifies example selection, compression, and final response generation through a single frozen LLM. The specific contributions are as follows:
1. **Unified Framework**: According to the authors, UniICL is the first ICL framework to unify example compression, selection, and generation through a single frozen LLM.
2. **Memory-Friendly**: UniICL is a parameter-efficient framework with only 17M trainable parameters, all in the projection layer, enabling large-scale ICL on consumer-grade GPUs.
3. **Demonstration Bank Configuration**: UniICL proposes configuring a Demonstration Bank (DB) to avoid redundant compression of the same examples, improving ICL efficiency.
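
The idea behind the Demonstration Bank can be illustrated with a simple cache. The following is a minimal sketch, not the authors' implementation: the class name `DemonstrationBank`, the `get` method, and the `toy_compress` stand-in are all hypothetical, and the real compressor would be the frozen LLM plus the projection layer.

```python
from typing import Callable, Dict, List

Vector = List[float]


class DemonstrationBank:
    """Caches compressed demonstrations so each one is compressed only once."""

    def __init__(self, compress: Callable[[str], List[Vector]]) -> None:
        # `compress` is a hypothetical stand-in for "frozen LLM + projection layer".
        self._compress = compress
        self._cache: Dict[str, List[Vector]] = {}

    def get(self, demonstration: str) -> List[Vector]:
        # Compress on the first request, then reuse the stored virtual tokens.
        if demonstration not in self._cache:
            self._cache[demonstration] = self._compress(demonstration)
        return self._cache[demonstration]

    def __len__(self) -> int:
        return len(self._cache)


calls: List[str] = []


def toy_compress(text: str) -> List[Vector]:
    # Toy compressor (assumption): records the call and returns one dummy vector.
    calls.append(text)
    return [[float(len(text))]]


bank = DemonstrationBank(toy_compress)
bank.get("great movie -> positive")
bank.get("great movie -> positive")  # served from the cache, no second compression
bank.get("dull plot -> negative")
print(len(calls), len(bank))
```

In this sketch the compressor runs once per distinct demonstration, so repeated queries that reuse the same candidates only pay the compression cost the first time.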
### Method Overview
1. **Example Compression**: UniICL leverages the semantic understanding capabilities of the target LLM to compress each example independently into compressed features, then uses a learnable projection layer to convert these features into compressed virtual tokens that the LLM can accept as input.
2. **Example Selection**: The compressed virtual tokens not only replace the original examples to reduce input length but also drive example selection: candidates are ranked by their semantic similarity to the inference input in latent space, and the most suitable ones are kept.
3. **Response Generation**: UniICL generates responses through the frozen LLM, combining compressed virtual tokens and actual inference input for autoregressive generation.
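
The selection and generation steps above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's code: the virtual tokens are made-up 2-dimensional vectors, mean pooling followed by cosine similarity is one plausible way to measure similarity in latent space, and the function names (`mean_pool`, `select_demonstrations`, `build_inputs`) are hypothetical. The frozen LLM call itself is omitted.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]


def mean_pool(tokens: Sequence[Vector]) -> Vector:
    """Average a sequence of virtual-token vectors into a single vector."""
    dim = len(tokens[0])
    return [sum(tok[i] for tok in tokens) / len(tokens) for i in range(dim)]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity, returning 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_demonstrations(
    query_tokens: Sequence[Vector], bank: Dict[str, List[Vector]], k: int
) -> List[str]:
    """Return the names of the k candidates most similar to the query."""
    q = mean_pool(query_tokens)
    ranked = sorted(
        bank, key=lambda name: cosine(q, mean_pool(bank[name])), reverse=True
    )
    return ranked[:k]


def build_inputs(
    selected: Sequence[str],
    bank: Dict[str, List[Vector]],
    query_embeddings: Sequence[Vector],
) -> List[Vector]:
    """Concatenate selected virtual tokens, then the uncompressed query embeddings."""
    inputs: List[Vector] = []
    for name in selected:
        inputs.extend(bank[name])
    inputs.extend(query_embeddings)
    return inputs


# Toy compressed demonstrations (made-up values, two virtual tokens each).
bank = {
    "demo_a": [[1.0, 0.0], [0.8, 0.2]],
    "demo_b": [[0.0, 1.0], [0.1, 0.9]],
    "demo_c": [[0.7, 0.7], [0.6, 0.8]],
}
query_virtual = [[0.9, 0.1], [1.0, 0.0]]       # compressed form of the query
query_embeddings = [[0.5, 0.5]] * 3            # stand-in for the actual query tokens

selected = select_demonstrations(query_virtual, bank, k=2)
inputs = build_inputs(selected, bank, query_embeddings)
print(selected)     # names of the two closest candidates
print(len(inputs))  # virtual tokens of both demonstrations plus query tokens
```

In a real system the resulting sequence would be passed to the frozen LLM as input embeddings for autoregressive generation; how the virtual tokens and the query are ordered and delimited is an assumption here.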
### Experimental Results
Experimental results show that UniICL effectively unifies 12x compression, example selection, and response generation, scaling the baseline from 4-shot to 64-shot ICL on IMDb within a 24 GB CUDA allocation. Additionally, UniICL performs well on multiple benchmarks, including language acceptability, semantic classification, text summarization, and paragraph reordering tasks.
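
A rough back-of-the-envelope calculation shows why 12x compression makes far more shots fit in a similar prompt budget. The token counts below (300 tokens per demonstration and per query) are illustrative assumptions, not figures from the paper, and the sketch ignores the memory needed for the compression pass itself.

```python
import math


def prompt_length(shots: int, demo_tokens: int, query_tokens: int, ratio: int = 1) -> int:
    """Total prompt length when each demonstration is compressed by `ratio`.

    ratio=1 means uncompressed demonstrations; the query is never compressed here.
    """
    return shots * math.ceil(demo_tokens / ratio) + query_tokens


# Assumed lengths: 300 tokens per demonstration, 300 tokens for the query.
baseline_4_shot = prompt_length(shots=4, demo_tokens=300, query_tokens=300)
compressed_64_shot = prompt_length(shots=64, demo_tokens=300, query_tokens=300, ratio=12)
print(baseline_4_shot, compressed_64_shot)
```

Under these assumptions, a 64-shot prompt with compressed demonstrations is of the same order of length as a 4-shot prompt with raw demonstrations, which is consistent in spirit with the reported scaling.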
### Conclusion
The UniICL framework proposed in this paper unifies example selection, compression, and generation through a single frozen LLM, effectively addressing the issues of input length explosion and example selection quality in ICL. Experimental results validate the effectiveness and efficiency of this framework.