Revisiting Demonstration Selection Strategies in In-Context Learning

Keqin Peng, Liang Ding, Yancheng Yuan, Xuebo Liu, Min Zhang, Yuanxin Ouyang, Dacheng Tao
2024-06-23
Abstract: Large language models (LLMs) have shown an impressive ability to perform a wide range of tasks using in-context learning (ICL), where a few examples are used to describe a task to the model. However, the performance of ICL varies significantly with the choice of demonstrations, and it is still unclear why this happens or what factors influence that choice. In this work, we first revisit the factors contributing to this variance from both data and model aspects, and find that the choice of demonstration is both data- and model-dependent. We further propose a data- and model-dependent demonstration selection method, TopK + ConE, based on the assumption that the performance of a demonstration positively correlates with its contribution to the model's understanding of the test samples, resulting in a simple and effective recipe for ICL. Empirically, our method yields consistent improvements in both language understanding and generation tasks with different model scales. Further analyses confirm that, besides its generality and stability under different circumstances, our method provides a unified explanation for the effectiveness of previous methods. Code will be released.
Computation and Language
What problem does this paper attempt to address?
The paper addresses how to select effective demonstrations (in-context examples) for In-Context Learning (ICL). It finds that ICL performance is highly sensitive to the chosen demonstrations, and that this sensitivity depends not only on the test data but also on the retrieval module and the inference model. The paper hypothesizes that effective demonstrations are those that improve the model's understanding of the test input. Based on this hypothesis, it proposes a method named TopK + ConE, which selects the demonstrations that minimize the conditional entropy of the test input under the inference model. Experimental results show that the method consistently improves performance across models of different scales on a range of natural language understanding and generation tasks. The paper also validates the method on multi-domain demonstration pools and on dialogue models, and analyzes the impact of hyperparameters through ablation experiments.
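The two-stage selection described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the cosine-similarity retriever, and the estimate H(x | c) ≈ NLL(c + x) - NLL(c) are assumptions made for illustration, and `toy_nll` is a hypothetical stand-in for the negative log-likelihood that a real inference model would assign to a text.

```python
import math
from typing import Callable, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def topk_candidates(test_vec: Sequence[float],
                    pool_vecs: Sequence[Sequence[float]],
                    k: int) -> List[int]:
    """Stage 1 (TopK): indices of the k pool items most similar to the test input."""
    order = sorted(range(len(pool_vecs)),
                   key=lambda i: cosine(test_vec, pool_vecs[i]),
                   reverse=True)
    return order[:k]


def conditional_entropy(nll: Callable[[str], float],
                        demo: str, test_input: str) -> float:
    """Assumed estimate of H(x | c): NLL of demo plus input, minus NLL of demo alone."""
    return nll(demo + " " + test_input) - nll(demo)


def select_demonstration(test_input: str,
                         test_vec: Sequence[float],
                         pool: Sequence[str],
                         pool_vecs: Sequence[Sequence[float]],
                         nll: Callable[[str], float],
                         k: int) -> int:
    """Stage 2 (ConE): among the TopK candidates, pick the lowest conditional entropy."""
    candidates = topk_candidates(test_vec, pool_vecs, k)
    return min(candidates,
               key=lambda i: conditional_entropy(nll, pool[i], test_input))


def toy_nll(text: str) -> float:
    """Hypothetical scorer: counts distinct words, so repeated words add no cost.
    A real setup would sum token-level negative log-probabilities from the model."""
    return float(len(set(text.split())))


pool = ["the movie was great",
        "stocks fell sharply today",
        "the film was great fun"]
pool_vecs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]  # made-up embeddings
best = select_demonstration("the film was dull", [1.0, 0.0],
                            pool, pool_vecs, toy_nll, k=2)
print(pool[best])
```

With these made-up inputs, the retriever keeps the two movie-related demonstrations, and the reranking step then prefers the one that leaves fewer words of the test input unexplained under the toy scorer. Swapping `toy_nll` for a scorer backed by the actual inference model is what would make the selection model-dependent, as the summary describes.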