Instruction Tuning Vs. In-Context Learning: Revisiting Large Language Models in Few-Shot Computational Social Science

Taihang Wang, Xiaoman Xu, Yimin Wang, Ye Jiang
2024-09-23
Abstract: Real-world applications of large language models (LLMs) in computational social science (CSS) tasks primarily depend on the effectiveness of instruction tuning (IT) or in-context learning (ICL). While IT has proven highly effective at fine-tuning LLMs for various tasks, ICL offers a rapid alternative for task adaptation by learning from examples without explicit gradient updates. In this paper, we evaluate the classification performance of LLMs using IT versus ICL in few-shot CSS tasks. The experimental results indicate that ICL consistently outperforms IT in most CSS tasks. Additionally, we investigate the relationship between the increasing number of training samples and LLM performance. Our findings show that simply increasing the number of samples without considering their quality does not consistently enhance the performance of LLMs with either ICL or IT and can sometimes even result in a performance decline. Finally, we compare three prompting strategies, demonstrating that ICL is more effective than zero-shot and Chain-of-Thought (CoT). Our research highlights the significant advantages of ICL in handling CSS tasks in few-shot settings and emphasizes the importance of optimizing sample quality and prompting strategies to improve LLM classification performance. The code will be made available.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The main problem this paper addresses is how instruction tuning (IT) and in-context learning (ICL) differ in classification performance when large language models (LLMs) are applied to few-shot computational social science (CSS) tasks. Specifically, the researchers focus on the following aspects:

1. **Performance Difference**: Explore how the performance of LLMs differs between ICL and IT in few-shot CSS tasks.
2. **Impact of Sample Quantity**: Analyze how different numbers of training samples affect the performance of LLMs under ICL and IT.
3. **Effect of Prompt Strategies**: Compare how different prompt strategies (zero-shot, ICL, and chain-of-thought (CoT)) affect the performance of LLMs in CSS tasks.

### Research Background

Computational social science (CSS) is a dynamic research field that involves detailed language analysis and in-depth semantic understanding. Traditional zero-shot prompting may perform poorly on CSS tasks and may even be inferior to small, fully fine-tuned task-specific models (such as BERT). The researchers therefore compare ICL and IT to find a more effective adaptation method.

### Main Findings

1. **ICL is Superior to IT**: In the few-shot setting, LLMs generally perform better with ICL than with IT. For example, in the 1-shot setting, the average accuracy of ICL is 3.3% higher than that of IT.
2. **Impact of Sample Quantity**: Simply increasing the number of samples does not always improve the performance of LLMs and can sometimes lead to a performance decline. This indicates that the quality of samples is more important than the quantity.
3. **Effect of Prompt Strategies**: The ICL prompt strategy performs the best among the three strategies, followed by CoT, and the zero-shot strategy performs the worst.
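To make the difference between the three prompt strategies concrete, the following is a minimal sketch of how zero-shot, ICL, and CoT prompts for a classification task might be assembled. The templates, the function names, and the "think step by step" wording are illustrative assumptions; this summary does not give the paper's actual prompt formats.

```python
from typing import List, Tuple

# Hypothetical type for a labeled demonstration: (text, label).
Example = Tuple[str, str]


def _header(task: str, labels: List[str]) -> str:
    # Shared task description and label set (assumed template).
    return f"{task}\nOptions: {', '.join(labels)}"


def zero_shot_prompt(task: str, labels: List[str], text: str) -> str:
    # Zero-shot: task description only, no demonstrations.
    return f"{_header(task, labels)}\n\nText: {text}\nAnswer:"


def icl_prompt(task: str, labels: List[str], demos: List[Example], text: str) -> str:
    # ICL: prepend n labeled demonstrations; model weights are not updated.
    shots = "\n\n".join(f"Text: {t}\nAnswer: {y}" for t, y in demos)
    return f"{_header(task, labels)}\n\n{shots}\n\nText: {text}\nAnswer:"


def cot_prompt(task: str, labels: List[str], text: str) -> str:
    # CoT: ask for intermediate reasoning before the label (assumed wording).
    return (
        f"{_header(task, labels)}\n\nText: {text}\n"
        "Let's think step by step, then state the final label.\nReasoning:"
    )
```

With a single demonstration, `icl_prompt` corresponds to the 1-shot setting; passing more demonstrations lengthens the prompt but leaves the model itself unchanged, which is the key contrast with IT.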
### Experimental Setup

The researchers selected five publicly available datasets covering a wide range of CSS topics and used six open-source large language models for the experiments. Each model was evaluated in 1-shot, 8-shot, 16-shot, and 32-shot settings.

### Conclusions

1. **Advantages of ICL**: ICL shows stronger adaptability on complex CSS tasks and can adapt to a task quickly by drawing on pre-trained knowledge.
2. **Importance of Sample Quality**: In the few-shot setting, the quality of samples is more important than the quantity.
3. **Selection of Prompt Strategies**: The ICL prompt strategy is the most effective, and overly complex prompt strategies (such as CoT) may negatively affect performance.

### Limitations

1. **Resource Constraints**: Due to computational resource limitations, larger n-shot settings were not explored.
2. **Range of Model Parameters**: The experiments mainly focused on models with between 7B and 9B parameters.
3. **Quality of CoT Generation**: The CoT descriptions were mainly generated automatically by GPT-4 and may be of inconsistent quality.

Together, these results offer guidance for future applications of LLMs to CSS tasks.
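The n-shot settings in the experimental setup can be illustrated with a small sketch that draws the 1-, 8-, 16-, and 32-shot subsets from a training pool. Uniform random sampling with a fixed seed is an assumption made here for reproducibility; this summary does not say how the paper selected its samples, and the function names are hypothetical.

```python
import random
from typing import Dict, List, Sequence, Tuple

# Hypothetical type for a labeled training example: (text, label).
Example = Tuple[str, str]

# Shot counts reported in the experimental setup.
SHOT_SETTINGS = (1, 8, 16, 32)


def sample_shots(pool: Sequence[Example], n: int, seed: int = 0) -> List[Example]:
    # Draw n examples without replacement; a fixed seed makes runs repeatable.
    if n > len(pool):
        raise ValueError(f"requested {n} shots but pool has only {len(pool)} examples")
    rng = random.Random(seed)
    return rng.sample(list(pool), n)


def build_settings(
    pool: Sequence[Example],
    settings: Sequence[int] = SHOT_SETTINGS,
    seed: int = 0,
) -> Dict[int, List[Example]]:
    # One subset per n-shot setting, usable as ICL demonstrations or IT training data.
    return {n: sample_shots(pool, n, seed) for n in settings}
```

Because the finding is that more samples do not reliably help, a natural variation on this sketch would replace uniform sampling with a quality-aware selection step; that variation is not part of the summarized experiments.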