Abstract: Real-world applications of large language models (LLMs) in computational social science (CSS) tasks primarily depend on the effectiveness of instruction tuning (IT) or in-context learning (ICL). While IT has proven highly effective at fine-tuning LLMs for various tasks, ICL offers a rapid alternative for task adaptation by learning from examples without explicit gradient updates. In this paper, we evaluate the classification performance of LLMs using IT versus ICL in few-shot CSS tasks. The experimental results indicate that ICL consistently outperforms IT in most CSS tasks. Additionally, we investigate the relationship between the number of training samples and LLM performance. Our findings show that simply increasing the number of samples without considering their quality does not consistently enhance the performance of LLMs with either ICL or IT and can sometimes even result in a performance decline. Finally, we compare three prompting strategies, demonstrating that ICL is more effective than zero-shot and Chain-of-Thought (CoT). Our research highlights the significant advantages of ICL in handling CSS tasks in few-shot settings and emphasizes the importance of optimizing sample quality and prompting strategies to improve LLM classification performance. The code will be made available.

What problem does this paper attempt to address?

The paper evaluates how instruction tuning (IT) and in-context learning (ICL) differ in classification performance when large language models (LLMs) are applied to few-shot computational social science (CSS) tasks. Specifically, the researchers focus on the following aspects:

1. **Performance Difference**: Explore how the performance of LLMs with ICL and with IT differs in few-shot CSS tasks.
2. **Impact of Sample Quantity**: Analyze how different numbers of training samples affect the performance of LLMs under ICL and IT.
3. **Effect of Prompt Strategies**: Compare how different prompt strategies (zero-shot, ICL, and chain-of-thought (CoT)) affect the performance of LLMs in CSS tasks.

### Research Background

Computational social science (CSS) is a dynamic research field that involves detailed language analysis and in-depth semantic understanding. Traditional zero-shot prompting may perform poorly on CSS tasks, and may even be inferior to fully fine-tuned small task-specific models (such as BERT). The researchers therefore compare ICL and IT to find a more effective adaptation method.

### Main Findings

1. **ICL is Superior to IT**: In the few-shot setting, LLMs generally perform better with ICL than with IT. For example, in the 1-shot setting, the average accuracy of ICL is 3.3% higher than that of IT.
2. **Impact of Sample Quantity**: Simply increasing the number of samples does not always improve the performance of LLMs, and can sometimes even lead to a performance decline. This suggests that the quality of samples matters more than the quantity.
3. **Effect of Prompt Strategies**: The ICL prompt strategy performs the best among the three strategies, followed by CoT, and the zero-shot strategy performs the worst.

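
The three prompt strategies compared above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function name `build_prompt`, the prompt wording, and the `Text:` / `Label:` layout are hypothetical and are not taken from the paper. The CoT branch is a simplified instruction-only variant; the paper's CoT descriptions are generated by GPT-4, and their format is not described in this summary.

```python
from typing import List, Optional, Tuple

Demo = Tuple[str, str]  # (input text, gold label); hypothetical format


def build_prompt(
    text: str,
    labels: List[str],
    strategy: str = "zero-shot",
    demos: Optional[List[Demo]] = None,
) -> str:
    """Assemble a classification prompt for one of three strategies."""
    parts = [f"Classify the text into one of: {', '.join(labels)}."]
    if strategy == "icl":
        # ICL: prepend labelled demonstrations; no model weights are updated.
        for demo_text, demo_label in demos or []:
            parts.append(f"Text: {demo_text}\nLabel: {demo_label}")
    elif strategy == "cot":
        # CoT (simplified assumption): request reasoning before the label.
        parts.append("Explain your reasoning step by step, then give the label.")
    elif strategy != "zero-shot":
        raise ValueError(f"unknown strategy: {strategy}")
    # The query always comes last, with the label left for the model to fill in.
    parts.append(f"Text: {text}\nLabel:")
    return "\n\n".join(parts)
```

Calling `build_prompt(text, labels, "icl", demos)` with one demonstration would correspond to a 1-shot prompt; the zero-shot prompt is the same template with no demonstrations.
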
### Experimental Setup

The researchers selected five publicly available datasets covering a wide range of CSS topics and used six open-source large language models for the experiments. Each model was evaluated in 1-shot, 8-shot, 16-shot, and 32-shot settings.

### Conclusions

1. **Advantages of ICL**: ICL shows stronger adaptability when handling complex CSS tasks and can quickly adapt to tasks by using pre-trained knowledge.
2. **Importance of Sample Quality**: In the few-shot setting, the quality of samples is more important than the quantity.
3. **Selection of Prompt Strategies**: The ICL prompt strategy is the most effective, and overly complex prompt strategies (such as CoT) may negatively affect performance.

### Limitations

1. **Resource Constraints**: Due to computational resource limitations, larger n-shot settings were not explored.
2. **Range of Model Parameters**: The experiments mainly focused on models with between 7B and 9B parameters.
3. **Quality of CoT Generation**: The CoT descriptions rely mainly on automatic generation by GPT-4 and may be of inconsistent quality.

Through these studies, the paper provides reference points and guidance for the future application of LLMs in CSS tasks.
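
The n-shot settings described under Experimental Setup can be sketched as a sampling step followed by an accuracy computation. This is a rough sketch under stated assumptions, not the authors' code: it assumes "n-shot" means n examples per class (the summary does not say whether n is per class or in total), and the names `sample_shots`, `accuracy`, and `SHOT_SETTINGS` are hypothetical.

```python
import random
from typing import Dict, List, Sequence, Tuple

Example = Tuple[str, str]  # (input text, gold label)

SHOT_SETTINGS = (1, 8, 16, 32)  # the four settings listed in the summary


def sample_shots(pool: Sequence[Example], n: int, seed: int = 0) -> List[Example]:
    """Draw up to n examples per label from the pool (per-class n is an assumption)."""
    rng = random.Random(seed)  # local RNG so runs are reproducible
    by_label: Dict[str, List[Example]] = {}
    for text, label in pool:
        by_label.setdefault(label, []).append((text, label))
    shots: List[Example] = []
    for label in sorted(by_label):  # sorted for a deterministic order
        items = by_label[label]
        shots.extend(rng.sample(items, min(n, len(items))))
    rng.shuffle(shots)  # avoid grouping demonstrations by label
    return shots


def accuracy(preds: Sequence[str], golds: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the gold labels."""
    if len(preds) != len(golds):
        raise ValueError("preds and golds must have the same length")
    if not golds:
        return 0.0
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)
```

The same sampled examples could serve either as ICL demonstrations or as the IT training set, which keeps the two methods comparable at each value in `SHOT_SETTINGS`.
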

Instruction Tuning Vs. In-Context Learning: Revisiting Large Language Models in Few-Shot Computational Social Science
