Open-Source LLMs for Text Annotation: A Practical Guide for Model Setting and Fine-Tuning

Meysam Alizadeh, Maël Kubli, Zeynab Samei, Shirin Dehghani, Mohammadmasiha Zahedivafa, Juan Diego Bermeo, Maria Korobeynikova, Fabrizio Gilardi

2024-05-29

Abstract: This paper studies the performance of open-source Large Language Models (LLMs) in text classification tasks typical for political science research. By examining tasks like stance, topic, and relevance classification, we aim to guide scholars in making informed decisions about their use of LLMs for text analysis. Specifically, we conduct an assessment of both zero-shot and fine-tuned LLMs across a range of text annotation tasks using news articles and tweets datasets. Our analysis shows that fine-tuning improves the performance of open-source LLMs, allowing them to match or even surpass zero-shot GPT-3.5 and GPT-4, though still lagging behind fine-tuned GPT-3.5. We further establish that fine-tuning is preferable to few-shot training with a relatively modest quantity of annotated text. Our findings show that fine-tuned open-source LLMs can be effectively deployed in a broad spectrum of text annotation applications. We provide a Python notebook facilitating the application of LLMs in text annotation for other researchers.

Computation and Language

What problem does this paper attempt to address?

The paper addresses how to choose an appropriate approach for text annotation tasks, focusing on the performance of large language models (LLMs) in political science research. Specifically, the authors compare three approaches (zero-shot, few-shot, and fine-tuning) and examine the performance differences between open-source LLMs (such as LLaMA and FLAN) and proprietary LLMs (such as GPT-3.5 and GPT-4). The main goal is to help researchers make informed choices when conducting text classification tasks, including whether manual data annotation is necessary, which type of model to choose, and how to fine-tune it. Through empirical analysis, the authors find that fine-tuning improves the performance of open-source LLMs, allowing them to match or even surpass zero-shot GPT-3.5 and GPT-4 while still lagging behind fine-tuned GPT-3.5. The paper also highlights the advantages of open-source LLMs in cost-effectiveness, transparency, and data protection, and provides a Python notebook to help other researchers apply these models.
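To make the difference between zero-shot and few-shot annotation concrete, the following is a minimal Python sketch, not the authors' notebook. The label set, the prompt wording, and the helper names `build_prompt` and `accuracy` are all assumptions introduced for illustration; the call to an actual model is deliberately left out, and fine-tuning is not shown.

```python
from typing import List, Sequence, Tuple

# Hypothetical label set for a stance task; not taken from the paper.
LABELS = ["favor", "against", "neutral"]


def build_prompt(text: str, examples: Sequence[Tuple[str, str]] = ()) -> str:
    """Build a zero-shot prompt (no examples) or a few-shot prompt (with examples)."""
    parts = [f"Classify the stance of the text as one of: {', '.join(LABELS)}."]
    # Few-shot: prepend labeled demonstrations before the text to annotate.
    for example_text, example_label in examples:
        parts.append(f"Text: {example_text}\nLabel: {example_label}")
    # The final block is left open so the model completes the label.
    parts.append(f"Text: {text}\nLabel:")
    return "\n\n".join(parts)


def accuracy(predicted: List[str], gold: List[str]) -> float:
    """Share of model labels that match the human-annotated labels."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("gold must not be empty")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Zero-shot: instruction plus the text only.
zero_shot = build_prompt("The new policy is a disaster.")

# Few-shot: the same instruction with one labeled demonstration.
few_shot = build_prompt(
    "The new policy is a disaster.",
    examples=[("I fully support this bill.", "favor")],
)
```

Either prompt string would be sent to the chosen model, and the returned labels could then be compared against a human-annotated sample with `accuracy` to judge whether zero-shot output is good enough or whether annotating data for fine-tuning is worthwhile.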

Large Language Models for Data Annotation: A Survey

Prompting and Fine-Tuning Open-Sourced Large Language Models for Stance Classification

Advancing Annotation of Stance in Social Media Posts: A Comparative Analysis of Large Language Models and Crowd Sourcing

The Effectiveness of LLMs as Annotators: A Comparative Overview and Empirical Analysis of Direct Representation

Lawma: The Power of Specialization for Legal Tasks

Benchmarking LLMs in Political Content Text-Annotation: Proof-of-Concept with Toxicity and Incivility Data

Assessing Open-Source Large Language Models on Argumentation Mining Subtasks

Best Practices for Text Annotation with Large Language Models

Evaluation is all you need. Prompting Generative Large Language Models for Annotation Tasks in the Social Sciences. A Primer using Open Models

Prompt Refinement or Fine-tuning? Best Practices for using LLMs in Computational Social Science Tasks

Open, Closed, or Small Language Models for Text Classification?

Large Language Models for Data Annotation and Synthesis: A Survey

Positioning Political Texts with Large Language Models by Asking and Averaging

How to use LLMs for Text Analysis

Enhancing Text Classification through LLM-Driven Active Learning and Human Annotation

Are LLMs Effective Backbones for Fine-tuning? An Experimental Investigation of Supervised LLMs on Chinese Short Text Matching

The Ultimate Guide to Fine-Tuning LLMs from Basics to Breakthroughs: An Exhaustive Review of Technologies, Research, Best Practices, Applied Research Challenges and Opportunities

Evaluating Large Language Models for Public Health Classification and Extraction Tasks

Assessing Hidden Risks of LLMs: An Empirical Study on Robustness, Consistency, and Credibility

Are LLMs Better than Reported? Detecting Label Errors and Mitigating Their Effect on Model Performance