Large Language Models for Propaganda Span Annotation

Maram Hasanain, Fatema Ahmad, Firoj Alam
2024-10-06
Abstract: The use of propagandistic techniques in online content has increased in recent years, aiming to manipulate online audiences. Fine-grained propaganda detection and extraction of the textual spans where propaganda techniques are used are essential for more informed content consumption. Automatic systems targeting the task over lower-resourced languages are limited, usually obstructed by the lack of large-scale training datasets. Our study investigates whether Large Language Models (LLMs), such as GPT-4, can effectively extract propagandistic spans. We further study the potential of employing the model to collect more cost-effective annotations. Finally, we examine the effectiveness of labels provided by GPT-4 in training smaller language models for the task. The experiments are performed over a large-scale in-house manually annotated dataset. The results suggest that providing more annotation context to GPT-4 within prompts improves its performance compared to human annotators. Moreover, when serving as an expert annotator (consolidator), the model provides labels that have higher agreement with expert annotators and lead to specialized models that achieve state-of-the-art performance on an unseen Arabic test set. Finally, our work is the first to show the potential of utilizing LLMs to develop annotated datasets for the propagandistic span detection task by prompting them with annotations from human annotators with limited expertise. All scripts and annotations will be shared with the community.
Computation and Language
What problem does this paper attempt to address?
The paper addresses the automatic detection and extraction of propaganda techniques in online content. Specifically, the authors focus on fine-grained propaganda detection at the level of text spans and investigate how well large language models (such as GPT-4) perform on this task. The main issues addressed in the paper are as follows:

1. **Automatic Detection and Extraction of Propaganda Techniques**:
   - The use of propaganda techniques in online content is increasing, aiming to manipulate online audiences. Fine-grained detection and extraction of the text spans where these techniques are used are therefore crucial for more informed content consumption.
2. **Development of Automatic Systems for Low-Resource Languages**:
   - For low-resource languages, the development of automatic systems is often limited by the lack of large-scale training datasets. The paper explores whether large language models (such as GPT-4) can effectively extract propagandistic spans and be used to collect more cost-effective annotations.
3. **Using GPT-4 Generated Labels to Train Smaller Language Models**:
   - The paper further investigates how effective the labels provided by GPT-4 are for training smaller language models, particularly as measured on an unseen Arabic test set.

Through these questions, the authors explore the potential of large language models for propaganda technique detection and annotation, and how these models can be leveraged to reduce the cost and effort of manual annotation.
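To make the notion of span annotation and annotator agreement concrete, the following is a minimal Python sketch, not the paper's method. It assumes spans are stored as character offsets with a technique label and compares two annotators with a simple character-level F1. The `Span` class, the `char_f1` function, the example sentence, and the `Loaded_Language` label are all hypothetical illustrations; the paper's actual data format and agreement metric may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A labeled character span: text[start:end] is tagged with `technique`."""
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    technique: str


def labeled_chars(spans):
    """Expand spans into a set of (character index, technique) pairs."""
    return {(i, s.technique) for s in spans for i in range(s.start, s.end)}


def char_f1(predicted, reference):
    """Character-level F1 between two lists of labeled spans (an assumed metric)."""
    pred, ref = labeled_chars(predicted), labeled_chars(reference)
    if not pred and not ref:
        return 1.0  # both annotators agree there is nothing to label
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the expert marks "destroying", the model marks "destroying our".
text = "They are destroying our great nation."
expert = [Span(9, 19, "Loaded_Language")]
model = [Span(9, 23, "Loaded_Language")]
score = char_f1(model, expert)
print(text[9:19], text[9:23], round(score, 3))
```

Character-level overlap gives partial credit when a predicted span covers the right words but has slightly different boundaries, which is one common reason to prefer it over exact-match scoring for span tasks.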