Abstract: The advent of Large Language Models (LLMs) has advanced the benchmark in various Natural Language Processing (NLP) tasks. However, large amounts of labelled training data are required to train LLMs. Furthermore, data annotation and training are computationally expensive and time-consuming. Zero-shot and few-shot learning have recently emerged as viable options for labelling data using large pre-trained models. Hate speech detection in code-mixed low-resource languages is an active problem area where the use of LLMs has proven beneficial. In this study, we have compiled a dataset of 100 YouTube comments and weakly labelled them for coarse- and fine-grained misogyny classification in code-mixed Hinglish. Weak annotation was applied due to the labor-intensive annotation process. Zero-shot learning, one-shot learning, and few-shot learning and prompting approaches have then been applied to assign labels to the comments and compare them to human-assigned labels. Out of all the approaches, zero-shot classification using the Bidirectional and Auto-Regressive Transformers (BART) large model and few-shot prompting using Generative Pre-trained Transformer 3 (ChatGPT-3) achieve the best results.
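The difference between zero-shot, one-shot, and few-shot prompting can be illustrated with a minimal sketch of prompt assembly. This is not the authors' prompt: the instruction wording, the `build_prompt` helper, and the placeholder comments are all assumptions made for illustration. Only the `MGY` and `NOT` labels come from the paper's summary.

```python
from typing import Sequence, Tuple

# Coarse-grained labels named in the summary of the paper.
LABELS = ("MGY", "NOT")


def build_prompt(comment: str, examples: Sequence[Tuple[str, str]] = ()) -> str:
    """Assemble a classification prompt (hypothetical wording).

    Zero examples gives a zero-shot prompt, one example a one-shot prompt,
    and several examples a few-shot prompt.
    """
    lines = [
        "Classify the Hinglish YouTube remark below as misogynistic (MGY) or not (NOT).",
        "Answer with exactly one of the two tags.",
        "",
    ]
    for text, label in examples:
        if label not in LABELS:
            raise ValueError(f"unknown label: {label}")
        # Each labelled demonstration precedes the comment to be classified.
        lines += [f"Comment: {text}", f"Label: {label}", ""]
    # The final comment is left unlabelled for the model to complete.
    lines += [f"Comment: {comment}", "Label:"]
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder strings stand in for real comments, which are not shown here.
    demos = [("<labelled example 1>", "MGY"), ("<labelled example 2>", "NOT")]
    print(build_prompt("<comment to classify>", demos))
```

Passing an empty `examples` sequence yields the zero-shot variant, so the same helper covers all three settings.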

What problem does this paper attempt to address?

The problem this paper addresses is hate speech detection using weakly labeled data in code-mixed Hindi-English (Hinglish). Specifically, the researchers compiled a dataset of 100 YouTube comments and weakly labeled them for coarse-grained and fine-grained misogyny classification. Because data labeling is labor-intensive and time-consuming, zero-shot, one-shot, and few-shot learning methods were used to assign labels, and these labels were compared with the manually assigned labels. The main purpose of the study was to explore the feasibility and effectiveness of using large language models (LLMs) for hate speech detection in low-resource languages. The key points of the paper are as follows:

1. **Research Background**:
   - The popularity of social media has led to an increase in online hate speech, which harms the mental health of targeted individuals.
   - Traditional machine-learning and deep-learning models have shown potential for hate speech detection in low-resource and code-mixed languages, but they require a large amount of labeled data, which is both expensive and time-consuming to produce.
2. **Research Method**:
   - The dataset consists of 100 YouTube comments, which are weakly labeled as misogyny (MGY) or non-misogyny (NOT).
   - Comments labeled as misogyny are further subdivided into 9 categories, such as "misleading", "sexual harassment and violence threats", and "stereotype".
   - Zero-shot, one-shot, and few-shot learning methods are used to classify the comments, and the results are compared with the manual labels.
3. **Research Hypothesis**:
   - Zero-shot, one-shot, and few-shot learning can be reliably used for coarse-grained and fine-grained misogyny classification of code-mixed Hinglish YouTube comments.
4. **Research Results**:
   - Zero-shot learning performs best in the binary classification task, with an accuracy of 54%, but performs poorly in the multi-label classification task.
   - One-shot and few-shot learning perform worse than zero-shot learning in the binary classification task, but show some potential in the multi-label classification task.
   - One-shot prompting using ChatGPT-3 performs well in the fine-grained classification task and correctly identifies most of the labels.
5. **Discussion and Conclusion**:
   - Although zero-shot learning performs best in the binary classification task, its overall performance is still limited, especially in the multi-label classification task.
   - The study shows that large language models have some potential for hate speech detection in low-resource languages, especially through few-shot learning and one-shot prompting.
   - Future work should verify the effectiveness of these methods on larger datasets and explore the use of multilingual large language models for misogyny classification.

In conclusion, this paper tests, through experiments, the feasibility of using large language models for hate speech detection in low-resource languages, providing a reference for future related research.
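The comparison between model-assigned and human-assigned labels described above can be sketched as follows. Accuracy is the metric the summary reports for the binary task. The summary does not say how the multi-label (fine-grained) task was scored, so the mean Jaccard overlap used here is an assumption, not the authors' method. The function names and the toy label lists are hypothetical and are not the paper's data.

```python
from typing import Sequence, Set


def accuracy(predicted: Sequence[str], reference: Sequence[str]) -> float:
    """Fraction of comments whose predicted label equals the human label."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(p == r for p, r in zip(predicted, reference)) / len(reference)


def mean_jaccard(predicted: Sequence[Set[str]], reference: Sequence[Set[str]]) -> float:
    """Average per-comment overlap between predicted and human label sets.

    Assumed metric for the fine-grained task; two empty sets count as a match.
    """
    if len(predicted) != len(reference) or not reference:
        raise ValueError("label lists must be non-empty and of equal length")
    scores = []
    for p, r in zip(predicted, reference):
        union = p | r
        scores.append(1.0 if not union else len(p & r) / len(union))
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy labels for illustration only.
    human = ["MGY", "NOT", "MGY", "NOT"]
    model = ["MGY", "MGY", "MGY", "NOT"]
    print("binary accuracy:", accuracy(model, human))

    human_fine = [{"stereotype"}, {"misleading"}]
    model_fine = [{"stereotype"}, {"stereotype", "misleading"}]
    print("mean Jaccard:", mean_jaccard(model_fine, human_fine))
```

With only 100 comments, each comment shifts accuracy by one percentage point, which is one reason the summary calls for validation on larger datasets.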

Leveraging Weakly Annotated Data for Hate Speech Detection in Code-Mixed Hinglish: A Feasibility-Driven Transfer Learning Approach with Large Language Models
