Cross-lingual Offensive Language Detection: A Systematic Review of Datasets, Transfer Approaches and Challenges

Aiqi Jiang, Arkaitz Zubiaga
2024-01-17
Abstract: The growing prevalence and rapid evolution of offensive language in social media amplify the complexities of detection, particularly highlighting the challenges in identifying such content across diverse languages. This survey presents a systematic and comprehensive exploration of Cross-Lingual Transfer Learning (CLTL) techniques in offensive language detection in social media. Our study stands as the first holistic overview to focus exclusively on the cross-lingual scenario in this domain. We analyse 67 relevant papers and categorise these studies across various dimensions, including the characteristics of multilingual datasets used, the cross-lingual resources employed, and the specific CLTL strategies implemented. According to "what to transfer", we also summarise three main CLTL transfer approaches: instance, feature, and parameter transfer. Additionally, we shed light on the current challenges and future research opportunities in this field. Furthermore, we have made our survey resources available online, including two comprehensive tables that provide accessible references to the multilingual datasets and CLTL methods used in the reviewed literature.
Computation and Language
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper systematically reviews datasets, transfer methods and challenges in cross-lingual offensive language detection. Specifically, it focuses on the following aspects:

1. **Dataset analysis**:
   - **Multilingual datasets**: Analyzes the multilingual datasets used in 67 related papers on cross-lingual offensive language detection, including their sources, languages, sizes and label types.
   - **Dataset distribution**: Explores how the datasets are distributed across topics (such as offensiveness and hate speech), data sources (such as Twitter and Facebook), languages and language families.
2. **Cross-lingual resources**:
   - **Multilingual dictionaries**: Discusses the use of multilingual dictionaries (such as HurtLex) in cross-lingual tasks. These dictionaries provide direct translations or equivalent words between languages.
   - **Parallel corpora**: Introduces the role of sentence-aligned parallel corpora, which contain sentence pairs in two or more languages and help bridge the language gap.
3. **Cross-lingual transfer learning (CLTL) techniques**:
   - **Transfer strategies**: Summarizes three main CLTL transfer methods (instance transfer, feature transfer and parameter transfer) and describes the implementation and application scenarios of each.
   - **Model adaptation**: Discusses fine-tuning the model on the target language to improve its performance on that language.
4. **Current challenges and future research directions**:
   - **Data scarcity**: Emphasizes the lack of annotated data in low-resource languages and how CLTL techniques can alleviate it.
   - **Linguistic differences**: Explores the linguistic differences between languages, which limit how well cross-lingual models generalize.
   - **Ethical and legal issues**: Discusses the ethical and legal issues to consider when developing and deploying cross-lingual offensive language detection systems.

### Summary

By systematically reviewing the existing literature, this paper analyzes the current state, challenges and future research directions of cross-lingual offensive language detection. It pays special attention to multilingual datasets, cross-lingual resources and CLTL techniques, aiming to give researchers a comprehensive reference framework and to support further work in this field.
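
As a rough illustration of how a multilingual dictionary can bridge languages, the sketch below maps words from two languages onto shared concept IDs, so that texts in either language yield the same feature vector. This is a toy example under stated assumptions, not the method of any reviewed paper: the lexicon entries, the concept names and the `concept_features` helper are all hypothetical, and real resources such as HurtLex are far larger and define their own categories.

```python
from collections import Counter

# Hypothetical toy lexicon: language -> {surface word -> shared concept ID}.
# The entries and concept names are invented for illustration only.
TOY_LEXICON = {
    "en": {
        "idiot": "INSULT_INTELLIGENCE",
        "stupid": "INSULT_INTELLIGENCE",
        "trash": "DEHUMANISE",
    },
    "es": {
        "idiota": "INSULT_INTELLIGENCE",
        "estúpido": "INSULT_INTELLIGENCE",
        "basura": "DEHUMANISE",
    },
}

# Fixed, sorted concept order so that vectors are comparable across languages.
CONCEPTS = sorted({c for words in TOY_LEXICON.values() for c in words.values()})


def concept_features(text, lang):
    """Count lexicon concepts in `text`, returning one count per concept."""
    lexicon = TOY_LEXICON[lang]
    tokens = text.lower().split()  # naive whitespace tokenisation
    counts = Counter(lexicon[t] for t in tokens if t in lexicon)
    return [counts[c] for c in CONCEPTS]


en_vec = concept_features("you are a stupid idiot", "en")
es_vec = concept_features("eres un idiota estúpido", "es")
print(CONCEPTS, en_vec, es_vec)
```

Because both sentences map to the same concept counts, a classifier trained on such features in one language could in principle be applied to the other. In practice, lexicon coverage and context-dependent meaning limit how far this simple mapping carries.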