PinLID: a dataset for Pinglish language identification based on code-mixing sentence on unstructured resources
Arash Ghafouri, Hasan Naderi, Mahdi Firouzmandi
DOI: https://doi.org/10.1007/s10579-024-09783-3
2024-12-09
Language Resources and Evaluation
Abstract: Language identification is a major task in natural language processing. It serves as an initial and effective stage in critical tasks such as information extraction, sentiment analysis, and question answering. Most research on language identification has focused on monolingual contexts, and the resulting methods perform poorly on code-mixed texts. Identifying the language of social media texts, such as those on Twitter, is challenging because of their high levels of code-mixing. Consequently, an accurate language identification tool for code-mixed texts is essential for intelligent systems that rely on natural language processing, such as advanced search engines and question-answering systems. Significant research in this field has recently been conducted for non-Persian languages. However, no substantial efforts have been made to identify languages in code-mixed Persian texts. In this paper, we introduce a dataset called PinLID, collected from tweets with Persian-English code-mixing and labeled at both the sentence and token levels for a supervised learning approach to language identification. We evaluated the dataset using various machine learning classification algorithms, including the classical SVM method, the multilingual BERT language model, XLM-RoBERTa, ParsBERT, AriaBERT, and PersianLLaMA: Persian Large Language Model. The evaluation yielded F1 scores as high as 99.59% at both the sentence and token levels on the test data.
computer science, interdisciplinary applications
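
The abstract reports F1 scores for token-level language identification. As a rough illustration of how such a score can be computed, the following minimal sketch evaluates per-class and macro-averaged F1 over token labels. The label names ("fa", "en"), the example label sequences, and the choice of macro averaging are assumptions made for illustration only; the abstract does not specify PinLID's label set or which F1 averaging was used.

```python
from collections import Counter

# Hypothetical token-level labels for one code-mixed sentence.
# "fa" and "en" are illustrative tag names, not PinLID's actual label set.
gold = ["fa", "fa", "en", "en", "fa"]
pred = ["fa", "en", "en", "en", "fa"]


def per_class_f1(gold_labels, pred_labels):
    """Return a dict mapping each label to its F1 score."""
    if len(gold_labels) != len(pred_labels):
        raise ValueError("gold and predicted sequences must have equal length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold_labels, pred_labels):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p was wrong
            fn[g] += 1  # gold label g was missed
    scores = {}
    for label in set(gold_labels) | set(pred_labels):
        # F1 = 2*TP / (2*TP + FP + FN)
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(gold_labels, pred_labels):
    """Unweighted mean of the per-class F1 scores (one possible averaging)."""
    scores = per_class_f1(gold_labels, pred_labels)
    return sum(scores.values()) / len(scores)


scores = per_class_f1(gold, pred)
macro = macro_f1(gold, pred)
```

The same functions apply unchanged at the sentence level if each element of the two lists is a sentence label rather than a token label.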