Abstract: Social media plays a significant role in cross-cultural communication. A vast amount of this occurs in code-mixed and multilingual form, posing a significant challenge to Natural Language Processing (NLP) tools for processing such information, like language identification, topic modeling, and named-entity recognition. To address this, we introduce a large-scale, multilingual, and multi-topic dataset (MMT) collected from Twitter (1.7 million tweets), encompassing 13 coarse-grained and 63 fine-grained topics in the Indian context. We further annotate a subset of 5,346 tweets from the MMT dataset with various Indian languages and their code-mixed counterparts. Also, we demonstrate that the currently existing tools fail to capture the linguistic diversity in MMT on two downstream tasks, i.e., topic modeling and language identification. To facilitate future research, we will make the anonymized and annotated dataset available in the public domain.
What problem does this paper attempt to address?
The paper mainly addresses the following issues:
1. **Building a Multilingual, Multi-Topic Indian Social Media Dataset**: The authors constructed MMT, a large-scale dataset of approximately 1.7 million tweets collected from Twitter, covering 13 coarse-grained and 63 fine-grained topics in the Indian context. The 13 coarse-grained topics are the environment, food, economy and retail, natural disasters, arts and literature, sports, politics, research and development & technology, wildlife and vegetation, manufacturing, movies & OTT, news media, and education.
2. **Language Identification Challenge**: The study found that existing language identification tools perform poorly when dealing with the linguistic diversity in the MMT dataset, especially on multilingual and code-mixed texts. About 11.45% of the tweets were labeled as code-mixed, and Twitter's language identification system often mislabels these mixed-language tweets.
3. **Topic Modeling Evaluation**: The paper discusses the performance of traditional topic modeling tools (such as LDA) in a multilingual environment and evaluates the Zero-Shot Cross-Lingual Contextual Topic Model (ZeroShotTM) as an alternative. The aim is to overcome the limitations of bag-of-words models, which ignore syntax and word order and rely only on word frequency. Experimental results show that ZeroShotTM outperforms LDA in topic modeling tasks, especially on multilingual datasets.
4. **Performance of Language Identification Tools**: The paper also evaluates several language identification systems (such as Polyglot, FastText, Langdetect, and CLD3) on the MMT dataset, finding that they perform well on English data but are less effective on non-English or multilingual data.
5. **Future Work**: The paper notes that building a robust multilingual system that performs well across all languages is difficult because English is overrepresented in the dataset and certain other languages are underrepresented. Data augmentation techniques such as paraphrasing and oversampling, as well as transfer learning methods, could be employed to address this imbalance.
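
As a rough illustration of the oversampling idea mentioned in the last item, the sketch below duplicates samples from underrepresented languages until every language has as many samples as the largest class. It is not the authors' procedure: the function name `oversample_by_language`, the language labels, and the toy tweets are all hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def oversample_by_language(samples, seed=0):
    """Randomly duplicate minority-language samples until all languages
    have as many samples as the most frequent language.

    `samples` is a list of (text, language_label) pairs.
    """
    if not samples:
        return []
    rng = random.Random(seed)

    # Group the samples by their language label.
    by_lang = defaultdict(list)
    for text, lang in samples:
        by_lang[lang].append((text, lang))

    # The largest class sets the target size for every language.
    target = max(len(group) for group in by_lang.values())

    balanced = []
    for group in by_lang.values():
        balanced.extend(group)
        # Draw the missing samples with replacement from the same language.
        balanced.extend(rng.choices(group, k=target - len(group)))

    rng.shuffle(balanced)
    return balanced


# Hypothetical toy data: English dominates, as the paper reports for MMT.
samples = [
    ("good morning everyone", "en"),
    ("what a great match", "en"),
    ("new phone launched today", "en"),
    ("heavy rain in the city", "en"),
    ("kal milte hain", "hi"),
    ("match bahut accha tha yaar", "hi-en"),
    ("aaj ka weather is so nice", "hi-en"),
]
balanced = oversample_by_language(samples)
```

Plain duplication like this only rebalances class counts; it adds no new linguistic variety, which is why the paper also mentions paraphrasing and transfer learning as options.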
In summary, the paper delves into the processing and analysis of multilingual and multi-topic social media data by constructing the MMT dataset, highlighting the limitations of existing tools and techniques in handling such complex data, and proposing some directions for improvement.
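
To make the language identification comparison concrete, the sketch below scores a language identifier separately for each gold language, which is one way a gap between English and non-English or code-mixed text would show up. It is a minimal sketch under stated assumptions: `per_language_accuracy`, `toy_predict`, the labels, and the example tweets are hypothetical, and `toy_predict` is a stand-in for a real tool rather than the API of any of the systems named above.

```python
from collections import Counter


def per_language_accuracy(examples, predict):
    """Return accuracy per gold language label.

    `examples` is a list of (text, gold_label) pairs and `predict` is any
    callable that maps a text to a predicted language label.
    """
    total = Counter()
    correct = Counter()
    for text, gold in examples:
        total[gold] += 1
        if predict(text) == gold:
            correct[gold] += 1
    return {lang: correct[lang] / total[lang] for lang in total}


def toy_predict(text):
    # Hypothetical stand-in for a real identifier: it labels everything "en".
    return "en"


# Hypothetical annotated tweets with gold language labels.
examples = [
    ("see you tomorrow", "en"),
    ("what a match", "en"),
    ("kal milte hain", "hi"),
    ("match bahut accha tha yaar", "hi-en"),
]
scores = per_language_accuracy(examples, toy_predict)
```

In this toy setup the overall accuracy is 0.5, while the per-language scores are 1.0 for English and 0.0 for the other two labels. Reporting scores per language rather than a single aggregate keeps that kind of gap visible.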