Abstract: Social media plays a significant role in cross-cultural communication. A vast amount of this occurs in code-mixed and multilingual form, posing a significant challenge to Natural Language Processing (NLP) tools for processing such information, like language identification, topic modeling, and named-entity recognition. To address this, we introduce a large-scale multilingual and multi-topic dataset (MMT) collected from Twitter (1.7 million Tweets), encompassing 13 coarse-grained and 63 fine-grained topics in the Indian context. We further annotate a subset of 5,346 tweets from the MMT dataset with various Indian languages and their code-mixed counterparts. Also, we demonstrate that the currently existing tools fail to capture the linguistic diversity in MMT on two downstream tasks, i.e., topic modeling and language identification. To facilitate future research, we will make the anonymized and annotated dataset available in the public domain.

What problem does this paper attempt to address?

The paper mainly addresses the following issues:

1. **Building a Multilingual, Multi-Topic Indian Social Media Dataset**: The authors constructed a large-scale dataset named MMT, comprising approximately 1.7 million tweets collected from Twitter and covering 13 coarse-grained and 63 fine-grained topics in the Indian context. The coarse-grained topics are the environment, food, economy and retail, natural disasters, arts and literature, sports, politics, research and development & technology, wildlife and vegetation, manufacturing, movies & OTT, news media, and education.
2. **Language Identification Challenge**: The study found that existing language identification tools perform poorly on the linguistic diversity in the MMT dataset, especially on multilingual and code-mixed texts. About 11.45% of the tweets were labeled as code-mixed, and Twitter's language identification system often mislabels these mixed-language tweets.
3. **Topic Modeling Evaluation**: The paper discusses the performance of traditional topic modeling tools (such as LDA) in a multilingual environment and evaluates a Zero-Shot Cross-Lingual Contextual Topic Model (ZeroShotTM) as a way to overcome the limitations of bag-of-words models, which ignore syntax and word order and consider only word frequency. Experimental results show that ZeroShotTM outperforms LDA in topic modeling tasks, especially on multilingual datasets.
4. **Performance of Language Identification Tools**: The paper also evaluates several language identification systems (such as Polyglot, FastText, Langdetect, and CLD3) on the MMT dataset, finding that they perform well on English data but are less effective on non-English or multilingual data.
5. **Future Work**: The paper notes that building a robust multilingual system that performs well across all languages is challenging because English is overrepresented in the dataset and certain languages are underrepresented. To address this issue, data augmentation techniques such as paraphrasing and oversampling, as well as transfer learning methods, can be employed.

In summary, the paper examines the processing and analysis of multilingual and multi-topic social media data by constructing the MMT dataset, highlighting the limitations of existing tools and techniques in handling such complex data, and proposing some directions for improvement.
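
The comparison in points 2 and 4 rests on breaking a language identifier's accuracy down by gold language label rather than reporting one aggregate number. The following is a minimal sketch of that breakdown, not the authors' evaluation code: the function name `per_language_accuracy`, the label strings, and the toy label lists are all assumptions made up for illustration and are not taken from the MMT dataset.

```python
from collections import defaultdict


def per_language_accuracy(gold, predicted):
    """Return {gold_label: accuracy} for two parallel lists of labels."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for g, p in zip(gold, predicted):
        total[g] += 1
        if g == p:
            correct[g] += 1
    return {label: correct[label] / total[label] for label in total}


# Toy, made-up labels for illustration only (hypothetical, not MMT data).
# "hi-en" stands for a Hindi-English code-mixed tweet.
gold = ["en", "en", "hi", "hi-en", "hi-en", "ta"]
predicted = ["en", "en", "hi", "en", "hi", "ta"]

scores = per_language_accuracy(gold, predicted)
for label, acc in sorted(scores.items()):
    print(f"{label}: {acc:.2f}")
```

Grouping by gold label keeps a dominant language such as English from masking weak performance on code-mixed or low-resource labels, which an overall accuracy figure would hide.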

MMT: A Multilingual and Multi-Topic Indian Social Media Dataset
