Abstract: Current computational approaches for analysing or generating code-mixed sentences do not explicitly model the "naturalness" or "acceptability" of code-mixed sentences, but rely on training corpora to reflect the distribution of acceptable code-mixed sentences. Modelling human judgement of the acceptability of code-mixed text can help distinguish natural code-mixed text and enable quality-controlled generation of code-mixed text. To this end, we construct Cline, a dataset containing human acceptability judgements for English-Hindi (en-hi) code-mixed text. Cline is the largest of its kind with 16,642 sentences, drawn from two sources: synthetically generated code-mixed text and samples collected from online social media. Our analysis establishes that popular code-mixing metrics such as CMI, Number of Switch Points, and Burstiness, which are used to filter, curate, and compare code-mixed corpora, have low correlation with human acceptability judgements, underlining the necessity of our dataset. Experiments using Cline demonstrate that simple Multilayer Perceptron (MLP) models trained solely on code-mixing metrics are outperformed by fine-tuned pre-trained Multilingual Large Language Models (MLLMs). Specifically, XLM-Roberta and Bernice outperform IndicBERT across different configurations in challenging data settings. Comparison with ChatGPT's zero-shot and few-shot capabilities shows that MLLMs fine-tuned on larger data outperform ChatGPT, providing scope for improvement in code-mixed tasks. Zero-shot transfer from English-Hindi to English-Telugu acceptability judgements using our model checkpoints proves superior to random baselines, enabling application to other code-mixed language pairs and providing further avenues of research. We publicly release our human-annotated dataset, trained checkpoints, code-mix corpus, and code for data generation and model training.
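
For readers unfamiliar with the code-mixing metrics named in the abstract, the following is a minimal sketch of two of them, computed from per-token language tags. It is not the authors' released code. It assumes the commonly used definition of the Code-Mixing Index, CMI = 100 * (1 - max_lang_count / (n - u)), where n is the number of tokens and u the number of language-independent tokens, and it assumes that switch points are counted between consecutive language-tagged tokens with language-independent tokens skipped. The tag names (`"hi"`, `"en"`, `"other"`) and function names are hypothetical.

```python
from collections import Counter

# Hypothetical tag for language-independent tokens (punctuation, named entities, etc.).
OTHER = "other"


def code_mixing_index(tags):
    """Sketch of CMI: 100 * (1 - dominant-language count / (n - u)).

    n is the total number of tokens, u the number tagged OTHER.
    Returns 0.0 when there are no language-tagged tokens (n == u).
    """
    n = len(tags)
    lang_counts = Counter(t for t in tags if t != OTHER)
    u = n - sum(lang_counts.values())
    if n == u:
        return 0.0
    return 100.0 * (1.0 - max(lang_counts.values()) / (n - u))


def num_switch_points(tags):
    """Count language changes between consecutive language-tagged tokens.

    Assumption: OTHER tokens are dropped before comparing neighbours.
    """
    langs = [t for t in tags if t != OTHER]
    return sum(1 for a, b in zip(langs, langs[1:]) if a != b)


# Toy tag sequence for a six-token sentence (illustrative only).
tags = ["hi", "hi", "en", "en", "hi", "other"]
print(code_mixing_index(tags))
print(num_switch_points(tags))
```

Both functions look only at the tag sequence and ignore word order within a language and grammatical constraints at the switch, which may help explain why such metrics can correlate weakly with human acceptability judgements.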
