Abstract: As code-mixed data becomes more common, there is a growing need to handle it well. Such data poses several challenges, including spelling variations, multiple languages, different scripts, and a lack of resources. Current language models struggle with code-mixed data because they focus primarily on the semantic representation of words and ignore auditory (phonetic) features, which makes it difficult for them to handle spelling variations in code-mixed text. In this paper, we propose an effective approach for building language models for code-mixed text that uses auditory information about words obtained from SOUNDEX. Our approach comprises a pre-training step based on masked language modelling that includes SOUNDEX representations (SAMLM), and a new method of providing input data to the pre-trained model. Through experiments on several code-mixed datasets (in different languages) for sentiment, offensive, and aggression classification tasks, we establish that our novel language modelling approach (SAMLM) improves robustness to adversarial attacks on code-mixed classification tasks. Our SAMLM-based approach also yields better classification results than popular baselines for code-mixed tasks. We use the explainability technique SHAP (SHapley Additive exPlanations) to explain how the auditory features incorporated through SAMLM help the model handle code-mixed text effectively and increase robustness against adversarial attacks.\footnote{Source code has been made available at \url{https://github.com/20118/DefenseWithPhonetics} and \url{https://www.iitp.ac.in/~ai-nlp-ml/resources.html\#Phonetics}}
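
To illustrate why SOUNDEX codes can help with spelling variations, the following is a minimal sketch of the standard American Soundex algorithm in Python. It is not the authors' implementation, and it does not show how SAMLM feeds these codes into pre-training; the function name `soundex` and the romanized Hindi example words are assumptions chosen for illustration only.

```python
# Minimal sketch of standard American Soundex (an illustration, not the
# authors' code). Words that sound alike tend to receive the same code.

# Map each coded consonant to its Soundex digit.
CODES = {}
for digit, letters in {
    "1": "bfpv",
    "2": "cgjkqsxz",
    "3": "dt",
    "4": "l",
    "5": "mn",
    "6": "r",
}.items():
    for ch in letters:
        CODES[ch] = digit


def soundex(word: str, length: int = 4) -> str:
    """Return the Soundex code of `word`, padded with zeros to `length`."""
    letters = [c for c in word.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    encoded = first.upper()
    prev = CODES.get(first, "")  # the first letter's digit also suppresses repeats
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w are skipped and do not separate equal digits
        code = CODES.get(ch, "")  # vowels and y have no digit
        if code and code != prev:
            encoded += code
        prev = code  # a vowel resets prev, so a repeated digit after it is kept
        if len(encoded) == length:
            break
    return encoded.ljust(length, "0")


if __name__ == "__main__":
    # Hypothetical romanized spelling variants of the same Hindi words.
    for variants in (["bahut", "bohot", "bahot"], ["pyaar", "pyar"], ["accha", "acha"]):
        print(variants, [soundex(v) for v in variants])
```

In this sketch, spelling variants such as "bahut" and "bohot" collapse to the same code, which is the kind of phonetic signal the abstract describes as missing from purely semantic word representations.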

Normalization of Transliterated Words in Code-Mixed Data Using Seq2Seq Model & Levenshtein Distance

Normalizing Text using Language Modelling based on Phonetics and String Similarity

Adapting Sequence to Sequence models for Text Normalization in Social Media

A Simple and Efficient Probabilistic Language model for Code-Mixed Text

Language Modeling for Code-Switched Data: Challenges and Approaches

Transformer-based Models of Text Normalization for Speech Applications

Elevating Code-mixed Text Handling through Auditory Information of Words

Transformer-based Model for Word Level Language Identification in Code-mixed Kannada-English Texts

Script Normalization for Unconventional Writing of Under-Resourced Languages in Bilingual Communities

Exploiting Dialect Identification in Automatic Dialectal Text Normalization

Machine Normalization

Study of Encoder-Decoder Architectures for Code-Mix Search Query Translation

Language-agnostic Code-Switching in Sequence-To-Sequence Speech Recognition

A two-stage transliteration approach to improve performance of a multilingual ASR

Consensus-Based Machine Translation for Code-Mixed Texts

Language-agnostic Multilingual Modeling

Leveraging Language Identification to Enhance Code-Mixed Text Classification

A Fast, Compact, Accurate Model for Language Identification of Codemixed Text

Transforming Sequence Tagging Into A Seq2Seq Task

Automatic Textual Normalization for Hate Speech Detection

Chinese-English Mixed Text Normalization