Abstract: Motivated by the seemingly high accuracy levels of machine learning models in Moldavian versus Romanian dialect identification and the increasing research interest in this topic, we provide a follow-up on the Moldavian versus Romanian Cross-Dialect Topic Identification (MRC) shared task of the VarDial 2019 Evaluation Campaign. The shared task comprised two types of sub-tasks: one that consisted of discriminating between the Moldavian and Romanian dialects, and one that consisted of classifying documents by topic across the two dialects of Romanian. Participants achieved impressive scores, e.g., the top model for Moldavian versus Romanian dialect identification obtained a macro F1 score of 0.895. We conduct a subjective evaluation by human annotators, showing that humans attain much lower accuracy rates than machine learning (ML) models. Hence, it remains unclear why the methods proposed by participants attain such high accuracy rates. Our goal is to understand (i) why the proposed methods work so well (by visualizing the discriminative features) and (ii) to what extent these methods can keep their high accuracy levels, e.g., when we shorten the text samples to single sentences or when we use tweets at inference time. A secondary goal of our work is to propose an improved ML model using ensemble learning. Our experiments show that ML models can accurately identify the dialects, even at the sentence level and across different domains (news articles versus tweets). We also analyze the most discriminative features of the best-performing models, providing some explanations behind the decisions taken by these models. Interestingly, we learn new dialectal patterns previously unknown to us or to our human annotators. Furthermore, we conduct experiments showing that the machine learning performance on the MRC shared task can be improved through an ensemble based on stacking.

Literary and Colloquial Tamil Dialect Identification

Literary and Colloquial Dialect Identification for Tamil using Acoustic Features

A Feature Engineering Approach for Literary and Colloquial Tamil Speech Classification using 1D-CNN

Prompt Engineering Using GPT for Word-Level Code-Mixed Language Identification in Low-Resource Dravidian Languages

A Deep Learning Approach for Similar Languages, Varieties and Dialects

Convolutional neural network based language identification system: A spectrogram based approach

IruMozhi: Automatically classifying diglossia in Tamil

Exploiting Spectral Augmentation for Code-Switched Spoken Language Identification

Evaluating Dialect Robustness of Language Models via Conversation Understanding

Towards spoken dialect identification of Irish

Learning to Recognize Dialect Features

Dynamic Multi-scale Convolution for Dialect Identification

Dialect Identification Using Spectral and Prosodic Features on Single and Ensemble Classifiers

Two-stage Pipeline for Multilingual Dialect Detection

Automatic Language Identification Using Support Vector Machines and Phonetic N-gram

The Unreasonable Effectiveness of Machine Learning in Moldavian versus Romanian Dialect Identification

Low-resource speech recognition and dialect identification of Irish in a multi-task framework

Dialect Identification in Telugu Language Speech Utterance Using Modified Features with Deep Neural Network

Towards Offensive Language Identification for Tamil Code-Mixed YouTube Comments and Posts

Deep Learning Speech Synthesis Model for Word/Character-Level Recognition in the Tamil Language

A novel nearest interest point classifier for offline Tamil handwritten character recognition