GRDD: A Dataset for Greek Dialectal NLP

Stergios Chatzikyriakidis, Chatrine Qwaider, Ilias Kolokousis, Christina Koula, Dimitris Papadakis, Efthymia Sakellariou

2023-11-25

Abstract: In this paper, we present a dataset for the computational study of a number of Modern Greek dialects. It consists of raw text data from four dialects of Modern Greek: Cretan, Pontic, Northern Greek and Cypriot Greek. The dataset is of considerable size, albeit imbalanced, and represents the first attempt to create large-scale dialectal resources of this type for Modern Greek dialects. We then use the dataset to perform dialect identification. We experiment with traditional ML algorithms, as well as simple DL architectures. The results show very good performance on the task, potentially revealing that the dialects in question have characteristics distinct enough to allow even simple ML models to perform well. Error analysis is performed for the top-performing algorithms, showing that in a number of cases the errors are due to insufficient dataset cleaning.

Computation and Language

What problem does this paper attempt to address?

The paper addresses two linked problems: creating a dataset for the natural language processing (NLP) of Modern Greek dialects, and using that dataset for dialect identification. Specifically:

1. **Dataset Creation**:
   - The paper describes how text data for four Modern Greek dialects (Cretan Greek, Pontic Greek, Northern Greek, and Cypriot Greek) and Standard Modern Greek (SMG) were collected from the internet.
   - The dataset is large but imbalanced, with the most data for Cypriot Greek and the least for Northern Greek.
2. **Dialect Identification Task**:
   - Dialect identification experiments were conducted using traditional machine learning algorithms (such as Ridge Regression Classifier, Naive Bayes, and Support Vector Machine) and simple deep learning architectures (such as BiLSTM).
   - The results indicate that even simple machine learning models achieve good performance on dialect identification, possibly because these dialects have sufficiently distinct features.
3. **Error Analysis and Data Cleaning**:
   - An analysis of misclassified samples revealed that some errors were due to insufficient data cleaning, such as Standard Modern Greek texts having been incorrectly included in the dialect data.
   - Future work will further clean the data based on these error analyses to improve the dataset's validity and reliability.

In summary, the paper aims to advance natural language processing research for Greek dialects by creating a large-scale Modern Greek dialect dataset and evaluating how different models perform on dialect identification.
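To make the dialect identification setup concrete, here is a minimal sketch of one of the model families named above, a Naive Bayes classifier, written in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the summary does not say which features or hyperparameters the paper used, so the character trigram features, the add-one smoothing, and the names `char_ngrams` and `NgramNaiveBayes` are all hypothetical. The training strings are made-up placeholders, not Greek dialect data.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Multinomial Naive Bayes over character n-grams with add-one smoothing.

    Assumption: character trigrams are used as features; the paper's actual
    feature set is not stated in this summary.
    """

    def __init__(self, n=3):
        self.n = n
        self.ngram_counts = defaultdict(Counter)  # label -> n-gram counts
        self.doc_counts = Counter()               # label -> number of training texts
        self.vocab = set()                        # all n-grams seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n)
            self.ngram_counts[label].update(grams)
            self.doc_counts[label] += 1
            self.vocab.update(grams)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        grams = char_ngrams(text, self.n)
        best_label, best_score = None, -math.inf
        for label, counts in self.ngram_counts.items():
            # Log prior from class frequencies; with an imbalanced dataset
            # this term favours the larger classes.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + vocab_size
            for gram in grams:
                # Add-one smoothing keeps unseen n-grams from zeroing the score.
                score += math.log((counts[gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Placeholder strings standing in for dialect text (NOT real dialect data).
train_texts = ["lalala lala", "lala la lalala", "tiktik tik", "tik tiktik tik"]
train_labels = ["dialect_a", "dialect_a", "dialect_b", "dialect_b"]

model = NgramNaiveBayes(n=3).fit(train_texts, train_labels)
prediction_a = model.predict("la lala")
prediction_b = model.predict("tik tik")
```

Character n-grams are a common choice for separating closely related language varieties because they pick up spelling and morphological cues without needing a tokenizer. Note that the log prior ties predictions to class frequencies, which matters for a dataset as imbalanced as the one described here.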

RoDia: A New Dataset for Romanian Dialect Identification from Speech

A New Dataset for End-to-End Sign Language Translation: The Greek Elementary School Dataset

The Greek podcast corpus: Competitive speech models for low-resourced languages with weakly supervised data

Towards Systematic Monolingual NLP Surveys: GenA of Greek NLP

Exploring Diachronic and Diatopic Changes in Dialect Continua: Tasks, Datasets and Challenges

Natural Language Processing for Dialects of a Language: A Survey

GR-NLP-TOOLKIT: An Open-Source NLP Toolkit for Modern Greek

Quantifying the Dialect Gap and its Correlates Across Languages

GREEK-BERT: The Greeks visiting Sesame Street

An open access NLP dataset for Arabic dialects : Data collection, labeling, and model construction

A Multimodal German Dataset for Automatic Lip Reading Systems and Transfer Learning

NLP for The Greek Language: A Longer Survey

Multi-granular Legal Topic Classification on Greek Legislation

Learning to Recognize Dialect Features

Dialetto, ma Quanto Dialetto? Transcribing and Evaluating Dialects on a Continuum

Experiments in Text Classification: Analyzing the Sentiment of Electronic Product Reviews in Greek

Logion: Machine Learning for Greek Philology

Sentence Embedding Models for Ancient Greek Using Multilingual Knowledge Distillation

A Benchmark for Evaluating Machine Translation Metrics on Dialects Without Standard Orthography

Designing a System to Recognize Main Arabic Dialects