Abstract: The prevalence of social media presents a growing opportunity to collect and analyse examples of English varieties. Whilst these varieties were - and, in many cases, still are - used only in spoken contexts or hard-to-access private messages, social media sites like Twitter provide a platform for users to communicate informally in a scrapeable format. Notably, Indian English (Hinglish), Singaporean English (Singlish), and African-American English (AAE) can be commonly found online. These varieties pose a challenge to existing natural language processing (NLP) tools, as they often differ orthographically and syntactically from the standard English for which the majority of these tools are built. NLP models trained on standard English texts produce biased outcomes for users of underrepresented varieties. Some research has aimed to overcome the inherent biases caused by unrepresentative data through techniques like data augmentation or adjusting training models.

What problem does this paper attempt to address?

This paper addresses the limited inclusivity of natural language processing (NLP) toward different varieties of English. Social media platforms such as Twitter make it possible to collect and analyze these varieties. However, most NLP tools are built primarily for standard English and perform poorly on varieties such as Indian English, Singaporean English, and African-American English, which can result in algorithmic bias.

To address the bias inherent in existing data, the researchers create a tweet dataset drawn from users in countries where Asian and African varieties of English are predominant. They propose a six-class annotation framework that measures how closely a tweet adheres to standard English and, indirectly, reveals how English varieties appear in tweets. The dataset consists of 170,800 tweets from seven countries and is annotated by linguists familiar with the major English varieties of the respective regions. The study shows that pre-trained language models vary in accuracy when distinguishing Western English from non-Western (i.e., non-standard) English varieties.

The paper points out that, owing to historical colonization and Western influence, English has become a global lingua franca with distinct regional features, which may appear in spelling, grammar, and lexicon. Current datasets, however, mainly represent the United States and the United Kingdom and neglect English varieties from other regions. This leads to bias in technologies such as speech recognition and language identification, which affects the fairness of the services built on them. By analyzing the linguistic features of tweets, the researchers evaluate the performance gap of pre-trained language identifiers on non-Western English varieties. Their goal is to increase data diversity, reduce implicit demographic bias in NLP, and contribute to more equitable NLP systems.

The paper also discusses potential future applications and research directions, as well as the limitations of existing datasets.
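The performance gap mentioned above can be illustrated with a minimal sketch. This is not the paper's evaluation code: the group names, the label codes, the toy records, and the helper functions accuracy_by_group and accuracy_gap are all hypothetical, chosen only to show how per-group accuracy of a language identifier might be compared.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return {group: accuracy} for (group, gold_label, predicted_label) triples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, gold, predicted in records:
        total[group] += 1
        correct[group] += int(gold == predicted)
    return {group: correct[group] / total[group] for group in total}


def accuracy_gap(per_group):
    """Difference between the best and worst per-group accuracy."""
    return max(per_group.values()) - min(per_group.values())


# Hypothetical toy records (NOT data from the paper): every tweet is English,
# but the assumed identifier mislabels some tweets from the non-Western group.
toy_records = [
    ("western", "en", "en"),
    ("western", "en", "en"),
    ("western", "en", "en"),
    ("western", "en", "en"),
    ("non_western", "en", "en"),
    ("non_western", "en", "hi"),
    ("non_western", "en", "en"),
    ("non_western", "en", "ms"),
]

per_group = accuracy_by_group(toy_records)
gap = accuracy_gap(per_group)
print(per_group, gap)
```

A larger gap between groups would suggest that the identifier treats tweets from one group less reliably than the other, which is the kind of disparity the summary describes.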

Towards Better Inclusivity: A Diverse Tweet Corpus of English Varieties
