Towards Better Inclusivity: A Diverse Tweet Corpus of English Varieties

Nhi Pham, Lachlan Pham, Adam L. Meyers
2024-01-21
Abstract: The prevalence of social media presents a growing opportunity to collect and analyse examples of English varieties. Whilst these varieties were - and, in many cases, still are - used only in spoken contexts or hard-to-access private messages, social media sites like Twitter provide a platform for users to communicate informally in a scrapeable format. Notably, Indian English (Hinglish), Singaporean English (Singlish), and African-American English (AAE) can be commonly found online. These varieties pose a challenge to existing natural language processing (NLP) tools, as they often differ orthographically and syntactically from the standard English for which the majority of these tools are built. NLP models trained on standard English texts produce biased outcomes for users of underrepresented varieties. Some research has aimed to overcome the inherent biases caused by unrepresentative data through techniques like data augmentation or adjusting training models.
Computation and Language, Computers and Society
What problem does this paper attempt to address?
This paper addresses the limited inclusivity of natural language processing (NLP) with respect to different varieties of English. Social media platforms such as Twitter make it possible to collect and analyze these varieties, but most NLP tools are built for standard English and perform poorly on varieties such as Indian English, Singaporean English, and African American English, which can result in algorithmic bias. To address the bias inherent in existing data, the researchers create a dataset of tweets from users in countries where Asian and African English varieties are predominant. They propose a six-class annotation framework that measures how close a tweet is to standard English and thereby indirectly reveals how English varieties manifest in tweets. The dataset consists of 170,800 tweets from seven countries and is annotated by linguists familiar with the major English varieties of the respective regions. The study shows that pre-trained language models differ in accuracy when differentiating between Western English and non-Western (i.e., non-standard) English varieties.
The paper points out that, owing to historical colonization and Western influence, English has become a global lingua franca with distinct regional features, which may appear in spelling, grammar, and lexicon. Current datasets, however, mainly represent samples from the United States and the United Kingdom and neglect the English of other regions. This leads to bias in technologies such as speech recognition and language identification, which in turn affects the fairness of services. By analyzing the language features of tweets, the researchers evaluate the performance gap of pre-trained language identifiers on non-Western English varieties. Their goal is to increase data diversity, reduce implicit demographic bias in NLP, and contribute to more equitable NLP systems.
The paper also discusses potential future applications and research directions, as well as the limitations of existing datasets.
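The performance-gap evaluation described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `accuracy_by_group`, the group labels, and the toy records are all hypothetical and made up for illustration. It assumes every tweet is in fact English, so a language identifier is counted as correct only when it predicts `"en"`.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return {group: fraction of tweets the identifier labelled as English}.

    `records` is an iterable of (group, predicted_language) pairs.
    All tweets are assumed to be English, so "en" is the correct label.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for group, predicted_lang in records:
        totals[group] += 1
        if predicted_lang == "en":
            correct[group] += 1
    return {group: correct[group] / totals[group] for group in totals}


# Toy, made-up predictions (NOT data or results from the paper).
toy_records = [
    ("western", "en"), ("western", "en"), ("western", "en"), ("western", "en"),
    ("non_western", "en"), ("non_western", "hi"),
    ("non_western", "en"), ("non_western", "ms"),
]

scores = accuracy_by_group(toy_records)
# Difference in accuracy between the two hypothetical groups.
gap = scores["western"] - scores["non_western"]
print(scores, gap)
```

With real data, the same per-group comparison could be run per country or per annotation class to see where an identifier's accuracy drops.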