Abstract: A cornerstone of AI research has been the creation and adoption of standardized training and test datasets to track the progress of state-of-the-art models. A particularly successful example is the GLUE dataset for training and evaluating Natural Language Understanding (NLU) models for English. A large body of research on self-supervised BERT-based language models has revolved around performance improvements on the NLU tasks in GLUE. To evaluate language models in other languages, several language-specific GLUE datasets were created. The area of speech language understanding (SLU) has followed a similar trajectory. The success of large self-supervised models such as wav2vec2 enables the creation of speech models from relatively easy-to-access unlabelled data. These models can then be evaluated on SLU tasks, such as those in the SUPERB benchmark. In this work, we extend this to Indic languages by releasing the IndicSUPERB benchmark. Specifically, we make the following three contributions. (i) We collect Kathbath, which contains 1,684 hours of labelled speech data across 12 Indian languages from 1,218 contributors located in 203 districts in India. (ii) Using Kathbath, we create benchmarks across 6 speech tasks for 12 languages: Automatic Speech Recognition, Speaker Verification, Speaker Identification (mono/multi), Language Identification, Query By Example, and Keyword Spotting. (iii) On the released benchmarks, we train and evaluate different self-supervised models alongside a commonly used baseline, FBANK. We show that language-specific fine-tuned models are more accurate than the baseline on most of the tasks, including a large gap of 76% for the Language Identification task. However, for Speaker Identification, self-supervised models trained on large datasets demonstrate an advantage. We hope IndicSUPERB contributes to the progress of developing speech language understanding models for Indian languages.

DIALECTBENCH: A NLP Benchmark for Dialects, Varieties, and Closely-Related Languages

GlobalBench: A Benchmark for Global Progress in Natural Language Processing

Dialetto, ma Quanto Dialetto? Transcribing and Evaluating Dialects on a Continuum

Natural Language Processing for Dialects of a Language: A Survey

AraDiCE: Benchmarks for Dialectal and Cultural Capabilities in LLMs

Experiences from Creating a Benchmark for Sentiment Classification for Varieties of English

Multi-VALUE: A Framework for Cross-Dialectal English NLP

One Language, Many Gaps: Evaluating Dialect Fairness and Robustness of Large Language Models in Reasoning Tasks

XferBench: a Data-Driven Benchmark for Emergent Language

Dynabench: Rethinking Benchmarking in NLP

VALUE: Understanding Dialect Disparity in NLU

IndicSUPERB: A Speech Processing Universal Performance Benchmark for Indian languages

Quantifying the Dialect Gap and its Correlates Across Languages

Exploring Diachronic and Diatopic Changes in Dialect Continua: Tasks, Datasets and Challenges

FB-Bench: A Fine-Grained Multi-Task Benchmark for Evaluating LLMs' Responsiveness to Human Feedback

Evaluating Dialect Robustness of Language Models via Conversation Understanding

AAVENUE: Detecting LLM Biases on NLU Tasks in AAVE via a Novel Benchmark

AudioBench: A Universal Benchmark for Audio Large Language Models

VoiceBench: Benchmarking LLM-Based Voice Assistants

Benchmarking Linguistic Diversity of Large Language Models

IndicGenBench: A Multilingual Benchmark to Evaluate Generation Capabilities of LLMs on Indic Languages