Abstract: Recent advancements in text-to-speech (TTS) synthesis show that large-scale models trained with extensive web data produce highly natural-sounding output. However, such data is scarce for Indian languages due to the lack of high-quality, manually subtitled data on platforms like LibriVox or YouTube. To address this gap, we enhance existing large-scale ASR datasets containing natural conversations collected in low-quality environments to generate high-quality TTS training data. Our pipeline leverages the cross-lingual generalization of denoising and speech enhancement models trained on English and applied to Indian languages. This results in IndicVoices-R (IV-R), the largest multilingual Indian TTS dataset derived from an ASR dataset, with 1,704 hours of high-quality speech from 10,496 speakers across 22 Indian languages. IV-R matches the quality of gold-standard TTS datasets like LJSpeech, LibriTTS, and IndicTTS. We also introduce the IV-R Benchmark, the first to assess zero-shot, few-shot, and many-shot speaker generalization capabilities of TTS models on Indian voices, ensuring diversity in age, gender, and style. We demonstrate that fine-tuning an English pre-trained model on a combined dataset of high-quality IndicTTS and our IV-R dataset results in better zero-shot speaker generalization compared to fine-tuning on the IndicTTS dataset alone. Further, our evaluation reveals limited zero-shot generalization for Indian voices in TTS models trained on prior datasets, which we improve by fine-tuning the model on our data, which contains a diverse set of speakers across language families. We open-source all data and code, releasing the first TTS model for all 22 official Indian languages.

ADIMA: Abuse Detection In Multilingual Audio

Multilingual and Multimodal Abuse Detection

Abusive Speech Detection in Indic Languages Using Acoustic Features

Towards Cross-Lingual Audio Abuse Detection in Low-Resource Settings with Few-Shot Learning

Data Bootstrapping Approaches to Improve Low Resource Abusive Language Detection for Indic Languages

Large scale annotated dataset for code-mix abusive short noisy text

CoLLAB: A Collaborative Approach for Multilingual Abuse Detection

IndicVoices: Towards building an Inclusive Multilingual Speech Dataset for Indian Languages

User-Aware Multilingual Abusive Content Detection in Social Media

MLAAD: The Multi-Language Audio Anti-Spoofing Dataset

Breaking the Silence Detecting and Mitigating Gendered Abuse in Hindi, Tamil, and Indian English Online Spaces

Transferring Audio Deepfake Detection Capability Across Languages

Did You Hear That? Introducing AADG: A Framework for Generating Benchmark Data in Audio Anomaly Detection

Developing a Multilingual Annotated Corpus of Misogyny and Aggression

D3CODE: Disentangling Disagreements in Data across Cultures on Offensiveness Detection and Evaluation

IndicVoices-R: Unlocking a Massive Multilingual Multi-speaker Speech Corpus for Scaling Indian TTS

Svarah: Evaluating English ASR Systems on Indian Accents

Multilingual Abusiveness Identification on Code-Mixed Social Media Text

ADAM optimised human speech emotion recogniser based on statistical information distribution of chroma, MFCC, and MBSE features

Abusive Language Detection in Online User Content

Sound Check: Auditing Audio Datasets