Abstract: Recent advancements in text-to-speech (TTS) synthesis show that large-scale models trained with extensive web data produce highly natural-sounding output. However, such data is scarce for Indian languages due to the lack of high-quality, manually subtitled data on platforms like LibriVox or YouTube. To address this gap, we enhance existing large-scale ASR datasets, which contain natural conversations collected in low-quality environments, to generate high-quality TTS training data. Our pipeline leverages the cross-lingual generalization of denoising and speech enhancement models that are trained on English and applied to Indian languages. The result is IndicVoices-R (IV-R), the largest multilingual Indian TTS dataset derived from an ASR dataset, with 1,704 hours of high-quality speech from 10,496 speakers across 22 Indian languages. IV-R matches the quality of gold-standard TTS datasets like LJSpeech, LibriTTS, and IndicTTS. We also introduce the IV-R Benchmark, the first to assess the zero-shot, few-shot, and many-shot speaker generalization capabilities of TTS models on Indian voices, ensuring diversity in age, gender, and style. We demonstrate that fine-tuning an English pre-trained model on a combined dataset of high-quality IndicTTS and our IV-R dataset results in better zero-shot speaker generalization than fine-tuning on the IndicTTS dataset alone. Further, our evaluation reveals limited zero-shot generalization for Indian voices in TTS models trained on prior datasets, which we improve by fine-tuning the model on our data containing a diverse set of speakers across language families. We open-source all data and code, releasing the first TTS model for all 22 official Indian languages.

IndicVoices-R: Unlocking a Massive Multilingual Multi-speaker Speech Corpus for Scaling Indian TTS