Abstract: Text-to-speech synthesis (TTS) has been used as a data augmentation approach for automatic speech recognition (ASR), leveraging additional text for ASR training. However, in low-resource tasks, usually only a limited number of speakers are available, so the synthetic speech lacks speaker variation. In this paper, we propose a novel speaker augmentation approach that synthesizes data with sufficient speaker and text diversity. An end-to-end TTS system is trained with speaker representations from a variational auto-encoder (VAE), which enables the TTS system to synthesize speech from unseen speakers by sampling from the trained latent distribution. As a new type of data augmentation, speaker augmentation can be combined with traditional feature augmentation approaches such as SpecAugment. Experiments on a Switchboard task show that, given 50 hours of data, the proposed speaker augmentation combined with SpecAugment significantly reduces word error rate (WER) by 30% relative compared to the system without any data augmentation, and by about 18% relative compared to the system with SpecAugment alone.
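
To illustrate the sampling step, here is a minimal sketch of drawing latent vectors for unseen speakers and pairing them with additional texts. It is not the authors' implementation. It assumes the latent distribution can be approximated by a standard normal prior, and the function names, latent dimension, and example texts are hypothetical. The TTS model that would consume each (text, speaker vector) pair is not shown.

```python
import numpy as np


def sample_speaker_latents(num_speakers, latent_dim, seed=0):
    """Draw one latent vector per new speaker.

    Assumption: the latent distribution is approximated by N(0, I);
    a trained VAE may call for a different sampling distribution.
    """
    rng = np.random.default_rng(seed)
    return rng.standard_normal((num_speakers, latent_dim))


def pair_texts_with_speakers(texts, latents):
    """Assign each text a sampled speaker, cycling through the latents."""
    return [(text, latents[i % len(latents)]) for i, text in enumerate(texts)]


# Hypothetical sizes and texts, chosen only for illustration.
latents = sample_speaker_latents(num_speakers=4, latent_dim=16)
texts = ["hello world", "good morning", "see you soon", "thank you", "how are you"]
pairs = pair_texts_with_speakers(texts, latents)
```

Each pair would then be passed to a speaker-conditioned TTS model to produce one synthetic utterance, so speaker diversity grows with the number of sampled vectors and text diversity with the number of texts.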

Speaker Augmentation for Low Resource Speech Recognition