Lightspeech: Lightweight Non-Autoregressive Multi-Speaker Text-To-Speech

Song Li, Beibei Ouyang, Lin Li, Qingyang Hong
DOI: https://doi.org/10.1109/slt48900.2021.9383562
2021-01-01
Abstract: With the development of deep learning, end-to-end neural text-to-speech systems have made significant progress in high-quality speech synthesis. However, most of these systems are attention-based autoregressive models, which results in slow synthesis and large parameter counts. In this paper, we propose a new lightweight non-autoregressive multi-speaker speech synthesis system, named LightSpeech, which uses lightweight feedforward neural networks to accelerate synthesis and reduce the number of parameters. Using a speaker embedding, LightSpeech performs multi-speaker speech synthesis extremely quickly. Experiments on the LibriTTS dataset show that, compared with FastSpeech, our smallest LightSpeech model achieves a 9.27x speedup in Mel-spectrogram generation on CPU, and the model size and parameter count are compressed by 37.06x and 37.36x, respectively.