DeepGAN: A Fast and High-Quality Time-Domain-based Neural Vocoder for Low-Resource Scenarios

Yuan Jiang, Shun Bao, Yajun Hu, Li-Juan Liu, Guo-Ping Hu, Yang Ai, Zhenhua Ling
DOI: https://doi.org/10.1145/3653876.3653893
2024-01-01
Abstract: Recent advances in neural vocoders have relied primarily on generative adversarial networks (GANs) operating in the time domain. However, these vocoders are parameter-heavy and computationally expensive, which limits their use in resource-constrained environments such as embedded devices. Depthwise separable convolution, which has fewer parameters and a lower computational cost than standard convolution, can be used to build lightweight networks. In this paper, we introduce DeepGAN, an extension of HiFi-GAN that uses depthwise separable convolution as the primary unit of the network, introduces a novel upsampling module, and incorporates a lightweight excitation generation network to enhance the quality of the generated speech. Both objective and subjective evaluations show that DeepGAN achieves results comparable to those of competing vocoders for both seen and unseen speakers. Notably, DeepGAN has only 1/7 as many parameters as HiFi-GAN, yielding an approximately sixfold improvement in generation speed while maintaining the quality of the synthesized speech.
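The parameter saving that the abstract attributes to depthwise separable convolution can be illustrated with a simple count. The following is a minimal sketch, not the DeepGAN architecture: the function names, the channel width of 512, and the kernel size of 7 are illustrative assumptions and are not taken from the paper. It compares a standard 1-D convolution with a depthwise convolution followed by a pointwise (kernel size 1) convolution.

```python
def conv1d_params(in_channels: int, out_channels: int, kernel_size: int) -> int:
    """Weights plus biases of a standard 1-D convolution."""
    return in_channels * out_channels * kernel_size + out_channels


def separable_conv1d_params(in_channels: int, out_channels: int, kernel_size: int) -> int:
    """Weights plus biases of a depthwise separable 1-D convolution.

    Depthwise step: one kernel per input channel (no channel mixing).
    Pointwise step: a kernel-size-1 convolution that mixes channels.
    """
    depthwise = in_channels * kernel_size + in_channels
    pointwise = in_channels * out_channels + out_channels
    return depthwise + pointwise


if __name__ == "__main__":
    # Hypothetical layer shape, chosen only for illustration.
    channels, kernel = 512, 7
    standard = conv1d_params(channels, channels, kernel)
    separable = separable_conv1d_params(channels, channels, kernel)
    print(f"standard:  {standard}")
    print(f"separable: {separable}")
    print(f"ratio:     {standard / separable:.2f}")
```

For a wide layer the ratio approaches the kernel size, because the pointwise term dominates the separable count. The ratio printed for this single hypothetical layer should not be read as the whole-model reduction reported in the abstract, which also depends on the upsampling module and the excitation network.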