Embedding a Differentiable Mel-cepstral Synthesis Filter to a Neural Speech Synthesis System

Takenori Yoshimura, Shinji Takaki, Kazuhiro Nakamura, Keiichiro Oura, Yukiya Hono, Kei Hashimoto, Yoshihiko Nankaku, Keiichi Tokuda
DOI: https://doi.org/10.48550/arXiv.2211.11222
2022-11-21
Abstract: This paper integrates a classic mel-cepstral synthesis filter into a modern neural speech synthesis system, as a step toward end-to-end controllable speech synthesis. Because the mel-cepstral synthesis filter is explicitly embedded in the neural waveform models of the proposed system, both the voice characteristics and the pitch of the synthesized speech are highly controllable, via a frequency warping parameter and the fundamental frequency, respectively. We implement the mel-cepstral synthesis filter as a differentiable, GPU-friendly module so that the acoustic and waveform models in the proposed system can be optimized simultaneously in an end-to-end manner. Experiments show that the proposed system improves speech quality over a baseline system while maintaining controllability. The core PyTorch modules used in the experiments will be publicly available on GitHub.
Audio and Speech Processing, Computation and Language, Sound
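
The abstract does not define the frequency warping parameter. In mel-cepstral analysis it is conventionally the coefficient alpha of a first-order all-pass filter, and the sketch below assumes that convention; it is an illustration, not the paper's code. The function name `warp_frequency` and the value 0.42 (a commonly quoted mel-like setting for 16 kHz speech) are illustrative assumptions and do not come from the abstract.

```python
import math


def warp_frequency(omega: float, alpha: float) -> float:
    """Warped angular frequency under a first-order all-pass transform.

    Assumes the conventional substitution z^-1 -> (z^-1 - alpha) / (1 - alpha * z^-1),
    whose phase response maps omega in [0, pi] onto [0, pi] for |alpha| < 1.
    """
    if not -1.0 < alpha < 1.0:
        raise ValueError("alpha must satisfy |alpha| < 1")
    # For |alpha| < 1 the denominator is positive, so atan2 equals plain atan here.
    return omega + 2.0 * math.atan2(alpha * math.sin(omega),
                                    1.0 - alpha * math.cos(omega))


if __name__ == "__main__":
    # Illustrative values only: alpha = 0 leaves the axis unchanged,
    # a positive alpha stretches the low-frequency region.
    for alpha in (0.0, 0.42):
        warped = [warp_frequency(k * math.pi / 4, alpha) for k in range(5)]
        print(alpha, [round(w, 3) for w in warped])
```

Under this assumed convention, changing alpha at synthesis time shifts the spectral envelope along the frequency axis, which is the usual mechanism by which such a parameter alters perceived voice characteristics.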