An initial research: Towards accurate pitch extraction for speech synthesis based on BLSTM
Yibin Zheng, Zhengqi Wen, Bin Liu, Ya Li, Jianhua Tao
DOI: https://doi.org/10.1109/ICSP.2016.7877817
2016-01-01
Abstract: Accurate pitch extraction from speech is an important but challenging problem for speech synthesis. However, most existing pitch estimators operate frame by frame and therefore do not fully exploit the additive nature and long-term suprasegmental property of pitch features. As a result, they introduce inherent discontinuities, such as double/half F0 errors and unvoiced/voiced (U/V) errors. These errors adversely affect the quality of synthetic speech as well as the expressiveness of the prosody information. In this paper, we explore the novel use of a multi-task (Task 1: U/V; Task 2: pitch) bidirectional long short-term memory recurrent neural network (BLSTM) to model the pitch and the voicing decision simultaneously in a unified framework. The features used in this study are extracted from the frequency domain. We compute the log-frequency power spectrogram and then normalize it to the long-term speech spectrum to attenuate noise. A filter is then used to enhance the harmonicity. Experiments show that the proposed approach substantially outperforms RAPT, which performs best in clean conditions. In addition, the proposed approach still works well under a certain level of background noise.
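
The first two feature steps named in the abstract (a log-frequency power spectrogram, followed by normalization to a long-term spectrum) can be illustrated with a minimal, dependency-free Python sketch. This is not the authors' implementation. The function names, frame sizes, and frequency range are all hypothetical, and the normalization here is a simplifying assumption: it subtracts the per-bin mean of the utterance in the log domain, which may differ from the paper's procedure. The harmonicity-enhancing filter and the BLSTM itself are not shown.

```python
import cmath
import math


def log_spaced_frequencies(f_min, f_max, num_bins):
    """Return num_bins frequencies (Hz) spaced evenly on a log axis."""
    ratio = (f_max / f_min) ** (1.0 / (num_bins - 1))
    return [f_min * ratio ** k for k in range(num_bins)]


def log_freq_power_spectrum(frame, sample_rate, freqs_hz):
    """Log power of one Hann-windowed frame at the given frequencies."""
    n = len(frame)
    windowed = [
        x * (0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)))
        for i, x in enumerate(frame)
    ]
    spectrum = []
    for f in freqs_hz:
        # Direct DFT evaluation at an arbitrary frequency (slow but simple).
        step = -2j * math.pi * f / sample_rate
        value = sum(x * cmath.exp(step * i) for i, x in enumerate(windowed))
        # Small floor avoids log(0) on silent frames.
        spectrum.append(math.log(abs(value) ** 2 + 1e-12))
    return spectrum


def normalize_to_long_term_spectrum(spectrogram):
    """Assumed normalization: subtract each bin's mean over all frames."""
    num_frames = len(spectrogram)
    num_bins = len(spectrogram[0])
    long_term = [
        sum(frame[k] for frame in spectrogram) / num_frames
        for k in range(num_bins)
    ]
    return [
        [frame[k] - long_term[k] for k in range(num_bins)]
        for frame in spectrogram
    ]


def extract_features(signal, sample_rate, frame_len, hop, freqs_hz):
    """Frame the signal, compute log-frequency spectra, then normalize."""
    starts = range(0, len(signal) - frame_len + 1, hop)
    frames = [signal[s:s + frame_len] for s in starts]
    spectrogram = [
        log_freq_power_spectrum(f, sample_rate, freqs_hz) for f in frames
    ]
    return normalize_to_long_term_spectrum(spectrogram)
```

One reason a log-frequency axis is a natural choice for pitch features: harmonics of F0 sit at fixed offsets (log 2, log 3, ...) from the fundamental regardless of its value, so a single fixed filter shape can in principle emphasize harmonic structure at any pitch.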