Japanese Neural Incremental Text-to-Speech Synthesis Framework With an Accent Phrase Input

Tomoya Yanagita, Sakriani Sakti, Satoshi Nakamura
DOI: https://doi.org/10.1109/access.2023.3251657
IF: 3.9
2023-01-01
IEEE Access
Abstract: Work on neural incremental text-to-speech (iTTS), which is attracting increasing attention, has recently pursued low-latency processing by generating speech on the fly before the complete sentence has been read. Most current state-of-the-art iTTS systems use a prefix-to-prefix neural iTTS framework with a look-ahead of 1-2 unit segments (i.e., phonemes or words). However, because the Japanese language is based on accent phrase units, which are longer than words, using a prefix-to-prefix neural iTTS with a look-ahead approach increases latency. Here, we propose an alternative to the end-to-end neural iTTS architecture that does not apply look-ahead input when synthesizing speech chunks. We further propose a method to use information from the previous time step by connecting the synthesized vector and the model’s internal state to the current time step. We experimentally investigated the latency of various iTTS systems with different modeling and synthesis chunks. The experimental results show that, for Japanese, the proposed iTTS synthesizes higher-quality speech than the conventional baseline prefix-to-prefix neural iTTS with word units, within a similar latency range. Moreover, we found that the proposed approach improved the prosodic naturalness among synthesized units in Japanese. Subjective evaluations also revealed that the proposed approach with an incremental unit of two accent phrases achieved the best scores among the Japanese iTTS systems.
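The idea of synthesizing chunk by chunk without look-ahead, while passing the last synthesized vector and the model's internal state forward, can be illustrated with a minimal sketch. This is not the authors' implementation: `CarryOver`, `toy_synthesize_chunk`, and `incremental_tts` are hypothetical names, the "model" is a toy recurrence standing in for a neural acoustic model, and the romanized accent phrases are made-up inputs.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional, Tuple


@dataclass
class CarryOver:
    """Hypothetical container for what is passed from one chunk to the next."""
    last_frame: List[float]  # last synthesized vector of the previous chunk
    hidden: List[float]      # model internal state after the previous chunk


def toy_synthesize_chunk(
    chunk: str, carry: Optional[CarryOver], dim: int = 4
) -> Tuple[List[List[float]], CarryOver]:
    """Toy stand-in for an acoustic model (assumption: one frame per symbol)."""
    hidden = list(carry.hidden) if carry else [0.0] * dim
    prev = list(carry.last_frame) if carry else [0.0] * dim
    frames: List[List[float]] = []
    for symbol in chunk:
        x = (ord(symbol) % 17) / 17.0                 # toy symbol embedding
        hidden = [0.5 * h + 0.5 * x for h in hidden]  # update internal state
        frame = [0.5 * p + 0.5 * h for p, h in zip(prev, hidden)]
        frames.append(frame)
        prev = frame
    return frames, CarryOver(last_frame=prev, hidden=hidden)


def incremental_tts(
    accent_phrases: List[str], unit_size: int = 2
) -> Iterator[List[List[float]]]:
    """Yield frames for each incremental unit as soon as that unit is available.

    No look-ahead: a unit is synthesized from its own accent phrases plus the
    carry-over from the previous unit, never from upcoming input.
    """
    carry: Optional[CarryOver] = None
    for start in range(0, len(accent_phrases), unit_size):
        chunk = "".join(accent_phrases[start:start + unit_size])
        frames, carry = toy_synthesize_chunk(chunk, carry)
        yield frames


if __name__ == "__main__":
    phrases = ["kyoowa", "iitenkidesune", "sanpo", "shimashoo"]  # made-up input
    for i, frames in enumerate(incremental_tts(phrases, unit_size=2)):
        print(f"unit {i}: {len(frames)} frames")
```

In this toy recurrence, carrying the state across unit boundaries makes the chunked output identical to synthesizing the whole sentence at once; a real neural model would only approximate that continuity, which is the property the carry-over is meant to encourage.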
computer science, information systems; telecommunications; engineering, electrical & electronic