Duration modeling for Chinese synthesis from C-toBI labeled corpus

Weibin Zhu, Liqin Shen, Xiaochuan Miu
DOI: https://doi.org/10.21437/icslp.2000-502
2000-01-01
Abstract: A set of labeling criteria, C-ToBI (Chinese Tone and Break Index), was redefined to annotate the prosodic events in continuous speech in a hierarchical structure. The hierarchy has four layers: intonational phrase, intermediate phrase, word, and syllable. The prosodic structure, break index, and stress index tiers represent the core prosodic events of an utterance. The stress index represents the degree of accent of the constituents in each layer. The break index represents the degree of juncture between each pair of adjacent constituents in each layer. A duration model was built from a reading-style corpus labeled with C-ToBI. The factors affecting the duration of a given segment come from two relatively independent levels. First, at the segmental level, the phoneme identity of the segment and its context influence the duration. Second, at the suprasegmental level, the influences come from multiple layers and include the location and degree of stress and break in each layer. The factors that have the property of directional invariance form the feature vector that serves as the input to the linear duration model. The model is part of a speech synthesis system, and its parameters were estimated statistically.
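As a rough illustration of what a linear duration model over a feature vector can look like, the sketch below fits weights by ordinary least squares and predicts a segment duration as a weighted sum of features. Everything specific here is an assumption made for illustration and does not come from the paper: the three features (a bias term, a stress index, and a break index), the function names, the numeric values, and the choice of least squares as the estimation method. The data is synthetic, generated from known weights so that the fit can be checked.

```python
# Illustrative sketch only: feature names, values, and the least-squares
# estimator are assumptions, not the authors' actual feature set or method.

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b] so row operations apply to both sides.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Pick the row with the largest entry in this column as the pivot.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_linear_duration_model(features, durations):
    """Estimate weights w from the normal equations (X^T X) w = X^T y."""
    k = len(features[0])
    xtx = [[sum(row[i] * row[j] for row in features) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * d for row, d in zip(features, durations))
           for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict_duration(weights, feature_vector):
    """Predicted duration is the dot product of weights and features."""
    return sum(w * f for w, f in zip(weights, feature_vector))


# Synthetic toy data. Each row is [bias, stress_index, break_index].
# TRUE_WEIGHTS are made-up values in milliseconds, used only to generate
# durations that the fit should recover.
TRUE_WEIGHTS = [150.0, 20.0, 15.0]
FEATURES = [
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [1.0, 2.0, 1.0],
    [1.0, 1.0, 3.0],
    [1.0, 2.0, 2.0],
]
DURATIONS = [predict_duration(TRUE_WEIGHTS, f) for f in FEATURES]

weights = fit_linear_duration_model(FEATURES, DURATIONS)
```

Calling `predict_duration(weights, [1.0, 1.0, 1.0])` then gives the modeled duration for a hypothetical segment with stress index 1 and break index 1. A real system would use many more features, including segmental context, and would fit them on labeled corpus data rather than generated values.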