Abstract: This paper presents a maximum-likelihood approach to multiple fundamental frequency (F0) estimation for a mixture of harmonic sound sources, where the power spectrum of a time frame is the observation and the F0s are the parameters to be estimated. The likelihood model covers both spectral peaks and non-peak regions (frequencies farther than a musical quarter tone from all observed peaks). It is shown that the peak likelihood and the non-peak region likelihood act as a complementary pair: the former helps find F0s whose harmonics explain the observed peaks, while the latter helps avoid F0s that have harmonics in non-peak regions. Parameters of these models are learned from monophonic and polyphonic training data. To avoid the combinatorial problem of concurrent F0 estimation, this paper proposes an iterative greedy search strategy that estimates F0s one by one, together with a polyphony estimation method that terminates the iterative process. It also proposes a postprocessing method that refines the polyphony and F0 estimates using neighboring frames. The relative contributions of the different components of the proposed method are analyzed, and the refinement component is shown to eliminate many inconsistent estimation errors. The method is evaluated on ten recorded four-part J. S. Bach chorales. Results show that it outperforms two state-of-the-art algorithms in both F0 estimation and polyphony estimation.
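The two structural ideas in the abstract, the quarter-tone definition of the non-peak region and the one-by-one greedy search, can be illustrated with a minimal Python sketch. This is not the paper's implementation: the function names are hypothetical, the `score` argument stands in for the learned likelihood (which the abstract does not specify), and the fixed `max_polyphony` cap is an assumption that replaces the paper's polyphony estimation step.

```python
import math

# A quarter tone is half a semitone, i.e. 50 cents.
QUARTER_TONE_CENTS = 50.0


def cents_between(f1_hz, f2_hz):
    """Absolute musical distance between two frequencies, in cents."""
    return abs(1200.0 * math.log2(f1_hz / f2_hz))


def in_non_peak_region(freq_hz, peak_freqs_hz):
    """True if freq_hz is farther than a quarter tone from every observed peak."""
    return all(
        cents_between(freq_hz, peak) > QUARTER_TONE_CENTS
        for peak in peak_freqs_hz
    )


def greedy_f0_search(candidates_hz, score, max_polyphony):
    """Pick F0s one at a time, each time keeping the best-scoring addition.

    `score` is a hypothetical callable that maps a list of F0s to a
    likelihood-like number (higher is better). Stopping after
    `max_polyphony` picks is an assumption made for this sketch; the
    paper terminates with a separate polyphony estimation method.
    """
    chosen = []
    remaining = list(candidates_hz)
    for _ in range(max_polyphony):
        if not remaining:
            break
        # Evaluate each remaining candidate jointly with the F0s already chosen.
        best = max(remaining, key=lambda f0: score(chosen + [f0]))
        chosen.append(best)
        remaining.remove(best)
    return chosen
```

Each greedy step scores only the remaining candidates rather than every combination of F0s, which is how a one-by-one search sidesteps the combinatorial cost mentioned in the abstract.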

Asynchronous F0 and Spectrum Modeling for HMM-based Speech Synthesis

Improving F0 prediction using bidirectional associative memories and syllable-level F0 features for HMM-based Mandarin speech synthesis

Cross-Stream Dependency Modeling for HMM-Based Speech Synthesis

Multi-Layer F0 Modeling for HMM-Based Speech Synthesis

A Hierarchical F0 Modeling Method for HMM-based Speech Synthesis

Cross-stream Dependency Modeling Using Continuous F0 Model for HMM-based Speech Synthesis

Statistical modeling of syllable-level F0 features for HMM-based unit selection speech synthesis

Modeling F0 Trajectories in Hierarchically Structured Deep Neural Networks.

Amplitude Spectrum Based Excitation Model for HMM-Based Speech Synthesis

Auditive Learning Based Chinese F0 Prediction

A Superposed Prosodic Model for Chinese Text-To-Speech Synthesis

PHMM Based Asynchronous Acoustic Model for Chinese Large Vocabulary Continuous Speech Recognition

Clustering and Feature Learning Based F0 Prediction for Chinese Speech Synthesis

F0 Prediction Model of Speech Synthesis Based on Template and Statistical Method

Multiple Fundamental Frequency Estimation by Modeling Spectral Peaks and Non-Peak Regions

Investigation of Prosodic F0 Layers in Hierarchical F0 Modeling for HMM-based Speech Synthesis

Robust F0 Modeling for Mandarin Speech Recognition in Noise.

A Novel Hybrid Approach for Mandarin Speech Synthesis

Modeling DCT Parameterized F0 Trajectory at Intonation Phrase Level with DNN or Decision Tree

Modeling Glottal Effect on the Spectral Envelop of STRAIGHT Using Mixture of Gaussians

F0 Transformation for Emotional Speech Synthesis Using Target Approximation Features and Bidirectional Associative Memories