Disentangled Adversarial Domain Adaptation for Phonation Mode Detection in Singing and Speech

Yixin Wang, Wei Wei, Xiangming Gu, Xiaohong Guan, Ye Wang
DOI: https://doi.org/10.1109/TASLP.2023.3317568
2023-01-01
Abstract: Phonation mode detection predicts phonation modes and their temporal boundaries in singing and speech, holding promise for characterizing voice quality and vocal health. However, the task is challenging because of domain disparities between the training data and unannotated real-world recordings. To tackle this problem, we develop a disentangled adversarial domain adaptation network, which adapts a phonation mode detection model, a convolutional recurrent neural network pre-trained on the source domain, to a target domain that has no phonation mode labels. Based on our curated sung and spoken dataset for phonation mode detection, we demonstrate that both subject mismatch and singing-speech mismatch degrade performance. By disentangling domain-invariant phonation mode embeddings from domain-specific embeddings, our method greatly enhances the effectiveness and explainability of unsupervised adversarial domain adaptation. Experiments show that the performance drop caused by subject mismatch is alleviated via adaptation, improving the F-score by 44.7% and 6.8% for singing and speech, respectively. The singing-to-speech domain adaptation experiment indicates that a model trained on singing data can be adapted to speech, yielding an F-score of 0.56, commensurate with the F-score of 0.59 achieved by a model trained exclusively on speech data. By further investigating the disentangled embeddings, we find that the phonation mode feature shared by singing and speech is invariant to pitch. These results inspire reliable and versatile applications in voice quality evaluation and paralinguistic information retrieval.
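To put the singing-to-speech result in perspective, the gap between the two reported F-scores can be worked out directly. The following is a minimal sketch: only the two F-scores (0.56 and 0.59) come from the abstract, while the helper name `relative_gap` and the variable names are illustrative assumptions, not part of the paper.

```python
def relative_gap(adapted: float, reference: float) -> float:
    """Return the fraction by which `adapted` falls short of `reference`."""
    return (reference - adapted) / reference


# F-scores reported in the abstract (the variable names are hypothetical).
adapted_f = 0.56      # model trained on singing, then adapted to speech
speech_only_f = 0.59  # model trained exclusively on speech

gap = relative_gap(adapted_f, speech_only_f)
print(f"absolute gap: {speech_only_f - adapted_f:.2f}")
print(f"relative gap: {gap:.1%}")
```

Under these numbers the adapted model trails the speech-only model by 0.03 in absolute F-score, roughly 5% in relative terms, which is the sense in which the two results are described as commensurate.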