Abstract: Phonation mode detection predicts phonation modes and their temporal boundaries in singing and speech, and holds promise for characterizing voice quality and vocal health. However, it is challenging because of the domain disparities between training data and unannotated real-world recordings. To tackle this problem, we develop a disentangled adversarial domain adaptation network that adapts a phonation mode detection model, a convolutional recurrent neural network pre-trained on the source domain, to a target domain without phonation mode labels. Using our curated dataset of sung and spoken recordings for phonation mode detection, we demonstrate that both the subject mismatch and the singing-speech mismatch cause performance decline. By disentangling domain-invariant phonation mode embeddings from domain-specific embeddings, our method greatly enhances the effectiveness and explainability of unsupervised adversarial domain adaptation. Experiments show that adaptation alleviates the performance drop caused by the subject mismatch, improving the F-score by 44.7% for singing and 6.8% for speech. The singing-to-speech domain adaptation experiment indicates that a model trained on singing data can be adapted to speech, yielding an F-score of 0.56, comparable to the F-score of 0.59 achieved by a model trained exclusively on speech data. By further investigating the disentangled embeddings, we find that the phonation mode feature shared by singing and speech is invariant to pitch. These results point toward reliable and versatile applications in voice quality evaluation and paralinguistic information retrieval.
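
The abstract reports results as F-scores without saying how they are computed. As a rough illustration only, the sketch below computes a frame-wise, macro-averaged F-score between a reference and a predicted label sequence. The frame-wise and macro-averaged choices, the function name `macro_f_score`, and the label names (`breathy`, `neutral`, `pressed`) are assumptions made for this example; the paper may use a different, for instance event-based, definition.

```python
from collections import Counter


def macro_f_score(reference, predicted):
    """Frame-wise macro-averaged F-score over all labels seen in either sequence.

    Assumption: one label per frame and equal-length sequences. This is an
    illustrative metric, not necessarily the one used in the paper.
    """
    if len(reference) != len(predicted):
        raise ValueError("sequences must have the same number of frames")

    tp, fp, fn = Counter(), Counter(), Counter()
    for ref, hyp in zip(reference, predicted):
        if ref == hyp:
            tp[ref] += 1
        else:
            fn[ref] += 1  # the true label was missed on this frame
            fp[hyp] += 1  # the predicted label was a false alarm

    labels = sorted(set(reference) | set(predicted))
    scores = []
    for label in labels:
        # Per-class F-score written as 2TP / (2TP + FP + FN).
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical frame labels, invented purely for illustration.
reference = ["breathy", "breathy", "neutral", "neutral", "pressed", "pressed"]
predicted = ["breathy", "neutral", "neutral", "neutral", "pressed", "breathy"]
score = macro_f_score(reference, predicted)
print(f"macro F-score: {score:.3f}")
```

Macro averaging weights every phonation mode equally regardless of how many frames it covers, which matters when one mode dominates a recording.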

Unsupervised Adaptation with Adversarial Dropout Regularization for Robust Speech Recognition

Unsupervised Adaptation with Domain Separation Networks for Robust Speech Recognition

Unsupervised Domain Adaptation for Robust Speech Recognition via Variational Autoencoder-Based Data Augmentation

Unsupervised Domain Adaptation for Dysarthric Speech Detection via Domain Adversarial Training and Mutual Information Minimization

Disentangled Adversarial Domain Adaptation for Phonation Mode Detection in Singing and Speech

Self-Supervised Learning Based Domain Adaptation for Robust Speaker Verification

Learning Invariant Representation and Risk Minimized for Unsupervised Accent Domain Adaptation

Channel-Aware Domain-Adaptive Generative Adversarial Network for Robust Speech Recognition

Cross-lingual Text-independent Speaker Verification using Unsupervised Adversarial Discriminative Domain Adaptation

Adversarial Teacher-Student Learning for Unsupervised Domain Adaptation

Don't Stop Self-Supervision: Accent Adaptation of Speech Representations via Residual Adapters

Batch Normalization based Unsupervised Speaker Adaptation for Acoustic Models

Unsupervised Regularization-Based Adaptive Training for Speech Recognition

Self-Taught Recognizer: Toward Unsupervised Adaptation for Speech Foundation Models

Boosting Noise Robustness of Acoustic Model via Deep Adversarial Training

Domain Adaptation Using Suitable Pseudo Labels for Speech Enhancement and Dereverberation

Learning not to Discriminate: Task Agnostic Learning for Improving Monolingual and Code-switched Speech Recognition

DropClass and DropAdapt: Dropping classes for deep speaker representation learning

Multi-Domain Adaptation by Self-Supervised Learning for Speaker Verification

Adversarial Learning of Raw Speech Features for Domain Invariant Speech Recognition

Speaker-Invariant Training Via Adversarial Learning