Semi-Supervised Cognitive State Classification from Speech with Multi-View Pseudo-Labeling

Yuanchao Li, Zixing Zhang, Jing Han, Peter Bell, Catherine Lai
2024-09-27
Abstract: The lack of labeled data is a common challenge in speech classification tasks, particularly those requiring extensive subjective assessment, such as cognitive state classification. In this work, we propose a Semi-Supervised Learning (SSL) framework, introducing a novel multi-view pseudo-labeling method that leverages both acoustic and linguistic characteristics to select the most confident data for training the classification model. Acoustically, unlabeled data are compared to labeled data using the Fréchet audio distance, calculated from embeddings generated by multiple audio encoders. Linguistically, large language models are prompted to revise automatic speech recognition transcriptions and predict labels based on our proposed task-specific knowledge. High-confidence data are identified when pseudo-labels from both sources align, while mismatches are treated as low-confidence data. A bimodal classifier is then trained to iteratively label the low-confidence data until a predefined criterion is met. We evaluate our SSL framework on emotion recognition and dementia detection tasks. Experimental results demonstrate that our method achieves competitive performance compared to fully supervised learning using only 30% of the labeled data and significantly outperforms two selected baselines.
Audio and Speech Processing, Artificial Intelligence, Computation and Language, Multimedia, Sound
What problem does this paper attempt to address?
The paper addresses the shortage of labeled data in speech classification, particularly in cognitive state classification, where labeling requires extensive subjective assessment. It proposes a Semi-Supervised Learning (SSL) framework built around a multi-view pseudo-labeling method, which uses both the acoustic and the linguistic characteristics of unlabeled data to select the most confident samples for model training. The aim is to reduce the dependency on large amounts of labeled data and to improve model performance in tasks such as emotion recognition and dementia detection.

The main contributions of the paper include:

1. **Proposing a novel semi-supervised learning framework**: The framework incorporates a multi-view pseudo-labeling method that uses acoustic and linguistic features to select high-quality unlabeled data.
2. **Using Fréchet Audio Distance (FAD)**: As a reference-free method, FAD is used to cluster unlabeled data based on their acoustic similarity to labeled data.
3. **Utilizing task-specific prompts to predict labels**: Large language models are prompted to revise automatic speech recognition (ASR) transcriptions and predict labels, guided by task-specific knowledge covering acoustic, linguistic, and psychological aspects.
4. **Exploring various fusion methods**: A bimodal classifier is built within the semi-supervised learning framework to enhance model performance.

Experimental results show that the method achieves performance comparable to fully supervised learning while using only 30% of the labeled data, and that it significantly outperforms the two selected baselines. This is validated on emotion recognition and dementia detection tasks.
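
The agreement-based selection described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each utterance is represented by a `(frames, dim)` array of embeddings from a single audio encoder (the paper uses multiple encoders), that the acoustic pseudo-label is the class whose labeled embeddings give the smallest Fréchet distance, and that the linguistic labels have already been produced by a language model. The function names `frechet_distance`, `acoustic_pseudo_label`, and `split_by_agreement` are hypothetical.

```python
import numpy as np


def frechet_distance(x, y):
    """Fréchet distance between Gaussians fitted to two embedding sets.

    x, y: arrays of shape (n_frames, dim) with dim >= 2.
    Computes ||mu_x - mu_y||^2 + Tr(C_x + C_y - 2 (C_x C_y)^{1/2}),
    the quantity commonly reported as FAD.
    """
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.cov(x, rowvar=False)
    cov_y = np.cov(y, rowvar=False)
    # Tr((C_x C_y)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of C_x @ C_y; clip tiny negative values from round-off.
    eigvals = np.linalg.eigvals(cov_x @ cov_y)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_x - mu_y
    return float(diff @ diff + np.trace(cov_x) + np.trace(cov_y) - 2.0 * tr_sqrt)


def acoustic_pseudo_label(utt_frames, class_frames):
    """Return the class whose labeled embeddings are closest to the utterance.

    class_frames: dict mapping class name -> (n_frames, dim) array
    (an assumed data layout, not taken from the paper).
    """
    return min(class_frames, key=lambda c: frechet_distance(utt_frames, class_frames[c]))


def split_by_agreement(acoustic_labels, linguistic_labels):
    """Split utterances into high- and low-confidence sets.

    An utterance is high-confidence when both views assign the same label;
    otherwise it is low-confidence and left for iterative relabeling.
    """
    high, low = {}, []
    for utt_id, a_label in acoustic_labels.items():
        if linguistic_labels.get(utt_id) == a_label:
            high[utt_id] = a_label
        else:
            low.append(utt_id)
    return high, low
```

In this sketch, the `high` dictionary would seed the training set of the bimodal classifier, while the utterances in `low` would be relabeled by that classifier in later iterations until the stopping criterion is met.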