PB-LRDWWS System for the SLT 2024 Low-Resource Dysarthria Wake-Up Word Spotting Challenge

Shiyao Wang, Jiaming Zhou, Shiwan Zhao, Yong Qin
2024-12-06
Abstract: For the SLT 2024 Low-Resource Dysarthria Wake-Up Word Spotting (LRDWWS) Challenge, we introduce the PB-LRDWWS system. This system combines a dysarthric speech content feature extractor for prototype construction with a prototype-based classification method. The feature extractor is a fine-tuned HuBERT model obtained through a three-stage fine-tuning process using cross-entropy loss. This fine-tuned HuBERT extracts features from the target dysarthric speaker's enrollment speech to build prototypes. Classification is achieved by calculating the cosine similarity between the HuBERT features of the target dysarthric speaker's evaluation speech and prototypes. Despite its simplicity, our method demonstrates effectiveness through experimental results. Our system achieves second place in the final Test-B of the LRDWWS Challenge.
Sound, Audio and Speech Processing
What problem does this paper attempt to address?
This paper aims to develop an effective Wake-Up Word Spotting (WWS) system for individuals with dysarthria under low-resource conditions. In response to the SLT 2024 Low-Resource Dysarthria Wake-Up Word Spotting (LRDWWS) Challenge, it proposes a system named PB-LRDWWS, which is designed to accurately identify the wake-up words of a target dysarthric speaker from only a small amount of enrollment speech.

### Main problems and challenges

1. **The particularity of dysarthria**:
   - The pronunciation patterns of dysarthric speakers vary greatly with factors such as age, cause, severity, and speaking style.
   - These speakers usually have difficulty controlling their articulatory organs, which results in unclear and disfluent speech and makes speech recognition harder.
2. **Low-resource environment**:
   - The challenge provides only limited enrollment speech for training, which places higher demands on the generalization ability and robustness of the model.
   - The enrollment data may be imbalanced between keyword and non-keyword categories, which can lead to overfitting or poor performance.
3. **Real-time operation and accuracy**:
   - A wake-up word detection system needs to recognize keywords with high precision, keeping both the false positive rate and the false negative rate low, while also keeping power consumption low.

### Solutions

The PB-LRDWWS system proposed in the paper combines the following key techniques:

1. **Feature extractor**:
   - Starting from the pre-trained HuBERT model, a feature extractor specialized for dysarthric speech is obtained through a three-stage fine-tuning process:
     - Stage 1: Fine-tune the HuBERT model on non-dysarthric speech (Control data) to build a speaker-independent control model (SIC).
     - Stage 2: Further fine-tune the SIC model on speech from multiple dysarthric speakers (Uncontrol data) to build a speaker-independent dysarthria model (SID).
     - Stage 3: Fine-tune the SID model on speech from the target dysarthric speaker (Target data) to build a speaker-dependent dysarthria model (SDD).
2. **Prototype classification method**:
   - Use the fine-tuned HuBERT model to extract features from the target dysarthric speaker's enrollment speech and build prototypes.
   - At inference time, classify by computing the cosine similarity between the features of the test utterance and each prototype.
3. **Data augmentation and loss function optimization**:
   - Explore several data augmentation techniques (such as generating keyword data with a text-to-speech synthesis system) and different loss function settings (such as CTC loss, cross-entropy loss, and supervised contrastive learning loss) to improve the stability and performance of the model.

### Experimental results

The experimental results show that the PB-LRDWWS system achieves a score of 0.009801 on the final test set Test-B, which is significantly better than the baseline model and placed second in the LRDWWS Challenge.

With these methods, the paper addresses the key problems of dysarthric wake-up word detection under low-resource conditions and offers an effective way to help people with dysarthria use voice-controlled devices.
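
The prototype classification method described in the solutions above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, short toy vectors stand in for real HuBERT features, and averaging the normalized enrollment features of each class is an assumed way to build a prototype, since the summary does not state the exact rule. Only the cosine-similarity comparison comes from the source.

```python
import math
from collections import defaultdict


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def build_prototypes(enroll_feats, enroll_labels):
    """Build one prototype per class from enrollment features.

    Assumption: a prototype is the mean of the normalized features of its class.
    """
    grouped = defaultdict(list)
    for feat, label in zip(enroll_feats, enroll_labels):
        grouped[label].append(l2_normalize(feat))
    prototypes = {}
    for label, feats in grouped.items():
        mean = [sum(col) / len(feats) for col in zip(*feats)]
        prototypes[label] = l2_normalize(mean)
    return prototypes


def cosine_similarity(a, b):
    """Cosine similarity computed as the dot product of unit-length vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def classify(feat, prototypes):
    """Return the label of the most similar prototype and all similarity scores."""
    scores = {label: cosine_similarity(feat, proto)
              for label, proto in prototypes.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Toy 3-dimensional "features"; real utterance embeddings would be much larger.
enroll_feats = [[1.0, 0.1, 0.0], [0.9, 0.0, 0.1],
                [0.0, 1.0, 0.1], [0.1, 0.9, 0.0]]
enroll_labels = ["keyword_a", "keyword_a", "non_keyword", "non_keyword"]

prototypes = build_prototypes(enroll_feats, enroll_labels)
label, scores = classify([0.8, 0.2, 0.0], prototypes)
print(label)
```

In a real system the toy vectors would be replaced by utterance-level features from the fine-tuned HuBERT model. Because adding a new speaker only requires recomputing prototypes from a few enrollment utterances, this kind of classifier suits the low-resource setting described above.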