Abstract: For the SLT 2024 Low-Resource Dysarthria Wake-Up Word Spotting (LRDWWS) Challenge, we introduce the PB-LRDWWS system. This system combines a dysarthric speech content feature extractor for prototype construction with a prototype-based classification method. The feature extractor is a fine-tuned HuBERT model obtained through a three-stage fine-tuning process using cross-entropy loss. This fine-tuned HuBERT extracts features from the target dysarthric speaker's enrollment speech to build prototypes. Classification is achieved by calculating the cosine similarity between the HuBERT features of the target dysarthric speaker's evaluation speech and prototypes. Despite its simplicity, our method demonstrates effectiveness through experimental results. Our system achieves second place in the final Test-B of the LRDWWS Challenge.

What problem does this paper attempt to address?

The paper aims to develop an effective wake-up word spotting (WWS) system for individuals with dysarthria under low-resource conditions. Specifically, in response to the SLT 2024 Low-Resource Dysarthria Wake-Up Word Spotting (LRDWWS) Challenge, it proposes a system named PB-LRDWWS, which is meant to accurately identify the wake-up words of a target dysarthric speaker from a limited amount of enrollment speech.

### Main problems and challenges

1. **The particularity of dysarthria**:
   - The pronunciation patterns of people with dysarthria vary greatly with factors such as age, cause, severity, and speaking style.
   - These speakers usually have difficulty controlling their articulatory organs, which results in unclear, disfluent speech and makes speech recognition harder.
2. **Low-resource environment**:
   - The challenge provides only limited enrollment speech for adapting the model, which places higher demands on the model's generalization ability and robustness.
   - The enrollment speech may be imbalanced between keyword and non-keyword categories, which can lead to overfitting or poor performance.
3. **Real-time operation and accuracy**:
   - A wake-up word detection system must recognize keywords accurately, keeping both the false positive rate and the false negative rate low, while consuming little power.

### Solutions

The PB-LRDWWS system proposed in the paper combines the following key techniques:

1. **Feature extractor**:
   - Start from the pre-trained HuBERT model and obtain a feature extractor specialized for dysarthric speech through a three-stage fine-tuning process:
     - Stage 1: Fine-tune the HuBERT model on non-dysarthric speech (Control data) to build a speaker-independent control model (SIC).
     - Stage 2: Further fine-tune the SIC model on speech from multiple dysarthric speakers (Uncontrol data) to build a speaker-independent dysarthric model (SID).
     - Stage 3: Fine-tune the SID model on speech from the target dysarthric speaker (Target data) to build a speaker-dependent dysarthric model (SDD).
2. **Prototype classification method**:
   - Use the fine-tuned HuBERT model to extract features from the target dysarthric speaker's enrollment speech and build prototypes.
   - At inference time, classify each test utterance by computing the cosine similarity between its features and each prototype.
3. **Data augmentation and loss function optimization**:
   - Explore several data augmentation techniques (such as using a text-to-speech system to generate keyword data) and different loss function settings (such as CTC loss, cross-entropy loss, and supervised contrastive learning loss) to improve the stability and performance of the model.

### Experimental results

The experimental results show that the PB-LRDWWS system achieves a score of only 0.009801 on the final test set Test-B, which is significantly better than the baseline model and places second in the LRDWWS Challenge.

With these methods, the paper addresses the key problems of dysarthric wake-up word detection under low-resource conditions and offers an effective way to help people with dysarthria use voice-controlled devices.
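The prototype classification step described in the summary can be illustrated with a minimal sketch. It rests on several assumptions that the summary does not confirm: each utterance is represented by a single fixed-length embedding (for example, pooled HuBERT frame features), a prototype is the mean of the enrollment embeddings for one class, and the predicted class is the prototype with the highest cosine similarity. The function names (`build_prototypes`, `classify`) and the toy three-dimensional vectors are hypothetical stand-ins, not the authors' code or real HuBERT features.

```python
import math
from collections import defaultdict


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector has no direction."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def build_prototypes(enrollment):
    """Average the enrollment embeddings of each class (assumed prototype rule).

    enrollment: iterable of (label, embedding) pairs.
    """
    grouped = defaultdict(list)
    for label, emb in enrollment:
        grouped[label].append(emb)
    prototypes = {}
    for label, embs in grouped.items():
        dim = len(embs[0])
        prototypes[label] = [sum(e[i] for e in embs) / len(embs) for i in range(dim)]
    return prototypes


def classify(embedding, prototypes):
    """Return the label of the most similar prototype and all similarity scores."""
    scores = {label: cosine_similarity(embedding, proto)
              for label, proto in prototypes.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Toy vectors standing in for utterance-level features of one target speaker.
enrollment = [
    ("keyword_a", [1.0, 0.1, 0.0]),
    ("keyword_a", [0.9, 0.0, 0.1]),
    ("filler", [0.0, 1.0, 0.2]),
    ("filler", [0.1, 0.9, 0.0]),
]
prototypes = build_prototypes(enrollment)
label, scores = classify([0.8, 0.2, 0.0], prototypes)
print(label)  # keyword_a
```

Because the prototypes are built only from the target speaker's enrollment utterances, adding a new speaker in this sketch requires no further training of the classifier itself, only a new set of prototypes.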

PB-LRDWWS System for the SLT 2024 Low-Resource Dysarthria Wake-Up Word Spotting Challenge

Optimizing Dysarthria Wake-Up Word Spotting: An End-to-End Approach for SLT 2024 LRDWWS Challenge

Enhancing Dysarthric Speech Recognition for Unseen Speakers via Prototype-Based Adaptation

The NPU System for the 2020 Personalized Voice Trigger Challenge

A Strategic Approach for Robust Dysarthric Speech Recognition

Enhancing Voice Wake-Up for Dysarthria: Mandarin Dysarthria Speech Corpus Release and Customized System Design

The USTC System for Blizzard Machine Learning Challenge 2017-ES2

The USYD-JD Speech Translation System for IWSLT 2021

XWSB: A Blend System Utilizing XLS-R and WavLM with SLS Classifier detection system for SVDD 2024 Challenge

The Iflytek System for Blizzard Machine Learning Challenge 2017-ES1

Transsion TSUP's speech recognition system for ASRU 2023 MADASR Challenge

The NPU-ASLP System for The ISCSLP 2022 Magichub Code-Swiching ASR Challenge

WavLLM: Towards Robust and Adaptive Speech Large Language Model

The X-LANCE Technical Report for Interspeech 2024 Speech Processing Using Discrete Speech Unit Challenge

Two-stage and Self-supervised Voice Conversion for Zero-Shot Dysarthric Speech Reconstruction

The NLPR Speech Synthesis Entry for Blizzard Challenge 2020

The DKU Post-Challenge Audio-Visual Wake Word Spotting System for the 2021 MISP Challenge: Deep Analysis

Exploiting Audio-Visual Features with Pretrained AV-HuBERT for Multi-Modal Dysarthric Speech Reconstruction

Electrolaryngeal Speech Intelligibility Enhancement Through Robust Linguistic Encoders

Automatic Conversion from Lexical Words to Prosodic Words for Mandarin Text-to-speech System