Abstract: Identifying user-defined keywords is crucial for personalizing interactions with smart devices. Previous approaches to user-defined keyword spotting (UDKWS) have relied on short-term spectral features such as mel frequency cepstral coefficients (MFCC) to detect the spoken keyword. However, these features can struggle to accurately identify audio-text pairs with closely related pronunciations, because of their limited capability to capture the temporal dynamics of the speech signal. To address this challenge, we propose to use shifted delta coefficients (SDC), which help capture pronunciation variability (the transitions between connecting phonemes) by incorporating long-term temporal information. The performance of the SDC feature is compared with various baseline features across four different datasets using a cross-attention based end-to-end system. Additionally, various configurations of SDC are explored to find a suitable temporal context for the UDKWS task. The experimental results reveal that the SDC feature outperforms the MFCC baseline feature, exhibiting an improvement of 8.32% in area under the curve (AUC) and 8.69% in terms of equal error rate (EER) on the challenging LibriPhrase-hard dataset. Moreover, the proposed approach demonstrated superior performance when compared to state-of-the-art UDKWS techniques.
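To make the idea of shifted delta coefficients concrete, the following is a minimal NumPy sketch of the commonly used N-d-P-k formulation, in which k delta blocks, each computed over a spread of ±d frames and spaced P frames apart, are stacked for every frame. The function name, the default values `d=1, p=3, k=7`, and the edge-padding strategy are illustrative assumptions; the abstract does not state which configuration or boundary handling the authors use.

```python
import numpy as np


def shifted_delta_coefficients(cepstra, d=1, p=3, k=7):
    """Stack k shifted delta blocks for each frame (illustrative sketch).

    cepstra: array of shape (num_frames, num_coeffs), e.g. MFCCs.
    d: delta spread in frames; p: shift between blocks; k: number of blocks.
    Returns an array of shape (num_frames, num_coeffs * k).

    Block i at frame t is c(t + i*p + d) - c(t + i*p - d).
    The defaults are an assumed example, not the paper's setting.
    """
    cepstra = np.asarray(cepstra, dtype=float)
    num_frames = cepstra.shape[0]

    # Repeat the first/last frame so every index t + i*p +/- d is valid
    # (assumed boundary handling).
    pad_before = d
    pad_after = (k - 1) * p + d
    padded = np.pad(cepstra, ((pad_before, pad_after), (0, 0)), mode="edge")

    blocks = []
    for i in range(k):
        # Original frame j sits at padded index j + d.
        ahead = padded[i * p + 2 * d : i * p + 2 * d + num_frames]  # c(t + i*p + d)
        behind = padded[i * p : i * p + num_frames]                 # c(t + i*p - d)
        blocks.append(ahead - behind)

    return np.concatenate(blocks, axis=1)
```

With these assumed defaults, each output frame draws on roughly 20 input frames of context, which is how the stacked representation reaches beyond a single short-term analysis window.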

What problem does this paper attempt to address?

This paper aims to improve the accuracy of user-defined keyword spotting (UDKWS), particularly when distinguishing audio-text pairs with similar pronunciations. Specifically:

1. **Limitations of existing methods**: Traditional UDKWS methods rely on short-term spectral features, such as mel-frequency cepstral coefficients (MFCC), to detect spoken keywords. These features have limited ability to capture the temporal dynamics of speech signals, which makes it difficult to accurately identify audio-text pairs with similar pronunciations.
2. **Introduction of shifted delta coefficients (SDC)**: To overcome this limitation, the authors propose using shifted delta coefficients (SDC). By incorporating long-term temporal information, SDC better captures pronunciation variability (that is, the transitions between connected phonemes), which improves the ability to distinguish words with similar pronunciations.
3. **Performance verification**: The authors compare SDC with other baseline features (such as MFCC and the Mel-spectrogram) on four different datasets, and explore different SDC configurations to find the temporal context best suited to the UDKWS task.
4. **Experimental results**: On the challenging LibriPhrase-hard dataset, the SDC feature improves on the MFCC feature by 8.32% in area under the curve (AUC) and 8.69% in equal error rate (EER). The proposed SDC-based method also outperforms existing state-of-the-art UDKWS techniques.

### Summary

The paper aims to improve user-defined keyword spotting by introducing the SDC feature and tuning its configuration, which improves the accuracy and robustness of the system, particularly for keywords with similar pronunciations.
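For readers unfamiliar with the two reported metrics, the sketch below shows one simple way AUC and EER can be computed from detection scores and binary match labels. It is an illustrative NumPy implementation under assumed conventions (higher score means "keyword present", and EER is taken at the threshold where false-acceptance and false-rejection rates are closest); the function name `auc_and_eer` is hypothetical and this is not the authors' evaluation code.

```python
import numpy as np


def auc_and_eer(scores, labels):
    """Return (AUC, EER) for detection scores and binary labels (sketch).

    scores: higher means the audio is more likely to match the keyword text.
    labels: 1/True for matching audio-text pairs, 0/False otherwise.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]

    # AUC as the probability that a random positive outscores a random
    # negative, counting ties as one half.
    diff = pos[:, None] - neg[None, :]
    auc = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

    # Sweep every observed score as a decision threshold.
    thresholds = np.unique(scores)
    far = np.array([(neg >= th).mean() for th in thresholds])  # false acceptance rate
    frr = np.array([(pos < th).mean() for th in thresholds])   # false rejection rate

    # EER: operating point where the two error rates are closest.
    idx = np.argmin(np.abs(far - frr))
    eer = 0.5 * (far[idx] + frr[idx])
    return auc, eer
```

A higher AUC and a lower EER both indicate better separation between matching and non-matching audio-text pairs, which is why the two numbers move in opposite directions when a feature improves.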

End-to-End User-Defined Keyword Spotting using Shifted Delta Coefficients
