End-to-End User-Defined Keyword Spotting using Shifted Delta Coefficients

Kesavaraj V, Anuprabha M, Anil Kumar Vuppala
2024-05-23
Abstract: Identifying user-defined keywords is crucial for personalizing interactions with smart devices. Previous approaches to user-defined keyword spotting (UDKWS) have relied on short-term spectral features such as mel frequency cepstral coefficients (MFCC) to detect the spoken keyword. However, these features may struggle to accurately identify audio-text pairs with closely related pronunciations, due to their limited capability in capturing the temporal dynamics of the speech signal. To address this challenge, we propose to use shifted delta coefficients (SDC), which help in capturing pronunciation variability (transitions between connecting phonemes) by incorporating long-term temporal information. The performance of the SDC feature is compared with various baseline features across four different datasets using a cross-attention based end-to-end system. Additionally, various configurations of SDC are explored to find a suitable temporal context for the UDKWS task. The experimental results reveal that the SDC feature outperforms the MFCC baseline feature, exhibiting an improvement of 8.32% in area under the curve (AUC) and 8.69% in equal error rate (EER) on the challenging LibriPhrase-hard dataset. Moreover, the proposed approach demonstrates superior performance when compared to state-of-the-art UDKWS techniques.
Sound, Artificial Intelligence, Audio and Speech Processing
What problem does this paper attempt to address?
This paper aims to improve the accuracy of user-defined keyword spotting (UDKWS), particularly when distinguishing audio-text pairs with similar pronunciations. Specifically:

1. **Limitations of existing methods**: Previous UDKWS methods rely on short-term spectral features, such as mel-frequency cepstral coefficients (MFCC), to detect spoken keywords. These features have limited ability to capture the temporal dynamics of the speech signal, which makes it difficult to accurately identify audio-text pairs with similar pronunciations.
2. **Introduction of shifted delta coefficients (SDC)**: To overcome this limitation, the authors propose using shifted delta coefficients (SDC). By incorporating long-term temporal information, SDC better captures pronunciation variability (i.e., the transitions between connecting phonemes), which improves the ability to distinguish words with similar pronunciations.
3. **Performance verification**: The authors compare SDC with other baseline features (such as MFCC and Mel-spectrogram) on four different datasets, and explore different SDC configurations to find the temporal context best suited to the UDKWS task.
4. **Experimental results**: On the challenging LibriPhrase-hard dataset, the SDC feature improves on the MFCC feature by 8.32% in area under the curve (AUC) and 8.69% in equal error rate (EER). The proposed SDC-based method also outperforms existing state-of-the-art UDKWS techniques.

### Summary

This paper aims to improve the performance of user-defined keyword spotting systems by introducing and tuning the SDC feature, with the largest gains reported for keywords with similar pronunciations.
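To make the idea of "long-term temporal information" concrete, the following is a minimal sketch of the standard SDC computation, not the authors' implementation. SDC is conventionally described by four parameters N-d-P-k: N cepstral coefficients per frame, a delta spread d, a shift P between blocks, and k stacked blocks. For frame t, block i is the difference c(t + iP + d) - c(t + iP - d). The function name, the default values (d=1, P=3, k=7), and the choice to pad edges by repeating the first and last frame are assumptions made for illustration; the configurations explored in the paper are not given in this summary.

```python
import numpy as np

def shifted_delta_coefficients(cepstra, d=1, p=3, k=7):
    """Stack k shifted delta blocks for each frame (hypothetical helper).

    cepstra: array of shape (T, N), e.g. MFCC frames.
    Returns an array of shape (T, N * k).
    """
    cepstra = np.asarray(cepstra, dtype=float)
    num_frames = cepstra.shape[0]

    # Assumption: repeat the edge frames so every shifted index is valid.
    # Original frame t sits at index t + d in the padded array.
    padded = np.pad(cepstra, ((d, (k - 1) * p + d), (0, 0)), mode="edge")

    blocks = []
    for i in range(k):
        start = i * p
        ahead = padded[start + 2 * d : start + 2 * d + num_frames]  # c(t + iP + d)
        behind = padded[start : start + num_frames]                 # c(t + iP - d)
        blocks.append(ahead - behind)

    # Concatenate the k blocks along the feature axis.
    return np.concatenate(blocks, axis=1)


if __name__ == "__main__":
    # Random stand-in for 100 frames of 13-dimensional MFCCs.
    rng = np.random.default_rng(0)
    mfcc = rng.standard_normal((100, 13))
    sdc = shifted_delta_coefficients(mfcc, d=1, p=3, k=7)
    print(sdc.shape)
```

With these assumed defaults, each output frame draws on cepstra spanning (k - 1)P + 2d + 1 = 21 input frames, whereas a plain delta with d=1 spans only 3. This wider context is what lets SDC reflect transitions between neighbouring phonemes.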