Abstract: This paper addresses a persistent challenge in Keyword Spotting (KWS), a fundamental component of speech technology: acquiring enough labeled data for training. Positive samples are difficult to obtain in large quantities, and new target samples must be laboriously collected whenever the keyword changes. We therefore introduce a novel approach that combines unsupervised contrastive learning with an augmentation-based technique. Our method lets the neural network train on unlabeled data sets, potentially improving performance on downstream tasks where labeled data is limited. We also argue that speech utterances containing the same keyword should share similar high-level feature representations despite variations in speed or volume. To achieve this, we present a speech augmentation-based unsupervised learning method that uses the similarity between the bottleneck layer feature and the audio reconstruction information for auxiliary training. Furthermore, we propose a compressed convolutional architecture to address redundant and non-informative content in KWS tasks, enabling the model to learn local features while also attending to long-term information. The method achieves strong performance on the Google Speech Commands V2 Dataset. Inspired by recent advances in sign spotting and spoken term detection, our work underlines the potential of contrastive learning in KWS and the advantages of Query-by-Example Spoken Term Detection strategies. The presented CAB-KWS provides new perspectives for the field of KWS, demonstrating effective ways to reduce data collection effort and increase system robustness.
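
The idea that two augmented versions of the same utterance should map to similar representations can be illustrated with a minimal sketch. The abstract does not specify the exact loss or augmentation pipeline, so everything here is an assumption made for illustration: the helper names (`change_volume`, `change_speed`, `nt_xent`) are hypothetical, the NT-Xent objective is a commonly used contrastive loss substituted in place of the paper's unspecified one, and the encoder that would turn waveforms into embeddings is omitted entirely.

```python
import math


def change_volume(wave, gain):
    """Volume augmentation: scale every sample by a constant gain."""
    return [gain * x for x in wave]


def change_speed(wave, rate):
    """Speed augmentation by linear-interpolation resampling.

    rate > 1 yields a shorter (faster) clip, rate < 1 a longer one.
    """
    n_out = max(2, int(round(len(wave) / rate)))
    out = []
    for i in range(n_out):
        pos = i * (len(wave) - 1) / (n_out - 1)
        lo = int(math.floor(pos))
        hi = min(lo + 1, len(wave) - 1)
        frac = pos - lo
        out.append((1 - frac) * wave[lo] + frac * wave[hi])
    return out


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def nt_xent(view_a, view_b, temperature=0.5):
    """NT-Xent contrastive loss over two batches of embeddings.

    view_a[i] and view_b[i] are assumed to be embeddings of two
    augmentations of the same utterance (the positive pair); every
    other embedding in the combined batch acts as a negative.
    """
    z = view_a + view_b
    n = len(view_a)
    total = 0.0
    for i in range(2 * n):
        j = (i + n) % (2 * n)  # index of the positive partner
        logits = [cosine(z[i], z[k]) / temperature
                  for k in range(2 * n) if k != i]
        positive = cosine(z[i], z[j]) / temperature
        peak = max(logits)  # subtract the max for numerical stability
        log_denominator = peak + math.log(
            sum(math.exp(value - peak) for value in logits))
        total += log_denominator - positive
    return total / (2 * n)
```

In a full pipeline, each unlabeled utterance would be augmented twice (for example, once with `change_speed` and once with `change_volume`), both versions would pass through the encoder, and `nt_xent` would be minimized over the resulting embeddings. The loss is lower when matching views sit closer together than non-matching ones, which is the behavior the abstract asks of the learned representation.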

Automatic detection of contrastive word pairs using textual and acoustic features

Detection and Emphatic Realization of Contrastive Word Pairs for Expressive Text-to-speech Synthesis

Utterance-Based Audio Sentiment Analysis Learned by a Parallel Combination of CNN and LSTM

Emphasis Detection for Voice Dialogue Applications Using Multi-channel Convolutional Bidirectional Long Short-Term Memory Network

Using Conditional Random Fields to Predict Focus Word Pair in Spontaneous Spoken English

Acoustic Word Embedding System for Code-Switching Query-by-example Spoken Term Detection

Improved spoken term detection using support vector machines with acoustic and context features from pseudo-relevance feedback

A Study of Discriminatory Speech Classification Based on Improved Smote and SVM-RF

Speech Emotion Recognition Based on Linear Discriminant Analysis and Support Vector Machine Decision Tree

Acoustic features prominence based Chinese question detection

Audio-visual Keyword Spotting for Mandarin Based on Discriminative Local Spatial-Temporal Descriptors

Multi-Modal Automatic Prosody Annotation with Contrastive Pretraining of SSWP

Speaker-Text Retrieval via Contrastive Learning

Contrastive Augmentation: An Unsupervised Learning Approach for Keyword Spotting in Speech Technology

Learning Contextual Representation with Convolution Bank and Multi-head Self-attention for Speech Emphasis Detection

Automatic Error Detection For Unit Selection Speech Synthesis Using Log Likelihood Ratio Based Svm Classifier

Objective Evaluation Methods for Chinese Text-To-Speech Systems

Contrastive Regularization for Multimodal Emotion Recognition Using Audio and Text

Visual Features Extracting & Selecting For Lipreading

Automatic Pitch Accent Detection Using Auto-Context with Acoustic Features

A Scheme Discriminating Between Synthetic Speech and Normal Speech