Abstract: Keyword search (KWS) is the task of finding user-specified keywords in continuous speech. Conventional KWS systems are built on automatic speech recognition (ASR): the input speech must first be processed by an ASR system before the keywords can be searched for. Over the past decade, as deep learning and deep neural networks (DNNs) have become increasingly popular, it has become possible to train KWS systems in an end-to-end (E2E) manner. The main advantage of E2E KWS is that it requires no speech recognition, which makes training and searching much more straightforward than in conventional systems. This article proposes an E2E KWS model that consists of four parts: a speech encoder-decoder, a query encoder-decoder, an attention mechanism, and an energy scorer. First, the proposed model outperforms the baseline model. Second, we find that under different kinds of supervision (character or phoneme sequences), the speech or query encoders extract the corresponding information, which leads to different performance. Moreover, we introduce an attention mechanism and a novel energy scorer: the former helps locate keywords, while the latter makes the final decision by considering speech embeddings, query embeddings, and attention weights in parallel. We evaluate the model under low-resource conditions, with about 10 hours of training data, for four different languages. The experimental results show that the proposed model works well under low-resource conditions.

End-to-End Topic Classification Without ASR

End-to-end Speech Topic Classification Based on Pre-Trained Model Wavlm

Topic Classification on Spoken Documents Using Deep Acoustic and Linguistic Features

End-to-end keywords spotting based on connectionist temporal classification for Mandarin

End-to-end Contextual Speech Recognition Using Class Language Models and a Token Passing Decoder

Cascaded CNN-resBiLSTM-CTC: an End-to-End Acoustic Model for Speech Recognition

Towards Unsupervised Speech Recognition Without Pronunciation Models

End-to-End Mandarin Tone Classification with Short Term Context Information

Modular End-to-End Automatic Speech Recognition Framework for Acoustic-to-Word Model

End-to-End Architectures for Speech Recognition

End-to-end Monaural Multi-speaker ASR System Without Pretraining

Speech Topic Classification Based on Pre-trained and Graph Networks

End-to-End Speech Recognition with Pre-trained Masked Language Model

Learning the Front-End Speech Feature with Raw Waveform for End-to-End Speaker Recognition

Non-Autoregressive End-To-End Automatic Speech Recognition Incorporating Downstream Natural Language Processing

Speech Recognition for Air Traffic Control Via Feature Learning and End-to-end Training

Automatic speech recognition based on time domain modeling

Non-autoregressive End-to-end Approaches for Joint Automatic Speech Recognition and Spoken Language Understanding

On Modular Training of Neural Acoustics-to-Word Model for LVCSR

End-to-end keyword search system based on attention mechanism and energy scorer for low resource languages

CAT: CRF-based ASR Toolkit