Abstract: Spoken term detection (STD) is the task of searching for occurrences of spoken terms in audio archives. It relies on robust confidence estimation to make a hit/false alarm (FA) decision. To optimize this decision with respect to the STD evaluation metric, the confidence has to be discriminative. Multi-layer perceptrons (MLPs) and support vector machines (SVMs) perform well in producing discriminative confidence; however, they are severely limited by their continuous objective functions and are therefore less capable of dealing with complex decision tasks. This leads to a substantial performance reduction in the detection of out-of-vocabulary (OOV) terms, where the high diversity in term properties usually leads to a complicated decision boundary. In this paper we present a new discriminative confidence estimation approach based on evolutionary discriminant analysis (EDA). Unlike MLPs and SVMs, EDA uses the classification error as its objective function, resulting in a model optimized towards the evaluation metric. In addition, EDA combines heterogeneous projection functions and classification strategies in decision making, leading to a highly flexible classifier that is capable of dealing with complex decision tasks. Finally, the evolutionary strategy of EDA reduces the risk of local minima. We tested the EDA-based confidence with a state-of-the-art phoneme-based STD system on an English meeting domain corpus. The STD system employs a phoneme speech recognition system to produce lattices, within which the phoneme sequences corresponding to the enquiry terms are searched. The test corpora comprise 11 h of speech data recorded with individual head-mounted microphones from 30 meetings carried out at several institutes, including ICSI, NIST, ISL, LDC, the Virginia Polytechnic Institute and State University, and the University of Edinburgh.
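The idea of optimizing the classification error directly, rather than a continuous surrogate, can be illustrated with a toy sketch. The code below is not the EDA method itself: it is a minimal elitist (1+1) evolution strategy on a linear classifier, and the names `error_rate`, `evolve` and `make_toy_data`, the synthetic two-feature data, and all parameter values are assumptions made for illustration only.

```python
import random

def error_rate(w, b, data):
    """Fraction of (features, label) pairs misclassified by the rule w.x + b > 0."""
    wrong = 0
    for x, y in data:
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        pred = 1 if score > 0 else 0
        wrong += pred != y
    return wrong / len(data)

def evolve(data, dim, generations=200, sigma=0.5, seed=0):
    """Elitist (1+1) evolution strategy: mutate, keep the candidate if it is no worse."""
    rng = random.Random(seed)
    w = [rng.gauss(0, 1) for _ in range(dim)]
    b = 0.0
    best = error_rate(w, b, data)
    history = [best]
    for _ in range(generations):
        # Gaussian mutation of every parameter; no gradient is needed, so the
        # objective can be the (non-differentiable) classification error itself.
        cand_w = [wi + rng.gauss(0, sigma) for wi in w]
        cand_b = b + rng.gauss(0, sigma)
        err = error_rate(cand_w, cand_b, data)
        if err <= best:
            w, b, best = cand_w, cand_b, err
        history.append(best)
    return w, b, history

def make_toy_data(n=200, seed=1):
    """Hypothetical stand-in for hit (1) / false-alarm (0) feature vectors."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        y = rng.randint(0, 1)
        x = [rng.gauss(1.0 if y else -1.0, 1.0), rng.gauss(0.0, 1.0)]
        data.append((x, y))
    return data

data = make_toy_data()
w, b, history = evolve(data, dim=2)
print(f"initial error {history[0]:.3f} -> final error {history[-1]:.3f}")
```

Because a candidate is accepted only when its error is no higher than the current best, the training error never increases across generations. The approach described in the abstract additionally combines heterogeneous projection functions and classification strategies, which this sketch does not attempt.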
The experimental results demonstrate that EDA considerably outperforms MLPs and SVMs on both classification and confidence measurement in STD, and that the advantage is more pronounced on OOV terms than on in-vocabulary (INV) terms. In terms of classification performance, EDA achieved an equal error rate (EER) of 11% on OOV terms, compared to 34% and 31% with MLPs and SVMs respectively; for INV terms, EDA obtained an EER of 15%, compared to 17% with both MLPs and SVMs. In terms of STD performance on OOV terms, EDA achieved significant relative improvements in average term-weighted value (ATWV) of 1.4% and 2.5% over MLPs and SVMs respectively.
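For readers unfamiliar with the EER figures quoted above, the following is a minimal sketch of how an equal error rate can be estimated from confidence scores. It assumes each putative detection carries a confidence score and a label (1 for a hit, 0 for a false alarm); the function name `equal_error_rate` and the threshold sweep are illustrative choices, not the evaluation code used in the paper.

```python
def equal_error_rate(scores, labels):
    """Approximate EER: the error rate at the threshold where the
    false-acceptance rate and the false-rejection rate are closest."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one hit and one false alarm")
    best_gap, eer = None, None
    # Sweep every observed score as a threshold, plus one above all scores.
    for t in sorted(set(scores)) + [float("inf")]:
        # A detection is accepted when its score is at least the threshold.
        false_accepts = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        false_rejects = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        far = false_accepts / n_neg
        frr = false_rejects / n_pos
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer

# Toy usage with made-up scores: two hits and two false alarms, interleaved.
print(equal_error_rate([0.9, 0.6, 0.4, 0.1], [1, 0, 1, 0]))
```

A lower EER means the confidence separates hits from false alarms more cleanly; a perfectly separable set of scores gives an EER of 0.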

Evolutionary Discriminative Confidence Estimation for Spoken Term Detection
