Abstract: Spoken term detection (STD) is the task of searching for occurrences of spoken terms in audio archives. It relies on robust confidence estimation to make a hit/false alarm (FA) decision. To optimize this decision with respect to the STD evaluation metric, the confidence has to be discriminative. Multi-layer perceptrons (MLPs) and support vector machines (SVMs) perform well at producing discriminative confidence; however, they are severely limited by their continuous objective functions and are therefore less capable of dealing with complex decision tasks. This leads to a substantial performance reduction when detecting out-of-vocabulary (OOV) terms, where the high diversity in term properties usually leads to a complicated decision boundary. In this paper we present a new discriminative confidence estimation approach based on evolutionary discriminant analysis (EDA). Unlike MLPs and SVMs, EDA uses the classification error as its objective function, resulting in a model optimized towards the evaluation metric. In addition, EDA combines heterogeneous projection functions and classification strategies in decision making, leading to a highly flexible classifier that is capable of dealing with complex decision tasks. Finally, the evolutionary strategy of EDA reduces the risk of local minima. We tested the EDA-based confidence with a state-of-the-art phoneme-based STD system on an English meeting domain corpus. The system employs a phoneme speech recognition system to produce lattices, within which the phoneme sequences corresponding to the enquiry terms are searched. The test corpora comprise 11 h of speech data recorded with individual head-mounted microphones from 30 meetings carried out at several institutes, including ICSI, NIST, ISL, LDC, the Virginia Polytechnic Institute and State University, and the University of Edinburgh.
The experimental results demonstrate that EDA considerably outperforms MLPs and SVMs on both classification and confidence measurement in STD, and that the advantage is more significant on OOV terms than on in-vocabulary (INV) terms. For classification, EDA achieved an equal error rate (EER) of 11% on OOV terms, compared to 34% and 31% with MLPs and SVMs, respectively; for INV terms, EDA obtained an EER of 15%, compared to 17% with both MLPs and SVMs. For STD performance on OOV terms, EDA yielded significant relative improvements of 1.4% and 2.5% in average term-weighted value (ATWV) over MLPs and SVMs, respectively.
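For readers unfamiliar with the equal error rate quoted above, the following is a minimal sketch of how an EER can be computed from confidence scores and hit/FA labels. It is not the paper's evaluation code: the function name `equal_error_rate`, the label convention (1 = hit, 0 = false alarm) and the simple threshold sweep are assumptions made for illustration only.

```python
def equal_error_rate(scores, labels):
    """Approximate the EER by sweeping a decision threshold.

    scores: confidence values, higher means "more likely a hit".
    labels: 1 for a true hit, 0 for a false alarm (assumed convention).
    """
    hits = [s for s, y in zip(scores, labels) if y == 1]
    fas = [s for s, y in zip(scores, labels) if y == 0]
    if not hits or not fas:
        raise ValueError("need at least one hit and one false alarm")

    # Candidate thresholds: every observed score, plus +inf (reject all).
    thresholds = sorted(set(scores)) + [float("inf")]

    best_gap, eer = float("inf"), None
    for t in thresholds:
        # A detection is accepted when its score is >= t.
        miss_rate = sum(s < t for s in hits) / len(hits)
        fa_rate = sum(s >= t for s in fas) / len(fas)
        gap = abs(miss_rate - fa_rate)
        if gap < best_gap:
            # At the closest crossing, report the mean of the two rates.
            best_gap, eer = gap, (miss_rate + fa_rate) / 2
    return eer


if __name__ == "__main__":
    # Toy data, invented for illustration.
    demo_scores = [0.9, 0.6, 0.4, 0.1]
    demo_labels = [1, 0, 1, 0]
    print(equal_error_rate(demo_scores, demo_labels))
```

The sweep only evaluates thresholds at observed scores, so on small sets the result is an approximation of the point where the miss rate and the false alarm rate cross.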

Enhanced Spoken Term Detection Using Support Vector Machines and Weighted Pseudo Examples

Improved spoken term detection using support vector machines with acoustic and context features from pseudo-relevance feedback

Improved Spoken Term Detection by Feature Space Pseudo-Relevance Feedback

Improved Spoken Term Detection by Discriminative Training of Acoustic Models Based on User Relevance Feedback

Improved open-vocabulary spoken content retrieval with word and subword lattices using acoustic feature similarity

Evolutionary Discriminative Confidence Estimation for Spoken Term Detection

Siamese Recurrent Auto-Encoder Representation for Query-by-Example Spoken Term Detection

Query-by-example Spoken Term Detection using Attention-based Multi-hop Networks

Improved spoken term detection using support vector machines based on lattice context consistency

A Study of Discriminatory Speech Classification Based on Improved Smote and SVM-RF

Semantic Query Expansion and Context-Based Discriminative Term Modeling for Spoken Document Retrieval

A Framework Integrating Different Relevance Feedback Scenarios and Approaches for Spoken Term Detection

Query-by-example Spoken Term Detection Based on Phonetic Posteriorgram

Improved Spoken Term Detection with Graph-Based Re-Ranking in Feature Space

Stochastic Pronunciation Modeling for Out-of-Vocabulary Spoken Term Detection

BEST-STD: Bidirectional Mamba-Enhanced Speech Tokenization for Spoken Term Detection

Improved Semantic Retrieval of Spoken Content by Document/Query Expansion with Random Walk Over Acoustic Similarity Graphs

Graph-based re-ranking using acoustic feature similarity between search results for spoken term detection on low-resource languages

Improved Semantic Retrieval of Spoken Content by Language Models Enhanced with Acoustic Similarity Graph

Feature Analysis for Discriminative Confidence Estimation in Spoken Term Detection

Query-by-Example Spoken Term Detection using Attentive Pooling Networks