Abstract: Spoken term detection (STD) is a key technology for retrieving spoken content and will be very important for retrieving and browsing multimedia content over the Internet. The discriminative capability of machine learning methods has recently been used to facilitate STD. This paper presents a new approach to improving STD using support vector machines (SVMs) based on acoustic information. The concept of pseudo-relevance feedback (PRF), widely used in text, image and video retrieval, is applied here. The basic idea is to assume that some spoken segments in the first-pass retrieved results are relevant (pseudo-relevant) and some others irrelevant (pseudo-irrelevant), and to take these segments as positive and negative examples for training a query-specific SVM. This SVM is then used to re-rank the first-pass retrieved results, and only the re-ranked results are shown to the user. In this paper, feature vectors that represent the spoken segments based on acoustic information, for use in the SVM, are considered and analyzed. Furthermore, PRF conventionally takes the items with the highest and lowest scores in the first-pass retrieved results as pseudo-relevant and pseudo-irrelevant, respectively, but in this way some incorrect examples are inevitably included in the training data, especially when the recognition accuracy is poor. We therefore further propose an enhanced SVM that not only selects positive and negative examples better by considering the reliability of the spoken segments, but also places more emphasis on the more reliable training examples by modifying the SVM formulation. Experiments on two different sets of spoken archives with different speaking styles and different levels of recognition accuracy demonstrated significant improvements offered by the proposed approaches.
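The re-ranking idea described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the 2-D feature vectors, the linear kernel, the subgradient-descent trainer, the reliability weighting (distance of the first-pass score from the opposite end of the ranking), and all function names are assumptions made for illustration, since the abstract specifies none of them. The per-example weights scale the hinge-loss penalty, which is one common way to emphasize more reliable training examples.

```python
def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))

def train_weighted_linear_svm(X, y, weights, lam=0.01, eta=0.1, epochs=100):
    """Subgradient descent on lam/2*||w||^2 + sum_i weights[i]*hinge_i.

    Hypothetical trainer: each example's hinge penalty is scaled by its
    reliability weight, so unreliable pseudo-examples pull less on (w, b).
    """
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi, ci in zip(X, y, weights):
            margin = yi * (dot(w, xi) + b)
            w = [(1.0 - eta * lam) * wj for wj in w]  # L2 shrinkage
            if margin < 1.0:                          # hinge loss is active
                step = eta * ci * yi
                w = [wj + step * xj for wj, xj in zip(w, xi)]
                b += step
    return w, b

def rerank_with_prf(features, first_pass_scores, n_pos=2, n_neg=2):
    """Return segment indices re-ranked by a query-specific SVM."""
    ranked = sorted(range(len(features)),
                    key=lambda i: first_pass_scores[i], reverse=True)
    pos, neg = ranked[:n_pos], ranked[-n_neg:]  # pseudo-relevant / -irrelevant
    hi, lo = first_pass_scores[ranked[0]], first_pass_scores[ranked[-1]]
    span = (hi - lo) or 1.0
    # Assumed reliability weights: higher first-pass score means a more
    # reliable positive, lower score means a more reliable negative.
    weights = ([(first_pass_scores[i] - lo) / span for i in pos] +
               [(hi - first_pass_scores[i]) / span for i in neg])
    X = [features[i] for i in pos + neg]
    y = [1] * len(pos) + [-1] * len(neg)
    w, b = train_weighted_linear_svm(X, y, weights)
    svm_scores = [dot(w, x) + b for x in features]
    return sorted(range(len(features)),
                  key=lambda i: svm_scores[i], reverse=True)

# Toy data: segment 2 is a false alarm with a fairly high first-pass score,
# segment 3 is a true hit that the first pass scored low.
features = [(1.0, 0.9), (0.9, 1.1), (-1.0, -0.8),
            (1.1, 1.0), (-0.9, -1.1), (-1.1, -1.0)]
scores = [0.95, 0.90, 0.60, 0.40, 0.20, 0.10]
order = rerank_with_prf(features, scores)
print(order)
```

In this toy setting the SVM is trained only on the two top-ranked and two bottom-ranked segments, yet the acoustically similar segment 3 moves ahead of the false alarm at segment 2 after re-ranking.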

Query-by-Example Spoken Term Detection using Attentive Pooling Networks

Query-by-example Spoken Term Detection using Attention-based Multi-hop Networks

Neural Network based End-to-End Query by Example Spoken Term Detection

Query-by-example Spoken Term Detection Based on Phonetic Posteriorgram

Cross-Lingual Query-by-Example Spoken Term Detection: A Transformer-Based Approach

Query-by-Example Keyword Spotting Using Spectral-Temporal Graph Attentive Pooling and Multi-Task Learning

Siamese Recurrent Auto-Encoder Representation for Query-by-Example Spoken Term Detection

Multilingual Bottleneck Features for Query by Example Spoken Term Detection

Hybrid Network Feature Extraction for Depression Assessment from Speech

Audio Sentiment Analysis by Heterogeneous Signal Features Learned from Utterance-Based Parallel Neural Network

Learning Frame-Level Recurrent Neural Networks Representations for Query-by-Example Spoken Term Detection on Mobile Devices

Enhanced Spoken Term Detection Using Support Vector Machines and Weighted Pseudo Examples

Acoustic Word Embedding System for Code-Switching Query-by-example Spoken Term Detection

Hypersphere Embedding and Additive Margin for Query-by-example Keyword Spotting

Attentive Pooling Networks

Attention-Based Audio Embeddings for Query-by-Example

Learning Contextual Representation with Convolution Bank and Multi-head Self-attention for Speech Emphasis Detection

MSQAT: A multi-dimension non-intrusive speech quality assessment transformer utilizing self-supervised representations

Essay-Anchor Attentive Multi-Modal Bilinear Pooling for Textbook Question Answering

Multilingual Query-by-Example Keyword Spotting with Metric Learning and Phoneme-to-Embedding Mapping

A Nonparametric Bayesian Approach for Spoken Term Detection by Example Query