Abstract: Information retrieval from spoken audio has attracted the attention of a number of research groups, in part driven by the recent NIST Spoken Term Detection (STD) evaluation. A common approach is to split the task into two stages. In the first, a large vocabulary continuous speech recognition (LVCSR) system is used to generate a word or phone lattice corresponding to the audio, and in the second, lattice search is used to determine likely occurrences of the search terms. Searching a word-based lattice works well for terms which occur in the LVCSR system's vocabulary. However, search terms naturally have a tendency toward proper nouns, which leads to higher out-of-vocabulary (OOV) rates than found in transcription tasks. A standard method for dealing with OOV terms is to generate a phone sequence corresponding to the terms, which may then be searched for in a phone lattice. In this work, we propose using context-dependent graphemes (CDG) as sub-word units for spoken term detection, in particular for out-of-vocabulary search terms. In essence, this approach moves pronunciation modelling away from the letter-to-sound rules which are used to generate phone strings, and into the Gaussian mixture models which describe the observation space. This removes the need to make potentially error-prone hard decisions at an early stage of processing. In addition, words which have multiple pronunciations have a single grapheme representation, which simplifies the subsequent search. Large text corpora can be used to train long-span grapheme-based language models for use in lattice generation. These language models have words implicit within them, though, given suitable smoothing, they can also support previously unseen words. In this work, we first present the results of phone and grapheme recognition, in addition to word recognition based on phone and grapheme sub-word units.
On the RT04s independent headset microphone (IHM) test condition, we find that the word error rate (WER) using phone sub-word units is lower than that with graphemes: 44.5% compared to 54.5%. The phone error rate (PER) is 48.2%, slightly higher than the grapheme error rate (GER) of 46.3%, though these are not directly comparable as there are fewer graphemes than phones. We then present results on a spoken term detection (STD) task. Again using the RT04s test set, 78 in-vocabulary words and 64 out-of-vocabulary words were selected as search terms from the reference transcription. HTK was used to generate word or sub-word lattices, and a tool developed at Brno [1] was used to …
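To illustrate why a grapheme representation needs no letter-to-sound step, the following is a minimal sketch of how a search term might be expanded into context-dependent grapheme units. The function name `context_dependent_graphemes`, the `left-centre+right` label format (borrowed from the notation commonly used for triphones), and the `sil` boundary symbol are all assumptions made for illustration; the abstract does not specify the unit inventory or labelling actually used.

```python
def context_dependent_graphemes(term: str, boundary: str = "sil") -> list[str]:
    """Expand a search term into left/right context-dependent grapheme labels.

    Hypothetical helper: labels follow a triphone-style "l-c+r" pattern,
    with `boundary` standing in for the context at the word edges.
    """
    # Keep alphabetic characters only and fold case, so that the
    # spelling alone determines the unit sequence.
    letters = [ch for ch in term.lower() if ch.isalpha()]

    units = []
    for i, centre in enumerate(letters):
        left = letters[i - 1] if i > 0 else boundary
        right = letters[i + 1] if i < len(letters) - 1 else boundary
        units.append(f"{left}-{centre}+{right}")
    return units


print(context_dependent_graphemes("Brno"))
# ['sil-b+r', 'b-r+n', 'r-n+o', 'n-o+sil']
```

Under these assumptions, an out-of-vocabulary term maps deterministically to a single unit sequence derived from its spelling, which could then be searched for in a grapheme lattice; no pronunciation variants have to be generated beforehand.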

Query-by-example Spoken Term Detection Based on Phonetic Posteriorgram

Neural Network based End-to-End Query by Example Spoken Term Detection

Query-by-example Spoken Term Detection using Attention-based Multi-hop Networks

A Nonparametric Bayesian Approach for Spoken Term detection by Example Query

Query-by-Example Spoken Term Detection using Attentive Pooling Networks

Cross-Lingual Query-by-Example Spoken Term Detection: A Transformer-Based Approach

Multilingual Bottleneck Features for Query by Example Spoken Term Detection

Siamese Recurrent Auto-Encoder Representation for Query-by-Example Spoken Term Detection

Weighted fast sequential DTW for multilingual audio Query-by-Example retrieval

Acoustic Word Embedding System for Code-Switching Query-by-example Spoken Term Detection

Multilingual Query-by-Example Keyword Spotting with Metric Learning and Phoneme-to-Embedding Mapping

Learning Frame-Level Recurrent Neural Networks Representations for Query-by-Example Spoken Term Detection on Mobile Devices

Investigation of Multilingual Deep Neural Networks for Spoken Term Detection.

Grapheme-based Spoken Term Detection in the Meetings Domain (extended abstract submitted to MLMI-07)

Enhanced Spoken Term Detection Using Support Vector Machines and Weighted Pseudo Examples

Query-by-Example Keyword Spotting Using Spectral-Temporal Graph Attentive Pooling and Multi-Task Learning

Stochastic Pronunciation Modeling for Out-of-Vocabulary Spoken Term Detection

A Posterior Probability-Based System Hybridisation and Combination for Spoken Term Detection

Spoken Term Detection Using Dynamic Match Subword Confusion Network

Hypersphere Embedding and Additive Margin for Query-by-example Keyword Spotting

Improved Semantic Retrieval of Spoken Content by Document/Query Expansion with Random Walk Over Acoustic Similarity Graphs