Abstract: Spoken content retrieval refers to directly indexing and retrieving spoken content based on the audio rather than text descriptions. This potentially eliminates the need to produce text descriptions of multimedia content for indexing and retrieval purposes, and makes it possible to locate the exact time at which the desired information appears in the multimedia. Spoken content retrieval has been achieved very successfully with the basic approach of cascading automatic speech recognition (ASR) with text information retrieval: after the spoken content is transcribed into text or lattice format, a text retrieval engine searches over the ASR output to find the desired information. This framework works well when ASR accuracy is relatively high, but becomes less adequate in more challenging real-world scenarios, since retrieval performance depends heavily on ASR accuracy. This challenge has led to the emergence of another approach to spoken content retrieval: going beyond the basic framework of cascading ASR with text retrieval so that retrieval performance is less dependent on ASR accuracy. This article is intended to provide a thorough overview of the concepts, principles, approaches, and achievements of major technical contributions along this line of investigation.
This includes five major directions: 1) Modified ASR for Retrieval Purposes: cascading ASR with text retrieval, but with the ASR modified or optimized for spoken content retrieval purposes; 2) Exploiting the Information Not Present in ASR Outputs: trying to utilize the information in speech signals that is inevitably lost when they are transcribed into phonemes and words; 3) Directly Matching at the Acoustic Level without ASR: for spoken queries, the signals can be matched directly at the acoustic level, rather than at the phoneme or word levels, bypassing all ASR issues; 4) Semantic Retrieval of Spoken Content: trying to retrieve spoken content that is semantically related to the query, but does not necessarily include the query terms themselves; 5) Interactive Retrieval and Efficient Presentation of the Retrieved Objects: with efficient presentation of the retrieved objects, an interactive retrieval process incorporating user actions may produce better retrieval results and user experiences.
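The basic cascade described above (ASR followed by text retrieval over the ASR output) can be illustrated with a minimal sketch. It assumes the recognizer has already produced a 1-best transcript with per-word start times and confidence scores; the `Word` type, the `build_index` and `search` helpers, and the toy transcripts are all hypothetical names and data introduced for illustration, not part of the article or of any particular toolkit.

```python
from collections import defaultdict
from typing import NamedTuple


class Word(NamedTuple):
    text: str     # recognized word (assumed 1-best ASR hypothesis)
    start: float  # start time in seconds within the recording
    score: float  # assumed ASR confidence in [0, 1]


def build_index(transcripts):
    """Build an inverted index: term -> list of (doc_id, start_time, score)."""
    index = defaultdict(list)
    for doc_id, words in transcripts.items():
        for w in words:
            index[w.text.lower()].append((doc_id, w.start, w.score))
    return index


def search(index, term, min_score=0.0):
    """Return occurrences of `term`, highest ASR confidence first."""
    hits = [h for h in index.get(term.lower(), []) if h[2] >= min_score]
    return sorted(hits, key=lambda h: h[2], reverse=True)


# Toy ASR output (hypothetical data, not from a real recognizer).
transcripts = {
    "lecture1": [Word("spoken", 0.0, 0.9), Word("content", 0.4, 0.8),
                 Word("retrieval", 0.9, 0.6)],
    "news7": [Word("retrieval", 12.3, 0.95), Word("engine", 12.9, 0.7)],
}

index = build_index(transcripts)
hits = search(index, "retrieval")
for doc_id, start, score in hits:
    print(f"{doc_id} @ {start:.1f}s (confidence {score:.2f})")
```

Each hit carries a time offset, so the result points at a position in the audio rather than at a whole document. The sketch also makes the dependence on ASR accuracy visible: a word the recognizer gets wrong never enters the index, so no amount of text retrieval downstream can recover it.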

Improved Semantic Retrieval of Spoken Content by Language Models Enhanced with Acoustic Similarity Graph

Improved Semantic Retrieval of Spoken Content by Document/Query Expansion with Random Walk Over Acoustic Similarity Graphs

Enhancing Query Expansion for Semantic Retrieval of Spoken Content with Automatically Discovered Acoustic Patterns

Semantic Query Expansion and Context-Based Discriminative Term Modeling for Spoken Document Retrieval

Improved open-vocabulary spoken content retrieval with word and subword lattices using acoustic feature similarity

Towards Unsupervised Semantic Retrieval Of Spoken Content With Query Expansion Based On Automatically Discovered Acoustic Patterns

Open-Vocabulary Retrieval of Spoken Content with Shorter/Longer Queries Considering Word/Subword-based Acoustic Feature Similarity

Improved Spoken Term Detection by Discriminative Training of Acoustic Models Based on User Relevance Feedback

Phonetic-and-Semantic Embedding of Spoken Words with Applications in Spoken Content Retrieval

Improved Spoken Term Detection by Feature Space Pseudo-Relevance Feedback

Improved Audio Embeddings by Adjacency-Based Clustering with Applications in Spoken Term Detection

Spoken Content Retrieval—Beyond Cascading Speech Recognition with Text Retrieval

Improved spoken term detection using support vector machines with acoustic and context features from pseudo-relevance feedback

Improved Spoken Term Detection with Graph-Based Re-Ranking in Feature Space

Improved Speech Summarization and Spoken Term Detection with Graphical Analysis of Utterance Similarities

LDA-Based Retrieval Framework for Semantic News Video Retrieval

Improved spoken term detection using support vector machines based on lattice context consistency

Integrating Recognition and Retrieval with User Feedback: A New Framework for Spoken Term Detection

SemanticAC: Semantics-Assisted Framework for Audio Classification

Semantic indexing and document retrieval for personalized language modeling

Incorporating Symbolic Sequential Modeling for Speech Enhancement