Query-by-Example Spoken Term Detection using Attentive Pooling Networks

Kun Zhang, Zhiyong Wu, Jia Jia, Helen Meng, Binheng Song
DOI: https://doi.org/10.1109/APSIPAASC47483.2019.9023023
2019-01-01
Abstract: Query-by-example spoken term detection (QbE-STD) is attractive because it is a key technology for retrieving and browsing spoken content without transcribing it into text. Several end-to-end models based on an encoder architecture have been proposed for QbE-STD, in which the input pair, a spoken query and an audio segment, is first projected into fixed-length vector representations by a feature extraction module, and a similarity measure module then outputs a detection score based on those representations. Attention mechanisms have been applied in the feature extractor; however, the traditional approach calculates an attention vector for the audio segment only, which makes it a one-way attention mechanism. In this paper, we present a novel feature extraction module based on a two-way attention mechanism, called attentive pooling networks, for end-to-end QbE-STD. The main idea is to learn a similarity measure over the projected input pair and extract information in such a way that the two input items can directly influence the computation of each other's representation. Evaluations on the LibriSpeech corpus and a cross-linguistic audio archive confirm the effectiveness of our proposed approach compared to the traditional ones.
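The two-way attention idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give the equations, so the tanh-scored bilinear similarity, the row-wise and column-wise max pooling, the softmax weighting, and the cosine scoring below follow the general attentive pooling formulation and should be read as assumptions. The names `attentive_pooling`, `cosine_similarity`, `Q`, `A`, and `U` are hypothetical, and the random matrices stand in for encoder outputs and a learned parameter.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def attentive_pooling(Q, A, U):
    """Two-way attention over a (query, audio segment) pair.

    Q: (m, d) frame-level features of the spoken query (assumed encoder output).
    A: (n, d) frame-level features of the audio segment (assumed encoder output).
    U: (d, d) similarity matrix, learned in a real model (random here).
    Returns one fixed-length vector of size d for each input.
    """
    # Soft alignment score between every query frame and every segment frame.
    G = np.tanh(Q @ U @ A.T)                # shape (m, n)

    # Each side's attention weights depend on the other side through G.
    q_weights = softmax(G.max(axis=1))      # shape (m,), row-wise max pooling
    a_weights = softmax(G.max(axis=0))      # shape (n,), column-wise max pooling

    # Attention-weighted sums give the fixed-length representations.
    r_q = q_weights @ Q                     # shape (d,)
    r_a = a_weights @ A                     # shape (d,)
    return r_q, r_a


def cosine_similarity(u, v):
    """Assumed similarity measure used as the detection score."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))


# Toy usage with random stand-in features.
rng = np.random.default_rng(0)
d = 8
Q = rng.standard_normal((5, d))     # 5 query frames
A = rng.standard_normal((12, d))    # 12 segment frames
U = rng.standard_normal((d, d)) * 0.1

r_q, r_a = attentive_pooling(Q, A, U)
score = cosine_similarity(r_q, r_a)
```

In a one-way scheme, only `a_weights` would be computed and the query would be pooled independently of the segment; here both weight vectors are derived from the shared matrix `G`, so each input influences the other's representation.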