Enhancing Audio Retrieval with Attention-based Encoder for Audio Feature Representation

Feiyang Xiao, Qiaoxi Zhu, Jian Guan, Wenwu Wang
DOI: https://doi.org/10.23919/eusipco58844.2023.10290096
2023-01-01
Abstract: Pretrained audio neural networks (PANNs) have been successful in a range of machine audition applications. However, their limited ability to recognise relationships between acoustic scenes and events hurts their performance in language-based audio retrieval, the task of retrieving audio signals from a dataset given natural-language text queries. This paper proposes an attention-based audio encoder that exploits contextual associations between acoustic scenes and events, using either self-attention or graph attention, trained with different loss functions. Experimental results show that the proposed attention-based method outperforms most state-of-the-art methods, with self-attention performing better than graph attention. In addition, the choice of loss function (NT-Xent loss or supervised contrastive loss) has less impact on the results than the choice of attention strategy.
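To make the NT-Xent loss named in the abstract concrete, here is a minimal sketch of one common symmetric form, applied to a batch of paired audio and text embeddings. It may differ from the paper's exact formulation. The function names, the temperature value of 0.07, and the toy embeddings are illustrative assumptions, not details taken from the paper.

```python
import math


def _normalise(vec):
    """Scale a vector to unit L2 norm so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _mean_diag_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(z - row_max) for z in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def nt_xent_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric NT-Xent loss (hypothetical helper, temperature is an assumed value).

    audio_emb[i] and text_emb[i] are treated as a matching pair; every other
    item in the batch acts as a negative.
    """
    audio = [_normalise(v) for v in audio_emb]
    text = [_normalise(v) for v in text_emb]
    # Temperature-scaled cosine similarity between every audio and every text item.
    logits = [
        [sum(a * t for a, t in zip(a_vec, t_vec)) / temperature for t_vec in text]
        for a_vec in audio
    ]
    transposed = [list(col) for col in zip(*logits)]
    audio_to_text = _mean_diag_cross_entropy(logits)
    text_to_audio = _mean_diag_cross_entropy(transposed)
    return 0.5 * (audio_to_text + text_to_audio)


# Toy embeddings, made up for illustration only.
toy_audio = [[1.0, 0.0, 0.2], [0.1, 1.0, 0.0], [0.0, 0.3, 1.0]]
aligned_loss = nt_xent_loss(toy_audio, toy_audio)
mismatched_loss = nt_xent_loss(toy_audio, toy_audio[1:] + toy_audio[:1])
```

In this sketch, correctly paired embeddings give a lower loss than the same embeddings paired in a shifted order, which is the property a retrieval model is trained to exploit.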
What problem does this paper attempt to address?