Cross-Modal Interaction via Reinforcement Feedback for Audio-Lyrics Retrieval

Dong Zhou, Fang Lei, Lin Li, Yongmei Zhou, Aimin Yang
DOI: https://doi.org/10.1109/taslp.2024.3358048
2024-01-01
Abstract: Retrieving audio content relevant to lyric queries, and vice versa, plays a critical role in music-oriented applications. This task requires learning robust feature representations for both modalities and capturing the interactions between them at a fine-grained level. Existing approaches can effectively extract modal representations and perform cross-modal retrieval through alignment. However, they model interactions between audio and lyrics in a coarse-grained manner. In particular, interactions between the input features and the enhanced representations produced by the alignment module are largely ignored, resulting in low-quality modality representations for final retrieval. This paper presents a novel method named CMRF that accomplishes cross-modal interactions via a reinforcement feedback procedure to learn high-quality multi-modal embeddings. First, we implicitly fuse representations across modalities via directional pairwise cross-modal attention. Then, our approach recurrently selects key components of these high-level features to interact with the original input features via reinforcement learning, thereby improving the quality of the multi-modal embeddings. In addition, we introduce a novel audio-lyrics dataset, AL-song, which consists of audio paired with corresponding lyrics for the audio-lyrics retrieval task. Experimental results on the AL-song dataset and the benchmark dataset Sounddescs demonstrate the effectiveness and efficiency of CMRF compared with state-of-the-art methods.
engineering, electrical & electronic; acoustics
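
The abstract names directional pairwise cross-modal attention as the first stage of CMRF but does not spell it out. The following is a minimal sketch of what such attention commonly looks like, assuming standard scaled dot-product attention in which one modality supplies the queries and the other supplies the keys and values. It is not the authors' implementation: the function name cross_modal_attention, the feature dimensions, and the random projection matrices are all hypothetical, and the reinforcement feedback stage is not shown.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(target, source, w_q, w_k, w_v):
    """One attention direction: `target` frames attend over `source` frames.

    target: (T_t, d_t) features of the modality being enhanced
    source: (T_s, d_s) features of the modality providing context
    Returns the enhanced target features (T_t, d) and the attention weights.
    """
    q = target @ w_q                          # (T_t, d)
    k = source @ w_k                          # (T_s, d)
    v = source @ w_v                          # (T_s, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (T_t, T_s)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights

# Hypothetical sizes: 5 audio frames, 7 lyric tokens, shared dimension 4.
rng = np.random.default_rng(0)
d_audio, d_lyrics, d = 8, 6, 4
audio = rng.normal(size=(5, d_audio))
lyrics = rng.normal(size=(7, d_lyrics))

# Direction 1 (lyrics -> audio): audio queries attend over lyric tokens.
audio_enh, l2a_w = cross_modal_attention(
    audio, lyrics,
    rng.normal(size=(d_audio, d)),
    rng.normal(size=(d_lyrics, d)),
    rng.normal(size=(d_lyrics, d)),
)

# Direction 2 (audio -> lyrics): lyric queries attend over audio frames.
lyrics_enh, a2l_w = cross_modal_attention(
    lyrics, audio,
    rng.normal(size=(d_lyrics, d)),
    rng.normal(size=(d_audio, d)),
    rng.normal(size=(d_audio, d)),
)

print(audio_enh.shape, lyrics_enh.shape)
```

Each direction uses its own projections, so the two enhanced representations are not symmetric; in a trained model these matrices would be learned rather than sampled at random.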