Audio-Visual Embedding for Cross-Modal MusicVideo Retrieval through Supervised Deep CCA

Donghuo Zeng, Yi Yu, Keizo Oyama
DOI: https://doi.org/10.48550/arXiv.1908.03744
2019-08-10
Abstract: Deep learning has shown excellent performance in learning joint representations between different data modalities. Unfortunately, little research focuses on cross-modal correlation learning where the temporal structures of different data modalities, such as audio and video, should be taken into account. Retrieving music videos with a given piece of musical audio is a natural way to search and interact with music content. In this work, we study cross-modal music video retrieval in terms of emotion similarity. In particular, audio of an arbitrary length is used to retrieve a longer or full-length music video. To this end, we propose a novel audio-visual embedding algorithm based on Supervised Deep Canonical Correlation Analysis (S-DCCA) that projects audio and video into a shared space to bridge the semantic gap between them. This also preserves the temporal structure and the similarity between audio and visual contents from different videos with the same class label. The contribution of our approach lies mainly in two aspects: i) We propose to select the top k audio chunks with an attention-based Long Short-Term Memory (LSTM) model, which yields a good audio summarization with local properties. ii) We propose an end-to-end deep model for cross-modal audio-visual learning where S-DCCA is trained to learn the semantic correlation between audio and visual modalities. Due to the lack of a music video dataset, we construct a 10K music video dataset from the YouTube-8M dataset. Promising results in terms of MAP and precision-recall show that our proposed model can be applied to music video retrieval.
Multimedia, Information Retrieval
What problem does this paper attempt to address?
The problem this paper addresses is cross-modal music video retrieval based on emotion similarity. Specifically, it studies how to use a musical audio clip of any length to retrieve a longer or full-length music video with similar emotion. This problem is significant in the fields of multimedia and computer vision, especially for searching and interacting with music content.

### Background and Problem Description of the Paper

With the rapid growth in the number of music videos on the Internet, it has become possible to learn the correlation between audio and video. Music videos usually contain two modalities, visual and audio, both of which unfold over time to express the theme and story of the music. Music videos also convey strong emotions, which are reflected in both the audio and visual modalities. The goal of the paper is therefore to learn a joint embedding space in which musical audio and visual content are semantically consistent.

### Main Challenges

1. **Differences in Low-level Features of Different Modalities**: Audio and video are different modalities with different low-level features and temporal structures.
2. **Variable-length Audio Queries**: Users can use audio segments of any length as queries, and the system needs to find longer or full-length music videos with similar emotion.
3. **Learning of Cross-modal Correlation**: The temporal structures of both audio and video must be considered together to learn the semantic correlation between them.

### Solutions

To address these challenges, the paper proposes the following methods:

1. **Supervised Deep Canonical Correlation Analysis (S-DCCA)**:
   - **Joint Feature Space**: Project audio and video into a shared space through S-DCCA to bridge the semantic gap between the modalities.
   - **Temporal Structure Preservation**: During projection, preserve the temporal structure and the similarity between audio and visual content from different videos that share the same class label.
2. **Attention Mechanism**:
   - **Selection of Representative Audio Segments**: Use an attention-based Long Short-Term Memory (LSTM) model to select the top k most representative audio segments (chunks), which summarize the emotional features of the audio while preserving the temporal structure.
3. **End-to-End Deep Model**:
   - **Cross-modal Audio-Visual Learning**: Propose an end-to-end deep architecture in which S-DCCA is trained to learn the semantic correlation between audio and visual modalities.

### Experiments and Results

- **Dataset**: A dataset of 10,000 music videos was constructed from the YouTube-8M dataset.
- **Evaluation Metrics**: Recall, Precision, and Mean Average Precision (MAP) were used for evaluation.
- **Experimental Setup**: 5-fold cross-validation was used, with a training batch size of 512, a test batch size of 64, and 50 training epochs.

### Experimental Results

- **Precision-Recall Curve**: Shows the performance of different models under different configurations; the S-DCCA-extend2 model performs best in most cases.
- **MAP Results**: Table II and Figure 9 show that the S-DCCA model performs well under different configurations, with the best performance achieved when 3 out of 9 audio segments are used as queries.

### Conclusion

The S-DCCA model proposed in the paper performs well on the cross-modal music video retrieval task, especially in handling variable-length audio queries and preserving the temporal structure. Future work could explore more diverse datasets and more complex model structures to improve retrieval accuracy and robustness.
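
The MAP metric listed under Evaluation Metrics can be illustrated with a minimal sketch. The summary does not give the paper's exact evaluation protocol, so this assumes that a retrieved video counts as relevant when it shares the query audio's class label and that every candidate video is ranked by its similarity to the query in the shared space. The function names and the `similarity`, `query_labels`, and `item_labels` parameters are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def average_precision(relevant: Sequence[bool]) -> float:
    """Average precision for one ranked result list.

    relevant[i] is True when the item at rank i + 1 is relevant to the query
    (assumed here to mean: it shares the query's class label).
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            hits += 1
            # Precision measured at the rank of each relevant item.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(
    similarity: Sequence[Sequence[float]],
    query_labels: Sequence[int],
    item_labels: Sequence[int],
) -> float:
    """Mean of per-query average precision.

    similarity[q][i] is the score between audio query q and video i,
    for example a cosine similarity in the learned shared space (assumption).
    """
    ap_values: List[float] = []
    for scores, q_label in zip(similarity, query_labels):
        # Rank all candidate videos from most to least similar.
        ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        relevant = [item_labels[i] == q_label for i in ranking]
        ap_values.append(average_precision(relevant))
    return sum(ap_values) / len(ap_values) if ap_values else 0.0
```

For example, a query whose ranked results are relevant at ranks 1 and 3 only has an average precision of (1/1 + 2/3) / 2, about 0.83. Averaging this value over all audio queries gives the MAP.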