Multiple Kernel Visual-Auditory Representation Learning for Retrieval

Hong Zhang, Wenping Zhang, Wenhe Liu, Xin Xu, Hehe Fan
DOI: https://doi.org/10.1007/s11042-016-3294-5
IF: 2.577
2016-01-01
Multimedia Tools and Applications
Abstract: Cross-media data representation, which concerns the semantic understanding of multimedia data across different modalities, is an increasingly active topic in web media data analysis. Its most challenging issues are how to find the underlying content-level correlations in the data and how to use those correlations in the representation model. Most traditional work on web media data analysis is based on single-modality data sources, such as Flickr images or YouTube videos, leaving cross-media data representation and semantic understanding largely unexplored. In this paper, we propose a multiple kernel visual-auditory representation learning approach, which learns cross-media correlations from the visual and auditory feature spaces using multiple kernel strategies. We also give a cross-media distance measure for image-audio retrieval in the mutual subspace of co-occurrence. Experimental results on the collected image-audio database are encouraging and show that our approach is effective from multiple perspectives.