Advances in Theory and Technology of Cross-Media Intelligent Association Analysis

Yu Junqing, Wang Xin, Kuang Kun, Liu Si, Zhang Xinfeng, Song Zikai
DOI: https://doi.org/10.3724/SP.J.1089.2023.19296
2023-01-01
Abstract: This paper analyzes recent research trends in the theory and technology of cross-media intelligent correlation analysis and semantic understanding. It covers five topics: unified representation of cross-media information, knowledge-guided data fusion, cross-media correlation analysis, cross-media knowledge graphs, and multi-modal intelligent applications. Unified representation is a precondition for analyzing and reasoning over multi-modal information. The semantic consistency across modalities is used to eliminate redundant information, and a unified representation is achieved through cross-modal interconversion, which yields more comprehensive feature representations. Cross-media association analysis focuses on image-language, video-language, and audio-video-language settings, aiming to bridge the semantic gap among the visual, auditory, and language modalities and to fully establish the semantic associations between them. The paper reviews cross-media knowledge graph construction, cross-media knowledge graph embedding, and cross-media knowledge inference; knowledge-graph-based cross-media representation enhances reliability and improves the efficiency and accuracy of subsequent inference tasks. With the rapid development of cross-modal analysis, multi-modal intelligent applications are supported by a growing range of technologies. According to the domain knowledge required, this paper selects cross-modal applications including multi-modal visual question answering, multi-modal video summarization, multi-modal visual pattern mining, multi-modal recommendation, cross-modal intelligent inference, and cross-modal medical image prediction, and compares and reviews their research progress in terms of multi-modal fusion and cross-media inference.