Abstract: With the rapid development of Internet and multimedia services in the past decade, a huge amount of user-generated and service provider-generated multimedia data has become available. These data are heterogeneous and multi-modal in nature, which poses great challenges for processing and analyzing them. Multi-modal data consist of a mixture of data from different modalities such as text, images, video, and audio. In this article, we present a deep and comprehensive overview of multi-modal analysis in multimedia. We introduce two scientific research problems: data-driven correlational representation and knowledge-guided fusion for multimedia analysis. We investigate them from the following aspects: 1) multi-modal correlational representation: multi-modal fusion of data across different modalities, and 2) multi-modal data and knowledge fusion: multi-modal fusion of data with domain knowledge. More specifically, on data-driven correlational representation, we highlight three important categories of methods: multi-modal deep representation, multi-modal transfer learning, and multi-modal hashing. On knowledge-guided fusion, we discuss approaches for fusing knowledge with data and four exemplar applications that require various kinds of domain knowledge: multi-modal visual question answering, multi-modal video summarization, multi-modal visual pattern mining, and multi-modal recommendation. Finally, we bring forward our insights and future research directions.

What problem does this paper attempt to address?

This paper addresses the multimodal analysis of multimedia data. Specifically, it focuses on two core scientific problems:

1. **Data-driven Correlational Representation**: How to establish an effective correlational representation between data of different modalities, so that multimodal methods can surpass methods that use only single-modality information. Solving this problem is crucial for improving the efficiency and accuracy of multimodal data analysis. The paper notes that traditional multimodal analysis methods fall into two major categories, feature fusion and semantic fusion, but these methods are either inefficient or unable to fully exploit the rich information in multimodal data. With the successful application of deep neural networks, a newer approach, intermediate fusion, can fuse information from different modalities at an intermediate level in the hidden space, thus making better use of multimodal data.

2. **Knowledge-guided Data Fusion**: How to increase the interpretability of the model through the guidance of domain knowledge while maintaining the scalability of data-driven methods. Although data-driven methods perform well on large-scale multimodal data, their results sometimes lack interpretability, especially when facing uncertain big data. Humans can use domain knowledge to assist decision-making, which improves both the interpretability and the accuracy of the decisions. Combining data-driven and knowledge-guided methods, and finding the balance between the two, has therefore become an important research direction. The paper discusses three families of methods that may be suitable for knowledge-guided cross-modal fusion: Bayesian inference, teacher-student networks, and reinforcement learning.
In summary, this paper explores key problems in multimodal data analysis, especially data-driven correlational representation and knowledge-guided data fusion, and proposes directions for future research, such as cross-modal reasoning, cross-modal cognition, and cross-modal collective intelligence, in order to promote the further development of multimodal analysis in the multimedia field.
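
To make the difference between feature fusion, semantic fusion, and intermediate fusion more concrete, the following is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the function names, the toy feature vectors, the layer sizes, and the random weights (standing in for trained parameters) are all hypothetical. Feature fusion is treated here as concatenating raw features before a single classifier (often called early fusion), semantic fusion as averaging per-modality decisions (often called late fusion), and intermediate fusion as merging per-modality hidden vectors before a shared classifier.

```python
import math
import random

def linear(x, weights, bias):
    """Dense layer: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init(rows, cols, rng):
    """Hypothetical random weights, standing in for trained parameters."""
    weights = [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]
    return weights, [0.0] * rows

def feature_fusion(text, image, clf):
    # Early fusion: concatenate raw features, then apply one classifier.
    return sigmoid(linear(text + image, *clf)[0])

def semantic_fusion(text, image, text_clf, image_clf):
    # Late fusion: score each modality separately, then average the decisions.
    p_text = sigmoid(linear(text, *text_clf)[0])
    p_image = sigmoid(linear(image, *image_clf)[0])
    return 0.5 * (p_text + p_image)

def intermediate_fusion(text, image, text_enc, image_enc, head):
    # Intermediate fusion: encode each modality into a hidden space,
    # merge the hidden vectors, then classify the joint representation.
    h_text = relu(linear(text, *text_enc))
    h_image = relu(linear(image, *image_enc))
    return sigmoid(linear(h_text + h_image, *head)[0])

rng = random.Random(0)
text_feat = [0.2, -0.5, 0.9]         # toy 3-d text feature (assumption)
image_feat = [0.7, 0.1, -0.3, 0.4]   # toy 4-d image feature (assumption)

p_early = feature_fusion(text_feat, image_feat, init(1, 7, rng))
p_late = semantic_fusion(text_feat, image_feat, init(1, 3, rng), init(1, 4, rng))
p_mid = intermediate_fusion(text_feat, image_feat,
                            init(5, 3, rng), init(5, 4, rng), init(1, 10, rng))
print(p_early, p_late, p_mid)
```

In this sketch only the intermediate variant lets the final classifier see learned hidden features from both modalities at once, which is the property the summary attributes to intermediate fusion. A real system would learn the encoders and the head jointly rather than use random weights.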

Multi-modal Deep Analysis for Multimedia
