Abstract: Many real-world applications involve data from multiple modalities and thus exhibit view heterogeneity. For example, user modeling on social media might leverage both the topology of the underlying social network and the content of the users' posts; in the medical domain, multiple views could be X-ray images taken at different poses. To date, various techniques have been proposed that achieve promising results, such as methods based on canonical correlation analysis. Meanwhile, it is critical for decision-makers to be able to understand the prediction results from these methods. For example, given the diagnostic result that a model provides based on the X-ray images of a patient at different poses, the doctor needs to know why the model made such a prediction. However, state-of-the-art techniques usually fail to utilize the complementary information of each view and to explain their predictions in an interpretable manner. To address these issues, we propose a deep co-attention network for multi-view subspace learning, which aims to extract both the common information and the complementary information in an adversarial setting and to provide end-users with robust interpretations of the predictions via the co-attention mechanism. In particular, it uses a novel cross reconstruction loss and leverages the label information to guide the construction of the latent representation by incorporating the classifier into the model. This improves the quality of the latent representation and accelerates convergence. Finally, we develop an efficient iterative algorithm to find the optimal encoders and discriminator, which are evaluated extensively on synthetic and real-world data sets. We also conduct a case study to demonstrate how the proposed method robustly interprets the predictions on an image data set.
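The abstract names a cross reconstruction loss without giving its form. As a rough illustration only, the sketch below shows one common reading of the term: each view is encoded into a shared latent space and then decoded into the *other* view, and the reconstruction error is summed over both directions. The linear encoders and decoders, the function name `cross_reconstruction_loss`, and the synthetic data are all assumptions made for this example; they are not the paper's architecture, which is deep and adversarial.

```python
import numpy as np

def cross_reconstruction_loss(x1, x2, enc1, enc2, dec1, dec2):
    """Sum of mean squared errors when each view is rebuilt from the
    other view's latent code (hypothetical linear encoders/decoders)."""
    z1 = x1 @ enc1          # latent code computed from view 1, shape (n, k)
    z2 = x2 @ enc2          # latent code computed from view 2, shape (n, k)
    x2_from_1 = z1 @ dec2   # view 2 reconstructed from view 1's code
    x1_from_2 = z2 @ dec1   # view 1 reconstructed from view 2's code
    err_12 = np.mean((x2_from_1 - x2) ** 2)
    err_21 = np.mean((x1_from_2 - x1) ** 2)
    return float(err_12 + err_21)

# Synthetic two-view data generated from one shared latent factor z.
rng = np.random.default_rng(0)
n, d1, d2, k = 8, 5, 4, 3
z = rng.normal(size=(n, k))
mix1 = rng.normal(size=(k, d1))   # assumed mixing matrix for view 1
mix2 = rng.normal(size=(k, d2))   # assumed mixing matrix for view 2
x1, x2 = z @ mix1, z @ mix2

# Encoders that invert the mixing recover z, so the cross loss vanishes.
loss_ideal = cross_reconstruction_loss(
    x1, x2, np.linalg.pinv(mix1), np.linalg.pinv(mix2), mix1, mix2)

# Random encoders do not align the two latent spaces.
loss_random = cross_reconstruction_loss(
    x1, x2, rng.normal(size=(d1, k)), rng.normal(size=(d2, k)), mix1, mix2)

print(loss_ideal, loss_random)
```

In this toy setting the loss is near zero only when both encoders map their view to the same latent code, which is the intuition behind using such a term to pull the views into a common subspace.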

Learning Effective Representations from Sparse Multimodal Data on Content Curation Social Networks

Multimodal Joint Representation for User Interest Analysis on Content Curation Social Networks

Recommendations for Different Tasks Based on the Uniform Multimodal Joint Representation

Multimodal Learning of Social Image Representation by Exploiting Social Relations

Learning Explicit and Implicit Latent Common Spaces for Audio-Visual Cross-Modal Retrieval

MLLDA: Multi-level LDA for Modelling Users on Content Curation Social Networks

Deep Co-Attention Network for Multi-View Subspace Learning

Multimodal sparse representation learning and applications

From Content to Links: Social Image Embedding with Deep Multimodal Model

Multimodal learning based approaches for link prediction in social networks

Multimodal visual dictionary learning via heterogeneous latent semantic sparse coding

Multimodal graph convolutional networks for high quality content recognition

Learning Discriminative Representations for Semantic Cross Media Retrieval

Deep Unified Multimodal Embeddings for Understanding both Content and Users in Social Media Networks

A Reliable Cross-Site User Generated Content Modeling Method Based on Topic Model

Multi-modal Learning for Social Image Classification

Learning Socially Embedded Visual Representation from Scratch

A Deep Approach For Multi-Modal User Attribute Modeling

Multimodal Sentiment Analysis Based on Disentangled Representation Learning and Cross-Modal-context Association Mining

Social Image Sentiment Analysis by Exploiting Multimodal Content and Heterogeneous Relations

Weighted Graph-structured Semantics Constraint Network for Cross-Modal Retrieval