Abstract: Graph neural networks (GNNs) have shown great potential for personalized recommendation. The core idea is to reorganize interaction data as a user-item bipartite graph and exploit high-order connectivity among user and item nodes to enrich their representations. While achieving great success, most existing works consider interaction graphs based only on ID information, forgoing item contents from multiple modalities (e.g., visual, acoustic, and textual features of micro-video items). Distinguishing personal interests on different modalities at a granular level was not explored until the recently proposed MMGCN (Wei et al., 2019). However, it simply employs GNNs on parallel interaction graphs and treats information propagated from all neighbors equally, failing to capture user preference adaptively. Hence, the obtained representations might preserve redundant, even noisy information, leading to non-robustness and suboptimal performance. In this work, we investigate how to adopt GNNs on multimodal interaction graphs to adaptively capture user preference on different modalities and offer in-depth analysis of why an item is suitable for a user. To this end, we propose a new Multimodal Graph Attention Network, MGAT for short, which disentangles personal interests at the granularity of modality. In particular, built upon multimodal interaction graphs, MGAT conducts information propagation within individual graphs, while leveraging a gated attention mechanism to identify the varying importance scores of different modalities to user preference. As such, it is able to capture more complex interaction patterns hidden in user behaviors and provide more accurate recommendations. Empirical results on two micro-video recommendation datasets, Tiktok and MovieLens, show that MGAT exhibits substantial improvements over state-of-the-art baselines such as NGCF (Wang, He, et al., 2019) and MMGCN (Wei et al., 2019). Further analysis on a case study illustrates how MGAT generates attentive information flow over multimodal interaction graphs.
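
The abstract describes two ingredients: propagation within each modality's interaction graph, and a gate that weighs how much each modality contributes to a user's preference. The following is a minimal illustrative sketch of that general idea in plain Python, not the authors' formulation. The dot-product attention, the sigmoid gate, the sum-based fusion, and all function and variable names (`propagate_modality`, `fuse_modalities`, the toy vectors) are assumptions made for illustration only.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def propagate_modality(user_vec, neighbor_vecs):
    """Aggregate neighbor item features within ONE modality graph.

    Assumption: attention weights come from a softmax over user-item
    dot products, so neighbors are no longer treated equally.
    """
    weights = softmax([dot(user_vec, v) for v in neighbor_vecs])
    message = [
        sum(w * v[d] for w, v in zip(weights, neighbor_vecs))
        for d in range(len(user_vec))
    ]
    return message, weights


def fuse_modalities(user_vecs, neighbors_by_modality):
    """Combine per-modality messages using a scalar gate per modality.

    Assumption: the gate is sigmoid(user . message), and gated messages
    are summed. Both choices are hypothetical simplifications.
    """
    fused = None
    gates = {}
    for modality, user_vec in user_vecs.items():
        message, _ = propagate_modality(user_vec, neighbors_by_modality[modality])
        gate = sigmoid(dot(user_vec, message))
        gates[modality] = gate
        gated = [gate * x for x in message]
        fused = gated if fused is None else [a + b for a, b in zip(fused, gated)]
    return fused, gates


# Toy data (hypothetical): one user, two interacted items, two modalities.
user_vecs = {"visual": [1.0, 0.0], "textual": [0.0, 0.0]}
neighbors = {
    "visual": [[1.0, 0.0], [0.0, 1.0]],
    "textual": [[1.0, 1.0], [2.0, 0.0]],
}
fused, gates = fuse_modalities(user_vecs, neighbors)
print(gates)
print(fused)
```

In this toy setup the user's textual vector is all zeros, so the textual gate sits at the neutral value of 0.5, while the visual gate is larger because the user's visual vector aligns with one of the interacted items. Inspecting per-modality gates in this way is one plausible route to the kind of modality-level analysis the abstract mentions.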

Hierarchical Multi-Modal Attention Network for Time-Sync Comment Video Recommendation

Multimodal-enhanced hierarchical attention network for video captioning

Deep Hierarchical Attention Network for Video Description

Attention-guided Multi-step Fusion: A Hierarchical Fusion Network for Multimodal Recommendation

HTDA: Hierarchical Time-Based Directional Attention Network for Sequential User Behavior Modeling

Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations

HMNet: a Hierarchical Multi-Modal Network for Educational Video Concept Prediction

Dynamic Memory based Attention Network for Sequential Recommendation

Transformer-Based Interactive Multi-Modal Attention Network for Video Sentiment Detection

MMGCN: Multi-modal Graph Convolution Network for Personalized Recommendation of Micro-video

Temporal Hierarchical Attention at Category- and Item-Level for Micro-Video Click-Through Prediction

User-Video Co-Attention Network for Personalized Micro-video Recommendation

Attention-Based Multimodal Fusion for Video Description

Human Conversation Analysis Using Attentive Multimodal Networks with Hierarchical Encoder-Decoder

HANet: Hierarchical Alignment Networks for Video-Text Retrieval

MGAT: Multimodal Graph Attention Network for Recommendation

Multimodal Semantic Attention Network for Video Captioning

Context-Dependent Propagating-Based Video Recommendation in Multimodal Heterogeneous Information Networks

HCMS: Hierarchical and Conditional Modality Selection for Efficient Video Recognition

Mixed Attention Network for Cross-domain Sequential Recommendation