Abstract: While recommender systems with multi-modal item representations (image, audio, and text) have been widely explored, learning recommendations from multi-modal user interactions (e.g., clicks and speech) remains an open problem. We study the case of multi-modal user interactions in a setting where users engage with a service provider through multiple channels (website and call center). In such cases, incomplete modalities naturally occur, since not all users interact through all the available channels. To address these challenges, we publish a real-world dataset that allows progress in this under-researched area. We further present and benchmark various methods for leveraging multi-modal user interactions for item recommendations, and propose a novel approach that specifically deals with missing modalities by mapping user interactions to a common feature space. Our analysis reveals important interactions between the different modalities and shows that a frequently occurring modality can enhance learning from a less frequent one.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper aims to address the utilization of multimodal user interactions (e.g., clicks and voice) in recommendation systems. Specifically, the authors focus on scenarios where users interact with service providers through multiple channels (such as websites and call centers). In such cases, incomplete modal information naturally arises because not all users interact through all available channels. Existing multimodal recommendation systems primarily focus on multimodal item representations (such as images, audio, and text) but have not effectively addressed the issue of missing modalities in multimodal user interactions.

### Main Contributions

1. **Dataset Release**: Created and released a real-world dataset containing multimodal user interactions to advance research in this field. The dataset comes from a company dealing with personal insurance products and includes user website session records, user call records with insurance agents, and purchase behaviors.
2. **Method Comparison**: Proposed various methods to utilize multimodal user interactions for recommendations and conducted experimental comparisons. These methods include existing imputation methods, knowledge distillation methods, and a new approach that handles missing modalities by mapping different modalities into a common feature space.
3. **In-depth Analysis**: Conducted an in-depth analysis of the experimental results, revealing the interactions between different modalities and exploring how to extract information from frequently occurring modalities to enhance learning for infrequent modalities.

### Research Background

- **Multimodal Recommendation Systems**: Existing research mainly focuses on multimodal item representations, neglecting multimodal user interactions.
- **Insurance Domain**: User feedback is sparse because there are few types of insurance products and users rarely interact with them. Most previous work has increased user feedback by supplementing demographic information.
- **Conversation-based Recommendations**: Existing research mainly focuses on generating recommendations using past conversations but does not integrate other modalities.

### Dataset Description

- **Time Range**: May 1, 2022, to April 30, 2023.
- **Data Source**: Commercial insurance company.
- **Data Types**: User website session records, user call records with insurance agents, purchase behaviors.
- **Data Preprocessing**: Removed low-frequency items, deduplicated repeated actions, removed short conversations and sessions, truncated long conversations and sessions.

### Methods

- **Problem Formalization**: The goal is to predict which items a user will purchase next, given their past conversations and website sessions.
- **Existing Methods**:
  - **Popular Recommendation**: Recommends the most frequently purchased items.
  - **Conversation Model**: Model trained solely on conversations.
  - **Website Session Model**: Model trained solely on website sessions.
  - **Late Fusion**: Combines the outputs of conversation and website session models.
  - **Knowledge Distillation**: Trains a joint model on users who have both conversation and website session data.
  - **Generative Imputation**: Generates data for missing modalities.
  - **Neutral Imputation**: Fills missing modalities with average or most frequent sessions.
- **Proposed Methods**:
  - **Keyword Model**: Represents conversations as keywords and matches them with actions in website sessions.
  - **Latent Feature Model**: Maps conversations and website sessions into a common latent feature space.
  - **Relative Representation Model**: Uses relative representation methods to compare and integrate different modalities.

### Experimental Results

- **Keyword Model**: Improved recommendation accuracy by unifying the representation of conversations and website sessions through keyword matching.
- **Latent Feature Model**: Effectively handled the missing modality problem by mapping different modalities into a common feature space.
- **Relative Representation Model**: Further enhanced the comparison and integration between different modalities through relative representation methods.

### Conclusion

The paper provides significant advancements in the application of multimodal user interactions in recommendation systems by creating a real-world dataset and proposing new methods. Experimental results show that mapping different modalities into a common feature space can effectively handle the missing modality problem and improve the performance of recommendation systems.
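The late-fusion baseline listed under Methods combines the outputs of the conversation model and the website session model. The summary does not say how the two outputs are combined or what happens when a user has only one modality, so the following is only a minimal sketch under stated assumptions: a weighted average of per-item scores, a fallback to the single available model when one modality is missing, and a fallback to item popularity when both are missing. The names `late_fusion`, `top_k`, and `conv_weight` are hypothetical and chosen for illustration.

```python
from typing import List, Optional, Sequence


def late_fusion(
    conv_scores: Optional[Sequence[float]],
    web_scores: Optional[Sequence[float]],
    popularity: Sequence[float],
    conv_weight: float = 0.5,
) -> List[float]:
    """Fuse per-item scores from two single-modality models.

    A modality the user never used is passed as None.
    Assumption (not from the paper): weighted averaging, with
    fallbacks to the remaining model or to item popularity.
    """
    if conv_scores is None and web_scores is None:
        # No interaction data at all: fall back to popularity.
        return list(popularity)
    if conv_scores is None:
        # Only website sessions are available.
        return list(web_scores)
    if web_scores is None:
        # Only conversations are available.
        return list(conv_scores)
    # Both modalities present: weighted average of the two score lists.
    return [
        conv_weight * c + (1.0 - conv_weight) * w
        for c, w in zip(conv_scores, web_scores)
    ]


def top_k(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring items."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy example with two items (all numbers are made up).
popularity = [0.5, 0.5]
fused = late_fusion([0.2, 0.8], [0.6, 0.4], popularity)
recommended = top_k(fused, k=1)
```

A fixed weight keeps the sketch simple; in practice the weight would likely be tuned on validation data, or replaced by a small learned combiner.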
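The relative representation model listed under Methods is described only at a high level. One common formulation of relative representations describes each sample by its similarities to a fixed set of anchor samples, which makes embeddings from different encoders comparable. The sketch below assumes that formulation with cosine similarity, and assumes the anchors are users who have both a conversation and a website session; neither choice is confirmed by the summary, and the helper names and toy vectors are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def relative_representation(
    embedding: Sequence[float], anchors: Sequence[Sequence[float]]
) -> List[float]:
    """Describe an embedding by its similarity to each anchor.

    The output length equals the number of anchors, regardless of the
    dimensionality of the original embedding space.
    """
    return [cosine(embedding, anchor) for anchor in anchors]


# Hypothetical anchors: the same two users, embedded once by a
# conversation encoder (2-d) and once by a session encoder (3-d).
conv_anchors = [[1.0, 0.0], [0.0, 1.0]]
web_anchors = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]

# A user with only a conversation and a user with only a session.
conv_only_user = relative_representation([0.9, 0.1], conv_anchors)
web_only_user = relative_representation([0.8, 0.1, 0.1], web_anchors)

# Both users now live in the same 2-d anchor-similarity space, so one
# downstream recommender could consume either representation.
assert len(conv_only_user) == len(web_only_user) == len(conv_anchors)
```

The point of the design is that the two encoders never need a shared output dimension: the anchor set defines the common space, so a user with a single modality can still be represented in it.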

Dataset and Models for Item Recommendation Using Multi-Modal User Interactions
