Enhancing Product Representation with Multi-form Interactions for Multimodal Conversational Recommendation

Wenzhe Du, Su Haoyang, Nguyen Cam-Tu, Jian Sun
DOI: https://doi.org/10.1145/3581783.3613755
2023-01-01
Abstract: Multimodal Conversational Recommendation aims to find appropriate products based on a multi-turn dialogue, where user requests and products can be presented in both visual and textual modalities. While previous studies have focused on understanding user preferences from conversational contexts, product modeling has remained relatively unexplored. This study aims to fill this gap and demonstrates that, along with dialog information, multiple product views and the interactions between them are essential for recommendation. To this end, a product image is first encoded using a gated multi-view image encoder, which yields representations for the global and local views. On the textual side, two views are considered: the structure view (product attributes) and the sequence view (product description/reviews). Two forms of inter-modal interaction for product representation are then modeled: interactions between the global image view and the textual structure view, and interactions between the local image view and the textual sequence view. Furthermore, the representation is enhanced to attend to the latest user request in the dialog context, resulting in a query-aware product representation. The experimental results indicate that our method, named Enteract, achieves state-of-the-art performance on two well-known datasets (MMD and SIMMC).
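
The two interaction forms and the query-aware step described in the abstract can be illustrated with a toy sketch. This is not the Enteract implementation: the abstract does not specify the interaction operator, so plain scaled dot-product attention with a residual sum is assumed here, the gated multi-view image encoder is skipped entirely (the view vectors are taken as given), and all function and variable names (`attend`, `product_representation`, `global_img`, and so on) are hypothetical.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def attend(query, keys):
    # Assumed operator: scaled dot-product attention of one query vector
    # over a list of key vectors, returning the weighted sum of the keys.
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(keys[0])
    return [sum(w * k[i] for w, k in zip(weights, keys)) for i in range(dim)]

def product_representation(global_img, local_imgs, attr_vecs, text_vecs, query):
    # Interaction 1: global image view <-> textual structure view (attributes).
    global_struct = add(global_img, attend(global_img, attr_vecs))
    # Interaction 2: local image view <-> textual sequence view (description tokens).
    local_seq = [add(region, attend(region, text_vecs)) for region in local_imgs]
    # Query-aware step: the latest user request attends over the fused views.
    fused = [global_struct] + local_seq
    return attend(query, fused)

# Toy 4-dimensional inputs (made-up numbers, for illustration only).
global_img = [1.0, 0.0, 0.0, 0.0]
local_imgs = [[0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
attr_vecs = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
text_vecs = [[0.5, 0.5, 0.5, 0.5], [0.0, 1.0, 1.0, 0.0]]
query = [0.0, 1.0, 0.0, 1.0]

rep = product_representation(global_img, local_imgs, attr_vecs, text_vecs, query)
print(len(rep))
```

The result is a single vector with the same dimensionality as the inputs, which a recommender could then score against the dialog context; how that scoring is done is outside what the abstract states.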