From Abstract to Details

Fangxiong Xiao, Lixi Deng, Jingjing Chen, Houye Ji, Xiaorui Yang, Zhuoye Ding, Bo Long
DOI: https://doi.org/10.1145/3503161.3548366
2022-01-01
Abstract: In e-commerce recommendation, Click-Through Rate (CTR) prediction has been extensively studied in both academia and industry to enhance user experience and platform benefits. At present, the most popular CTR prediction methods are concatenation-based models that represent items by simply merging multiple heterogeneous features, including ID, visual, and text features, into one large vector. Because these heterogeneous modalities have moderately different properties, directly concatenating them without mining their correlations and reducing their redundancy is unlikely to achieve optimal fusion results. In addition, these concatenation-based models treat all modalities equally for every user and overlook the fact that, in real scenarios, users pay unequal attention to the different modalities when browsing items. To address these issues, this paper proposes a generative multimodal fusion framework (GMMF) for the CTR prediction task. To eliminate redundancy and strengthen the complementarity of multimodal features, GMMF generates new visual and text representations with a Difference-Set network (DSN). These representations do not overlap with the information conveyed by the ID embedding. Specifically, DSN maps the ID embedding into the visual and text modalities and depicts the difference between the modalities based on their properties. GMMF also learns unequal weights for the modalities with a Modal-Interest network (MIN), which models users' preferences over heterogeneous modalities. These weights reflect the usual habits and hobbies of users. Finally, we conduct extensive experiments on both public and collected industrial datasets, and the results show that GMMF greatly improves prediction performance and achieves state-of-the-art results.
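The fusion idea in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not specify the DSN or MIN architectures, so the sketch assumes a linear projection followed by a subtraction for the "difference-set" step and a softmax over a linear map of a user embedding for the modality weights. All names (`difference_set`, `modal_interest`, `fuse`, the dimension `d`) and the random inputs are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # hypothetical shared embedding size for every modality


def softmax(x):
    # Numerically stable softmax over a 1-D vector.
    e = np.exp(x - x.max())
    return e / e.sum()


def difference_set(id_emb, modal_emb, proj):
    # Assumed stand-in for the DSN: map the ID embedding into the modality
    # space and subtract it, keeping what the ID embedding does not explain.
    return modal_emb - proj @ id_emb


def modal_interest(user_emb, w_user):
    # Assumed stand-in for the MIN: one weight per modality (ID, visual, text),
    # derived from the user embedding and normalized to sum to 1.
    return softmax(w_user @ user_emb)


def fuse(id_emb, vis_emb, txt_emb, user_emb, p_vis, p_txt, w_user):
    vis_new = difference_set(id_emb, vis_emb, p_vis)
    txt_new = difference_set(id_emb, txt_emb, p_txt)
    weights = modal_interest(user_emb, w_user)
    parts = [id_emb, vis_new, txt_new]
    # Scale each modality by its user-specific weight, then concatenate.
    fused = np.concatenate([w * part for w, part in zip(weights, parts)])
    return fused, weights


# Random placeholders for learned embeddings and parameters.
id_emb, vis_emb, txt_emb, user_emb = rng.normal(size=(4, d))
p_vis, p_txt = rng.normal(size=(2, d, d))
w_user = rng.normal(size=(3, d))

baseline = np.concatenate([id_emb, vis_emb, txt_emb])  # plain concatenation
fused, weights = fuse(id_emb, vis_emb, txt_emb, user_emb, p_vis, p_txt, w_user)
```

Here `baseline` corresponds to the concatenation-based representation that the abstract criticizes, while `fused` has the same length but carries de-duplicated visual and text parts scaled by per-user weights. In a real model the projections and weights would be learned jointly with the CTR objective rather than sampled at random.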