Real20M: A Large-scale E-commerce Dataset for Cross-domain Retrieval

Yanzhe Chen, Huasong Zhong, Xiangteng He, Yuxin Peng, Lele Cheng
DOI: https://doi.org/10.1145/3581783.3612408
2023-01-01
Abstract: In e-commerce, products and micro-videos serve as two primary carriers. Introducing cross-domain retrieval between these carriers can establish associations, thereby leading to the advancement of specific scenarios, such as retrieving products based on micro-videos or recommending relevant videos based on products. However, existing datasets only focus on retrieval within the product domain while neglecting the micro-video domain and often ignore the multi-modal characteristics of the product domain. Additionally, these datasets strictly limit their data scale through content alignment and use a content-based data organization format that hinders the inclusion of user retrieval intentions. To address these limitations, we propose the PKU Real20M dataset, a large-scale e-commerce dataset designed for cross-domain retrieval. We adopt a query-driven approach to efficiently gather over 20 million e-commerce products and micro-videos, including multimodal information. Additionally, we design a three-level entity prompt learning framework to align inter-modality information from coarse to fine. Moreover, we introduce the Query-driven Cross-Domain retrieval framework (QCD), which leverages user queries to facilitate efficient alignment between the product and micro-video domains. Extensive experiments on two downstream tasks validate the effectiveness of our proposed approaches. The dataset and source code are available at https://github.com/PKU-ICST-MIPL/Real20M_ACMMM2023.