ASR-enhanced Multimodal Representation Learning for Cross-Domain Product Retrieval

Ruixiang Zhao, Jian Jia, Yan Li, Xuehan Bai, Quan Chen, Han Li, Peng Jiang, Xirong Li
2024-08-06
Abstract: E-commerce is increasingly multimedia-enriched, with products exhibited in a broad-domain manner as images, short videos, or live stream promotions. A unified and vectorized cross-domain product representation is essential. Due to large intra-product variance and high inter-product similarity in the broad-domain scenario, a visual-only representation is inadequate. While Automatic Speech Recognition (ASR) text derived from the short or live-stream videos is readily accessible, how to de-noise the excessively noisy text for multimodal representation learning is mostly untouched. We propose ASR-enhanced Multimodal Product Representation Learning (AMPere). In order to extract product-specific information from the raw ASR text, AMPere uses an easy-to-implement LLM-based ASR text summarizer. The LLM-summarized text, together with visual data, is then fed into a multi-branch network to generate compact multimodal embeddings. Extensive experiments on a large-scale tri-domain dataset verify the effectiveness of AMPere in obtaining a unified multimodal product representation that clearly improves cross-domain product retrieval.
Multimedia, Artificial Intelligence, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses how to use Automatic Speech Recognition (ASR) text to enhance multimodal product representation learning for Cross-domain Product Retrieval (CdPR). Products on e-commerce platforms are increasingly presented through multimedia, including images, short videos, and live streams. A visual-only representation is insufficient for a unified cross-domain product representation because of large intra-product variance and high inter-product similarity. Although ASR text obtained from short videos or live-stream videos is readily accessible, it usually contains a large amount of irrelevant information and noise, which makes it hard to use directly for multimodal representation learning.

The paper proposes a new method, ASR-enhanced Multimodal Product Representation Learning (AMPere), which tackles the problem as follows:

1. **ASR Text Denoising**: Use an ASR text summarizer based on a large language model (LLM) to extract product-specific information from the raw ASR text.
2. **Multimodal Fusion**: Feed the summarized ASR text and the visual data into a multi-branch network to generate a compact multimodal embedding vector.
3. **Experimental Verification**: Conduct extensive experiments on the large-scale tri-domain dataset ROPE to verify that AMPere clearly improves cross-domain product retrieval.

### Formula Representation

- Let \(x\) be a specific product sample, whose instances in the three domains are \(x_p\) (product page), \(x_s\) (short video), and \(x_l\) (live-stream video), respectively.
- For the multimodal scenario, the image \(x_p\) is associated with a title \(t_p\), while the two videos \(x_s\) and \(x_l\) are associated with ASR texts \(t_s\) and \(t_l\), respectively.
- The goal is to train a feature extraction network \(F\) to encode a given sample into a \(d\)-dimensional embedding \(e(x)\).

The multimodal fusion process can be formalized as:

\[
\begin{aligned}
&\{w_1, \ldots, w_m\} \leftarrow \text{text-to-tokens}(t, m = 32), \\
&\{y_0, y_1, \ldots, y_m\} \leftarrow \text{RoBERTa}(\{w_1, \ldots, w_m\}), \\
&\{\hat{y}_0, \hat{y}_1, \ldots, \hat{y}_m\} \leftarrow \text{Linear}_{768\times512}(\{y_0, y_1, \ldots, y_m\}), \\
&\{\hat{v}(x), \hat{z}_1, \ldots, \hat{z}_n\} \leftarrow \text{Linear}_{512\times512}(\{v(x), z_1, \ldots, z_n\}), \\
&\{\bar{v}(x), \ldots, \bar{z}_n, \bar{y}_0, \ldots, \bar{y}_m\} \leftarrow \text{Trans}(\{\hat{v}(x), \ldots, \hat{z}_n, \hat{y}_0, \ldots, \hat{y}_m\}), \\
&e(x, t) \leftarrow \text{Linear}_{512\times128}(\bar{v}(x)+\bar{y}_0).
\end{aligned}
\]

In this way, the AMPere method can effectively integrate ASR text and visual information, thereby improving the effect of cross-domain product retrieval.
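
To make the tensor shapes in the fusion formula concrete, the following is a minimal NumPy sketch of the data flow, not the authors' implementation. It makes several assumptions: the weights are random rather than trained, a single weight-free self-attention step stands in for the `Trans` block (whose depth and head count are not given here), the RoBERTa outputs and visual features are replaced by random arrays, and the number of visual tokens `n = 49` is a hypothetical value. The names `linear`, `self_attention`, and `fuse` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)


def linear(d_in, d_out):
    """Return a randomly initialised affine map (stand-in for a trained Linear layer)."""
    w = rng.normal(0.0, 0.02, size=(d_in, d_out))
    b = np.zeros(d_out)
    return lambda x: x @ w + b


def self_attention(tokens):
    """Single-head self-attention without learned weights.

    Assumption: a simplified placeholder for the Trans block in the formula.
    """
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return tokens + weights @ tokens  # residual connection


text_proj = linear(768, 512)    # Linear_{768x512} on the text tokens
visual_proj = linear(512, 512)  # Linear_{512x512} on the visual tokens
out_proj = linear(512, 128)     # Linear_{512x128} producing e(x, t)


def fuse(visual_tokens, text_tokens):
    """Fuse visual and text tokens into one 128-d embedding.

    visual_tokens: (1 + n, 512) array holding [v(x), z_1, ..., z_n]
    text_tokens:   (1 + m, 768) array holding [y_0, y_1, ..., y_m]
    """
    v_hat = visual_proj(visual_tokens)
    y_hat = text_proj(text_tokens)
    fused = self_attention(np.concatenate([v_hat, y_hat], axis=0))
    v_bar = fused[0]                # contextualised global visual token
    y0_bar = fused[v_hat.shape[0]]  # contextualised first text token y_0
    return out_proj(v_bar + y0_bar)


n, m = 49, 32  # m = 32 as in the formula; n = 49 is a hypothetical choice
visual = rng.normal(size=(1 + n, 512))  # stand-in for visual features
text = rng.normal(size=(1 + m, 768))    # stand-in for RoBERTa outputs
e = fuse(visual, text)
print(e.shape)  # (128,)
```

Only the two global tokens, \(\bar{v}(x)\) and \(\bar{y}_0\), feed the final projection, so the embedding size stays at 128 regardless of how many visual or text tokens enter the fusion step.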