Abstract: E-commerce is increasingly multimedia-enriched, with products exhibited in a broad-domain manner as images, short videos, or live stream promotions. A unified and vectorized cross-domain product representation is essential. Due to large intra-product variance and high inter-product similarity in the broad-domain scenario, a visual-only representation is inadequate. While Automatic Speech Recognition (ASR) text derived from short or live-stream videos is readily accessible, how to de-noise the excessively noisy text for multimodal representation learning remains mostly unexplored. We propose ASR-enhanced Multimodal Product Representation Learning (AMPere). To extract product-specific information from the raw ASR text, AMPere uses an easy-to-implement LLM-based ASR text summarizer. The LLM-summarized text, together with visual data, is then fed into a multi-branch network to generate compact multimodal embeddings. Extensive experiments on a large-scale tri-domain dataset verify the effectiveness of AMPere in obtaining a unified multimodal product representation that clearly improves cross-domain product retrieval.

What problem does this paper attempt to address?

The paper addresses how to use Automatic Speech Recognition (ASR) text to enhance multimodal product representation learning for Cross-domain Product Retrieval (CdPR). Products on e-commerce platforms are increasingly presented through multimedia, including images, short videos, and live streams. Visual information alone is insufficient for a unified cross-domain product representation because intra-product variance is large and inter-product similarity is high. ASR text obtained from short videos or live-stream videos is readily accessible, but it usually contains a large amount of irrelevant, noisy content and is difficult to use directly for multimodal representation learning.

The paper proposes ASR-enhanced Multimodal Product Representation Learning (AMPere), which tackles the problem as follows:

1. **ASR Text Denoising**: Use an ASR text summarizer based on a large language model (LLM) to extract product-specific information from the raw ASR text.
2. **Multimodal Fusion**: Feed the summarized ASR text, together with the visual data, into a multi-branch network that generates a compact multimodal embedding vector.
3. **Experimental Verification**: Conduct extensive experiments on the large-scale tri-domain dataset ROPE to verify that AMPere clearly improves cross-domain product retrieval.

### Formula Representation

- Let \(x\) be a specific product sample whose instances in the three domains are \(x_p\) (product page), \(x_s\) (short video), and \(x_l\) (live-stream video).
- In the multimodal scenario, the image \(x_p\) is associated with a title \(t_p\), while the two videos \(x_s\) and \(x_l\) are associated with ASR texts \(t_s\) and \(t_l\), respectively.
- The goal is to train a feature extraction network \(F\) that encodes a given sample into a \(d\)-dimensional embedding \(e(x)\).

The multimodal fusion process can be formalized as follows, where \(v(x), z_1, \ldots, z_n\) are the visual features of \(x\) and \(y_0, \ldots, y_m\) are the RoBERTa outputs for the text \(t\):

\[
\begin{aligned}
&\{w_1, \ldots, w_m\} \leftarrow \text{text-to-tokens}(t, m = 32), \\
&\{y_0, y_1, \ldots, y_m\} \leftarrow \text{RoBERTa}(\{w_1, \ldots, w_m\}), \\
&\{\hat{y}_0, \hat{y}_1, \ldots, \hat{y}_m\} \leftarrow \text{Linear}_{768\times512}(\{y_0, y_1, \ldots, y_m\}), \\
&\{\hat{v}(x), \hat{z}_1, \ldots, \hat{z}_n\} \leftarrow \text{Linear}_{512\times512}(\{v(x), z_1, \ldots, z_n\}), \\
&\{\bar{v}(x), \ldots, \bar{z}_n, \bar{y}_0, \ldots, \bar{y}_m\} \leftarrow \text{Trans}(\{\hat{v}(x), \ldots, \hat{z}_n, \hat{y}_0, \ldots, \hat{y}_m\}), \\
&e(x, t) \leftarrow \text{Linear}_{512\times128}(\bar{v}(x)+\bar{y}_0).
\end{aligned}
\]

In this way, AMPere integrates ASR text with visual information to improve cross-domain product retrieval.
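The fusion steps above can be traced as a shape-level sketch in NumPy. This is only an illustration under stated assumptions, not the authors' implementation. The weights are random rather than trained, and a single weight-free self-attention layer stands in for \(\text{Trans}\). Random arrays replace the real RoBERTa and visual-encoder outputs. The number of visual tokens (`n = 49`), the final L2 normalization, and the cosine-similarity ranking are assumptions. The names `init_params`, `fuse`, and `rank_gallery` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_params():
    """Random stand-ins for the three trained linear layers in the formula."""
    def linear(d_in, d_out):
        return rng.normal(0.0, 0.02, size=(d_in, d_out)), np.zeros(d_out)
    return {
        "text_proj": linear(768, 512),    # Linear_{768x512}
        "visual_proj": linear(512, 512),  # Linear_{512x512}
        "out_proj": linear(512, 128),     # Linear_{512x128}
    }

def apply_linear(x, layer):
    weight, bias = layer
    return x @ weight + bias

def self_attention(tokens):
    """Placeholder for Trans: one weight-free self-attention layer with a residual."""
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return tokens + weights @ tokens

def fuse(visual, text, params):
    """visual: (1 + n, 512), rows [v(x), z_1..z_n]; text: (1 + m, 768), rows [y_0..y_m]."""
    v_hat = apply_linear(visual, params["visual_proj"])  # (1 + n, 512)
    y_hat = apply_linear(text, params["text_proj"])      # (1 + m, 512)
    fused = self_attention(np.concatenate([v_hat, y_hat], axis=0))
    v_bar = fused[0]                  # fused global visual token
    y0_bar = fused[visual.shape[0]]   # fused first text token
    e = apply_linear(v_bar + y0_bar, params["out_proj"])  # (128,)
    return e / np.linalg.norm(e)      # assumption: unit-normalize for cosine retrieval

def rank_gallery(query, gallery):
    """Return gallery indices sorted by descending cosine similarity to the query."""
    return np.argsort(-(gallery @ query))

params = init_params()
n, m = 49, 32  # n is an assumption; m = 32 follows the token limit in the formula
visual = rng.normal(size=(1 + n, 512))  # stand-in for visual encoder outputs
text = rng.normal(size=(1 + m, 768))    # stand-in for RoBERTa outputs
embedding = fuse(visual, text, params)
print(embedding.shape)
```

Because every domain (product page, short video, live stream) would be mapped into the same 128-dimensional space, cross-domain retrieval reduces to a nearest-neighbor search, which `rank_gallery` illustrates with a plain dot product over unit-length vectors.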

ASR-enhanced Multimodal Representation Learning for Cross-Domain Product Retrieval

Multiple Kernel Visual-Auditory Representation Learning for Retrieval

Cross-Domain Product Representation Learning for Rich-Content E-Commerce

Learning Explicit and Implicit Latent Common Spaces for Audio-Visual Cross-Modal Retrieval

Unified Vision-Language Representation Modeling for E-Commerce Same-Style Products Retrieval

MRSE: An Efficient Multi-modality Retrieval System for Large Scale E-commerce

Adversarial Cross-Modal Retrieval

Zero-shot Cross-modal Retrieval by Assembling AutoEncoder and Generative Adversarial Network

Cross-view Semantic Alignment for Livestreaming Product Recognition

Entity-Graph Enhanced Cross-Modal Pretraining for Instance-level Product Retrieval

Product1M: Towards Weakly Supervised Instance-Level Product Retrieval via Cross-Modal Pretraining

ACE-BERT: Adversarial Cross-modal Enhanced BERT for E-commerce Retrieval

ACE: A Generative Cross-Modal Retrieval Framework with Coarse-To-Fine Semantic Modeling

MMAPS: End-to-End Multi-Grained Multi-Modal Attribute-Aware Product Summarization

Aspect-Aware Multimodal Summarization for Chinese E-Commerce Products

Real20M: A Large-scale E-commerce Dataset for Cross-domain Retrieval

Deep Multi-Graph Hierarchical Enhanced Semantic Representation for Cross-Modal Retrieval

Adversarial Multi-Grained Embedding Network for Cross-Modal Text-Video Retrieval

Multi-grained Representation Learning for Cross-modal Retrieval

Deep Multigraph Hierarchical Enhanced Semantic Representation for Cross-Modal Retrieval

Scalable Deep Multimodal Learning for Cross-Modal Retrieval