Abstract: With the rapid expansion of e-commerce, more consumers have become accustomed to making purchases via livestreaming. Accurately identifying the products being sold by salespeople, i.e., livestreaming product retrieval (LPR), poses a fundamental and daunting challenge. The LPR task encompasses three primary dilemmas in real-world scenarios: 1) recognizing the intended products among distractor products present in the background; 2) video-image heterogeneity, in that the appearance of products showcased in live streams often deviates substantially from the standardized product images in stores; 3) the presence of numerous confusing products with subtle visual nuances in the shop. To tackle these challenges, we propose the Spatiotemporal Graphing Multi-modal Network (SGMN). First, we employ a text-guided attention mechanism that leverages the spoken content of salespeople to guide the model to focus on the intended products, emphasizing their salience over cluttered background products. Second, a long-range spatiotemporal graph network is designed to achieve both instance-level interaction and frame-level matching, resolving the misalignment caused by video-image heterogeneity. Third, we propose a multi-modal hard example mining strategy that helps the model distinguish highly similar products using fine-grained features across the video-image-text domain. Through extensive quantitative and qualitative experiments, we demonstrate the superior performance of our proposed SGMN model, which surpasses state-of-the-art methods by a substantial margin. The code is available at https://github.com/Huxiaowan/SGMN.

What problem does this paper attempt to address?

This paper addresses product retrieval in livestream selling, namely **Livestreaming Product Retrieval (LPR)**. Specifically, the LPR task faces three main challenges in practical applications:

1. **Identifying the target product among background distractor products**: During a live stream, the salesperson usually displays multiple products but mainly promotes one specific product. Accurately picking out that target product from the others in the frame is difficult.
2. **Heterogeneity between video and image**: The appearance of a product shown in the livestream video often differs considerably from the standardized product image in the store, which makes video-to-image matching difficult.
3. **A large number of visually similar products**: Online stores contain many products that look almost identical, which places high demands on the model's fine-grained feature learning ability.

To address these challenges, the authors propose a new method, the **Spatiotemporal Graph Guided Multi-modal Network (SGMN)**. Its main contributions are as follows:

1. **Text-guided attention mechanism**: The salesperson's verbal explanations during the live stream, obtained through Automatic Speech Recognition (ASR) transcription, together with image captions, guide the model to focus on the products most relevant to the verbal context, reducing background interference.
2. **Graph-based cross-domain interaction module**: A graph structure captures the spatiotemporal correlations between video and image. The authors describe this as the first exploration of a sequence-to-sequence graph learning method for modeling and enhancing temporal consistency and spatial correlation across domains.
3. **Selective multi-modal fusion module**: The top K hard cases in the global ranking are selected and their visual and text representations are fused, which implicitly recalibrates the ranking and separates semantically different candidates, improving the ability to recognize products with subtle visual differences.

Through extensive quantitative and qualitative experiments, the authors demonstrate the superior performance of the proposed SGMN model on a large-scale benchmark dataset, where it significantly outperforms existing state-of-the-art methods.
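Two of the ideas summarized above, text-guided attention and top-K selective fusion, can be illustrated with a toy sketch. This is not the authors' implementation (their code is in the linked repository); it only assumes that each product instance and each text snippet is already embedded as a plain list of floats. The function names `text_guided_pool` and `rerank_top_k`, the mixing weight `alpha`, and the example vectors are all hypothetical. The graph module is not covered here.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity; assumes neither vector is all zeros."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def text_guided_pool(text_vec, instance_vecs):
    """Weight each detected instance by its scaled dot-product similarity
    to the text embedding (hypothetical stand-in for text-guided attention).
    Returns the attention weights and the weighted-average feature."""
    scale = math.sqrt(len(text_vec))
    weights = softmax([dot(text_vec, v) / scale for v in instance_vecs])
    pooled = [
        sum(w * v[i] for w, v in zip(weights, instance_vecs))
        for i in range(len(text_vec))
    ]
    return weights, pooled


def rerank_top_k(query_vis, query_txt, gallery, k, alpha=0.5):
    """Rank gallery items by visual similarity, then re-score only the
    top-k (the visually confusable candidates) with a fused visual+text
    score. `gallery` holds (name, visual_vec, text_vec) tuples; `alpha`
    is an assumed mixing weight, not a value from the paper."""
    ranked = sorted(gallery, key=lambda g: cosine(query_vis, g[1]), reverse=True)
    head, tail = ranked[:k], ranked[k:]
    head = sorted(
        head,
        key=lambda g: alpha * cosine(query_vis, g[1])
        + (1 - alpha) * cosine(query_txt, g[2]),
        reverse=True,
    )
    return [g[0] for g in head + tail]


# Toy data: the text embedding points toward the first (intended) instance.
weights, pooled = text_guided_pool([1.0, 0.0], [[0.9, 0.1], [0.1, 0.9]])

# Two near-identical products visually; only the text separates them.
gallery = [
    ("red_mug_a", [1.0, 0.05], [1.0, 0.0]),
    ("red_mug_b", [1.0, 0.10], [0.0, 1.0]),
    ("blue_hat", [0.0, 1.0], [1.0, 0.0]),
]
order = rerank_top_k([1.0, 0.0], [0.0, 1.0], gallery, k=2)
print(weights, order)
```

In this toy setup the visual-only ranking puts `red_mug_a` first by a small margin, while fusing in the text similarity for the top two candidates moves `red_mug_b` ahead, which is the kind of recalibration the summary attributes to the selective fusion module.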

Spatiotemporal Graph Guided Multi-modal Network for Livestreaming Product Retrieval
