Abstract: E-commerce platforms benefit from accurate product understanding to enhance user experience and operational efficiency. Traditional methods often focus on isolated tasks such as attribute extraction or categorization, posing adaptability issues to evolving tasks and leading to usability challenges with noisy data from the internet. Current Large Vision Language Models (LVLMs) lack domain-specific fine-tuning, thus falling short in precision and instruction following. To address these issues, we introduce PumGPT, the first e-commerce specialized LVLM designed for multi-modal product understanding tasks. We collected and curated a dataset of over one million products from AliExpress, filtering out non-inferable attributes using a universal hallucination detection framework, resulting in 663k high-quality data samples. PumGPT focuses on five essential tasks aimed at enhancing workflows for e-commerce platforms and retailers. We also introduce PumBench, a benchmark to evaluate product understanding across LVLMs. Our experiments show that PumGPT outperforms five other open-source LVLMs and GPT-4V in product understanding tasks. We also conduct extensive analytical experiments to delve deeply into the superiority of PumGPT, demonstrating the necessity for a specialized model in the e-commerce domain.

What problem does this paper attempt to address?

This paper addresses product understanding on e-commerce platforms. It points out that current methods usually focus on isolated tasks, such as attribute extraction or classification, which leads to poor adaptability to new tasks and products and to usability problems when dealing with noisy data from the internet. In addition, existing large vision-language models (LVLMs) lack domain-specific fine-tuning and therefore fall short in precision and instruction following. These problems limit the use of such models in real e-commerce scenarios.

To solve these problems, the paper introduces PUMGPT, an e-commerce-specialized large vision-language model designed for multi-modal product-understanding tasks. The main contributions include:

1. **Proposing PUMGPT**: The first large vision-language model trained specifically for product-understanding tasks in e-commerce, using a high-quality product dataset of about 663,000 samples that has been filtered for hallucinations.
2. **Developing a general hallucination-detection framework**: Multi-expert collaboration is used to detect and filter inconsistent attributes in the dataset without human intervention.
3. **Extensive experimental verification**: Experiments show that PUMGPT outperforms multiple other LVLMs on product-understanding tasks, supporting the need for a specialized large vision-language model in the e-commerce field.

Through these methods, the paper aims to improve product understanding on e-commerce platforms, thereby enhancing user experience and operational efficiency.
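The summary does not describe how the hallucination-detection framework works internally. To make the idea of multi-expert attribute filtering concrete, here is a minimal sketch under stated assumptions: each "expert" is a hypothetical function that votes on whether an attribute value can be inferred from the product's title or from an image caption, and an attribute is kept only if it collects enough votes. The names `title_expert`, `caption_expert`, `filter_attributes`, and `min_votes` are illustrative and are not taken from the paper.

```python
from typing import Callable, Dict, List

# Hypothetical expert signature: (product, attribute name, attribute value) -> vote.
Expert = Callable[[dict, str, str], bool]

def title_expert(product: dict, name: str, value: str) -> bool:
    # Assumption: a value counts as inferable if it appears in the title text.
    return value.lower() in product["title"].lower()

def caption_expert(product: dict, name: str, value: str) -> bool:
    # Assumption: an image caption stands in for a real vision model here.
    return value.lower() in product.get("image_caption", "").lower()

def filter_attributes(product: dict, experts: List[Expert],
                      min_votes: int = 1) -> Dict[str, str]:
    """Keep only attributes that at least `min_votes` experts consider inferable."""
    kept = {}
    for name, value in product["attributes"].items():
        votes = sum(1 for expert in experts if expert(product, name, value))
        if votes >= min_votes:
            kept[name] = value
    return kept

product = {
    "title": "Red cotton T-shirt for men",
    "image_caption": "a red t-shirt on a hanger",
    "attributes": {"color": "Red", "material": "Cotton", "origin": "Mainland China"},
}

cleaned = filter_attributes(product, [title_expert, caption_expert])
print(cleaned)  # "origin" is dropped: neither expert can infer it
```

Raising `min_votes` makes the filter stricter: with `min_votes=2`, only attributes that both toy experts agree on survive. A real pipeline would presumably replace the substring checks with model-based judgments.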

PUMGPT: A Large Vision-Language Model for Product Understanding

ExtractGPT: Exploring the Potential of Large Language Models for Product Attribute Value Extraction

EcomGPT: Instruction-tuning Large Language Models with Chain-of-Task Tasks for E-commerce

Using LLMs for the Extraction and Normalization of Product Attribute Values

SeqGPT: An Out-of-the-box Large Language Model for Open Domain Sequence Understanding

PKGM: A Pre-trained Knowledge Graph Model for E-commerce Application

VisionGPT: Vision-Language Understanding Agent Using Generalized Multimodal Framework

Shopping MMLU: A Massive Multi-Task Online Shopping Benchmark for Large Language Models

LLM-Ensemble: Optimal Large Language Model Ensemble Method for E-commerce Product Attribute Value Extraction

EcomGPT-CT: Continual Pre-training of E-commerce Large Language Models with Semi-structured Data

MAA-PTG: multimodal aspect-aware product title generation

LiLiuM: eBay's Large Language Models for e-commerce

V$^2$L: Leveraging Vision and Vision-language Models into Large-scale Product Retrieval

K-PLUG: Knowledge-injected Pre-trained Language Model for Natural Language Understanding and Generation in E-Commerce

Exploring OCR Capabilities of GPT-4V(ision): A Quantitative and In-depth Evaluation

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

IPL: Leveraging Multimodal Large Language Models for Intelligent Product Listing

On the Promises and Challenges of Multimodal Foundation Models for Geographical, Environmental, Agricultural, and Urban Planning Applications

Towards End-to-End Embodied Decision Making via Multi-modal Large Language Model: Explorations with GPT4-Vision and Beyond

PAM: Understanding Product Images in Cross Product Category Attribute Extraction

CULG: Commercial Universal Language Generation