PUMGPT: A Large Vision-Language Model for Product Understanding

Wei Xue, Zongyi Guo, Baoliang Cui, Zheng Xing, Xiaoyi Zeng, Xiufei Wang, Shuhui Wu, Weiming Lu
2024-06-16
Abstract: E-commerce platforms benefit from accurate product understanding to enhance user experience and operational efficiency. Traditional methods often focus on isolated tasks such as attribute extraction or categorization, posing adaptability issues to evolving tasks and leading to usability challenges with noisy data from the internet. Current Large Vision Language Models (LVLMs) lack domain-specific fine-tuning, thus falling short in precision and instruction following. To address these issues, we introduce PumGPT, the first e-commerce specialized LVLM designed for multi-modal product understanding tasks. We collected and curated a dataset of over one million products from AliExpress, filtering out non-inferable attributes using a universal hallucination detection framework, resulting in 663k high-quality data samples. PumGPT focuses on five essential tasks aimed at enhancing workflows for e-commerce platforms and retailers. We also introduce PumBench, a benchmark to evaluate product understanding across LVLMs. Our experiments show that PumGPT outperforms five other open-source LVLMs and GPT-4V in product understanding tasks. We also conduct extensive analytical experiments to delve deeply into the superiority of PumGPT, demonstrating the necessity for a specialized model in the e-commerce domain.
Computer Vision and Pattern Recognition, Computation and Language
What problem does this paper attempt to address?
This paper addresses product understanding on e-commerce platforms. It points out that current methods usually focus on isolated tasks, such as attribute extraction or classification, which makes them adapt poorly to new tasks and products and hard to use with noisy data from the internet. In addition, existing Large Vision Language Models (LVLMs) lack domain-specific fine-tuning, so they fall short in precision and instruction following. These problems limit how well such models can be applied in real e-commerce scenarios. To address them, the paper introduces PUMGPT, an e-commerce-specialized LVLM designed for multi-modal product-understanding tasks. The paper's main contributions are:
1. **Proposing PUMGPT**: the first LVLM trained specifically for product-understanding tasks in e-commerce, using a high-quality product dataset of about 663,000 samples from which hallucinated (non-inferable) attributes have been filtered out.
2. **Developing a general hallucination-detection framework**: multi-expert collaboration is used to detect and filter inconsistent attributes in the dataset without human intervention.
3. **Extensive experimental verification**: experiments show that PUMGPT outperforms multiple other LVLMs on product-understanding tasks, supporting the need for a specialized LVLM in the e-commerce domain.
With these methods, the paper aims to improve product understanding on e-commerce platforms, thereby enhancing user experience and operational efficiency.
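The idea of filtering attributes through multi-expert collaboration can be illustrated with a minimal sketch. The summary does not describe how the paper's framework actually works, so everything below is an assumption made for illustration: the `Product` type, the two toy experts, and the rule that an attribute is kept only when at least `min_agree` experts judge it inferable. The paper's experts are presumably model-based rather than the string checks used here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Product:
    # Hypothetical product record; field names are illustrative only.
    title: str
    attributes: Dict[str, str]


# An expert returns True if it judges the attribute value inferable from the product.
Expert = Callable[[Product, str, str], bool]


def filter_attributes(product: Product, experts: List[Expert], min_agree: int) -> Dict[str, str]:
    """Keep only attributes that at least `min_agree` experts accept (assumed voting rule)."""
    kept: Dict[str, str] = {}
    for name, value in product.attributes.items():
        votes = sum(1 for expert in experts if expert(product, name, value))
        if votes >= min_agree:
            kept[name] = value
    return kept


def title_mentions_value(product: Product, name: str, value: str) -> bool:
    # Toy expert: the value appears literally in the title.
    return value.lower() in product.title.lower()


def value_is_not_placeholder(product: Product, name: str, value: str) -> bool:
    # Toy expert: the value is not an empty or placeholder string.
    return value.strip().lower() not in {"", "n/a", "other", "none"}


example = Product(
    title="Red cotton T-shirt for men",
    attributes={"color": "Red", "material": "Cotton", "origin": "Mainland China", "style": "N/A"},
)
kept = filter_attributes(example, [title_mentions_value, value_is_not_placeholder], min_agree=2)
print(kept)
```

In this toy setup, "origin" is dropped because nothing in the title supports it, and "style" is dropped because it is a placeholder; requiring agreement from every expert is a strict choice, and a lower `min_agree` would trade precision for coverage.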