Abstract: Automatic recognition of grocery products can be used to improve customer flow at checkouts and reduce labor costs and store losses. Product recognition is, however, a challenging task for machine learning-based solutions due to the large number of products and their variations in appearance. In this work, we tackle the challenge of fine-grained product recognition by first extracting a large dataset from a grocery store containing products that are only differentiable by subtle details. Then, we propose a multimodal product recognition approach that uses product images with extracted OCR text from packages to improve fine-grained recognition of grocery products. We evaluate several image and text models separately and then combine them using different multimodal models of varying complexities. The results show that image and textual information complement each other in multimodal models and enable a classifier with greater recognition performance than unimodal models, especially when the number of training samples is limited. Therefore, this approach is suitable for many different scenarios in which product recognition is used to further improve recognition performance. The dataset can be found at https://github.com/Tubbias/finegrainocr.

What problem does this paper attempt to address?

The paper addresses the automatic recognition of products in grocery stores, and in particular fine-grained product recognition. Specifically, it focuses on improving recognition accuracy for products that differ only subtly in appearance by combining product images with text extracted from the packaging using OCR. The task is challenging for machine-learning solutions for the following reasons:

1. **Large number of products**: Large supermarkets usually carry thousands of different products, and the number of product types in large shopping malls may reach hundreds of thousands.
2. **Unbalanced sales volume**: Some products are sold in large quantities every day, while others may be sold only a few times a week.
3. **Frequent product updates**: Retailers add or remove products every week.
4. **Subtle visual differences**: Many products differ only in subtle visual details, which makes them difficult to distinguish.

To address these challenges, the paper proposes a multimodal product recognition method that combines image and text information. The method improves fine-grained product recognition and performs particularly well when the number of training samples is limited. The paper also introduces a new dataset (FineGrainOCR) containing high-resolution images and detailed text information, which aims to address shortcomings of existing datasets such as the lack of high-resolution images, a single product orientation, and a limited number of categories.
In summary, the main contributions of this paper are:

- Introducing a multimodal dataset (FineGrainOCR) for fine-grained grocery product recognition, which has the following characteristics:
  - Subtle differences between products
  - Multiple product orientations
  - High-resolution images
  - Detailed packaging text information
  - A large number of categories and samples
- Proposing a multimodal product recognition method that combines image and text information and significantly outperforms single-modal models.
- Providing, based on extensive experimental evaluation, recommendations and trade-off analyses on how to implement and deploy multimodal product recognition methods.
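The idea of combining an image model with an OCR-text model can be illustrated with a minimal late-fusion sketch. This is only one simple way to merge two modalities and is not necessarily any of the multimodal models evaluated in the paper. The function names, the `image_weight` parameter, the class names, and the logit values are all hypothetical and chosen purely for illustration.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(image_logits, text_logits, image_weight=0.5):
    """Weighted average of the per-class probabilities of two models.

    `image_weight` is a hypothetical tuning parameter: 1.0 uses only the
    image model, 0.0 uses only the text model.
    """
    if len(image_logits) != len(text_logits):
        raise ValueError("both models must score the same set of classes")
    p_img = softmax(image_logits)
    p_txt = softmax(text_logits)
    return [
        image_weight * pi + (1.0 - image_weight) * pt
        for pi, pt in zip(p_img, p_txt)
    ]


def predict(image_logits, text_logits, image_weight=0.5):
    """Return the index of the class with the highest fused probability."""
    fused = late_fusion(image_logits, text_logits, image_weight)
    return max(range(len(fused)), key=fused.__getitem__)


# Made-up example: two milk cartons look almost identical, but the OCR
# text ("3%") lets the text model tell them apart.
CLASSES = ["milk 1.5% fat", "milk 3% fat", "oat drink"]
image_logits = [2.0, 1.9, -1.0]  # image model barely separates the milks
text_logits = [0.1, 2.5, 0.0]    # text model is confident about "3%"

image_only = predict(image_logits, text_logits, image_weight=1.0)
fused = predict(image_logits, text_logits, image_weight=0.5)
print("image only:", CLASSES[image_only])
print("fused:     ", CLASSES[fused])
```

In this toy case the image model alone slightly prefers the wrong milk variant, while the fused prediction follows the text evidence. In practice the weight would have to be tuned on validation data, and a learned fusion layer is a common alternative to a fixed average.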

Multimodal fine-grained grocery product recognition using image and OCR text
