Multimodal fine-grained grocery product recognition using image and OCR text

Pettersson, Tobias; Riveiro, Maria; Löfström, Tuwe
DOI: https://doi.org/10.1007/s00138-024-01549-9
IF: 2.983
2024-06-08
Machine Vision and Applications
Abstract: Automatic recognition of grocery products can be used to improve customer flow at checkouts and reduce labor costs and store losses. Product recognition is, however, a challenging task for machine learning-based solutions due to the large number of products and their variations in appearance. In this work, we tackle the challenge of fine-grained product recognition by first extracting a large dataset from a grocery store containing products that are only differentiable by subtle details. Then, we propose a multimodal product recognition approach that uses product images with extracted OCR text from packages to improve fine-grained recognition of grocery products. We evaluate several image and text models separately and then combine them using different multimodal models of varying complexities. The results show that image and textual information complement each other in multimodal models and enable a classifier with greater recognition performance than unimodal models, especially when the number of training samples is limited. Therefore, this approach is suitable for many different scenarios in which product recognition is used to further improve recognition performance. The dataset can be found at https://github.com/Tubbias/finegrainocr.
Keywords: computer science, cybernetics; computer science, artificial intelligence; engineering, electrical & electronic
What problem does this paper attempt to address?
This paper addresses the automatic recognition of products in grocery stores, and in particular fine-grained product recognition. Specifically, it focuses on improving recognition accuracy for products with subtle differences in appearance by combining product images with text extracted from the packaging using OCR. The task is challenging for machine-learning solutions for the following reasons:

1. **Large number of products**: Large supermarkets usually have thousands of different products, and the number of product types in large shopping malls may reach hundreds of thousands.
2. **Unbalanced sales volume**: Some products are sold in large quantities every day, while others may be sold only a few times a week.
3. **Frequent product updates**: Retailers add or remove products every week.
4. **Subtle visual differences**: Many products differ only in subtle visual details, which makes them difficult to distinguish.

To address these challenges, the paper proposes a multimodal product recognition method that combines image and text information. The method improves fine-grained recognition performance and is particularly effective when the number of training samples is limited. The paper also introduces a new dataset, FineGrainOCR, which contains high-resolution images and detailed text information and is intended to address limitations of existing datasets, such as the lack of high-resolution images, a single product orientation, and a limited number of categories.
In summary, the main contributions of this paper are:

- A multimodal dataset (FineGrainOCR) for fine-grained grocery product recognition, with the following characteristics:
  - Subtle differences between products
  - Multiple product orientations
  - High-resolution images
  - Detailed packaging text information
  - A large number of categories and samples
- A multimodal product recognition method that combines image and text information and significantly outperforms unimodal models.
- Suggestions and trade-off analyses, based on extensive experimental evaluation, on how to implement and deploy multimodal product recognition methods.
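
To make the idea of combining image and text information concrete, here is a minimal sketch of one simple fusion scheme: a weighted average of the class probabilities from an image model and a text model (score-level late fusion). This is an illustration only, not the paper's implementation; the summary states only that several multimodal models of varying complexity were evaluated. The function names, the `image_weight` parameter, and all score values are assumptions made for this example.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(image_scores, text_scores, image_weight=0.5):
    """Weighted average of per-class probabilities from two unimodal models.

    image_weight is a hypothetical tuning parameter in [0, 1]; the text
    model receives the remaining weight.
    """
    if len(image_scores) != len(text_scores):
        raise ValueError("both models must score the same set of classes")
    p_image = softmax(image_scores)
    p_text = softmax(text_scores)
    return [
        image_weight * pi + (1.0 - image_weight) * pt
        for pi, pt in zip(p_image, p_text)
    ]


# Made-up scores for three visually similar products (illustrative only).
# The image model can barely separate the first two classes, while the
# OCR-text model clearly favors the second.
image_scores = [2.0, 1.9, 0.1]
text_scores = [0.2, 3.0, 0.1]

fused = late_fusion(image_scores, text_scores)
image_only = image_scores.index(max(image_scores))
predicted = fused.index(max(fused))
```

With these made-up scores the image model alone narrowly picks the first product, while the fused prediction picks the second, which shows how packaging text can resolve a near-tie between visually similar items. In practice the weight would be tuned on validation data, and a learned fusion model could replace the fixed average.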