GRVT: Toward Effective Grocery Recognition Via Vision Transformer

Shu Liu, Xiaoyu Wang, Chengzhang Zhu, Beiji Zou
DOI: https://doi.org/10.1007/978-3-031-23473-6_21
2022-01-01
Abstract: Grocery recognition aims to classify items from the visual features of an image, with the goal of improving the retail experience, managing inventory, and helping visually impaired people. It is an important task in computer vision. Most previous works use convolutional neural network (CNN) models that recognize groceries and products from global image features with a unique decision rule. Such methods vary the CNN architecture to obtain more accurate and representative features, but they do not consider fine-grained characteristics during feature extraction. Recently, vision transformer (ViT) models have achieved success in multiple computer vision tasks, and fine-grained visual categorization now leverages the self-attention mechanism of ViT to learn discriminative regions and features. In this paper, we propose a novel ViT-based framework named grocery recognition vision transformer (GRVT). It integrates patches at multiple granularity scales through multi-scale patch embedding, producing a robust image representation without incurring excessive computation cost. A mixed attention selection module guides the network to select discriminative patches and crucial regions for fine-grained feature extraction. GRVT achieves state-of-the-art performance on the Freiburg Groceries Dataset and the Grocery Store Dataset.
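The two ideas named in the abstract, tokenizing an image at several patch granularities and then keeping only the most informative tokens, can be illustrated with a toy sketch. This is not the GRVT implementation: the abstract gives no architectural details, so the patch sizes, the 8x8 toy image, the use of mean patch intensity as a stand-in for attention weights, and the helper names `split_into_patches`, `multi_scale_tokens`, and `select_top_k` are all assumptions made for illustration.

```python
# Illustrative sketch only. Patch sizes, image size, scoring rule, and all
# function names are assumptions, not details taken from the GRVT paper.

def split_into_patches(image, patch_size):
    """Split a 2D image (list of rows) into non-overlapping square patches.

    Returns flattened patches in row-major order.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [image[r][c]
                     for r in range(top, top + patch_size)
                     for c in range(left, left + patch_size)]
            patches.append(patch)
    return patches


def multi_scale_tokens(image, patch_sizes):
    """Collect (patch_size, patch) pairs across several granularities."""
    tokens = []
    for size in patch_sizes:
        tokens.extend((size, patch) for patch in split_into_patches(image, size))
    return tokens


def select_top_k(scores, k):
    """Return the indices of the k highest-scoring tokens, highest first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


# Toy 8x8 "image" whose pixel value grows toward the bottom-right corner.
image = [[r * 8 + c for c in range(8)] for r in range(8)]

# Fine (2x2) and coarse (4x4) patches are pooled into one token sequence.
tokens = multi_scale_tokens(image, patch_sizes=(2, 4))

# Stand-in for attention weights: mean intensity of each patch. A real model
# would derive these scores from its self-attention maps instead.
scores = [sum(patch) / len(patch) for _, patch in tokens]
selected = select_top_k(scores, k=3)
```

In this toy setup the coarse scale adds only 4 tokens to the 16 fine-scale ones, which hints at why mixing scales need not inflate the sequence length much; how GRVT actually embeds, fuses, and scores patches is described in the paper itself.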