Res-VMamba: Fine-Grained Food Category Visual Classification Using Selective State Space Models with Deep Residual Learning

Chi-Sheng Chen, Guan-Ying Chen, Dong Zhou, Di Jiang, Dai-Shi Chen
2024-09-07
Abstract: Food classification is the foundation for developing food vision tasks and plays a key role in the burgeoning field of computational nutrition. Because food is complex and requires fine-grained classification, recent academic research mainly modifies Convolutional Neural Networks (CNNs) and/or Vision Transformers (ViTs) to perform food category classification. However, to learn fine-grained features, the CNN backbone needs additional structural design, whereas ViT, which contains the self-attention module, has increased computational complexity. In recent months, a new structured state space sequence (S4) model with a selection mechanism and scan-based computation (S6), colloquially termed Mamba, has demonstrated superior performance and computational efficiency compared to the Transformer architecture. The VMamba model, which incorporates the Mamba mechanism into image tasks (such as classification), currently establishes the state-of-the-art (SOTA) on the ImageNet dataset. In this research, we introduce an academically underestimated food dataset, CNFOOD-241, and pioneer the integration of a residual learning framework within the VMamba model to concurrently harness both global and local state features inherent in the original VMamba architectural design. The research results show that VMamba surpasses current SOTA models in fine-grained and food classification. The proposed Res-VMamba further improves the classification accuracy to 79.54% without pretrained weights. Our findings show that the proposed methodology establishes a new benchmark for SOTA performance in food recognition on the CNFOOD-241 dataset. The code can be obtained on GitHub: https://github.com/ChiShengChen/ResVMamba.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses fine-grained visual classification of food categories. Specifically:

1. **Existing Challenges**: The biggest challenge in current food classification is large intra-class variance combined with small inter-class variance. Even slight differences in ingredients can produce dishes that look alike but belong to different categories (e.g., minced pork fried rice vs. shrimp fried rice).
2. **Proposed Method**: To tackle these challenges, the paper introduces Res-VMamba, an improved version of the VMamba model that adds a residual learning framework so that global and local features of the image are used together. This is intended to improve the accuracy of fine-grained food classification.
3. **Dataset Contribution**: The paper also introduces CNFOOD-241, a large-scale dataset covering 241 types of Chinese food. Its images have a uniform size (600×600 pixels) and its categories are imbalanced, which makes it a challenging benchmark.
4. **Experimental Results**: Res-VMamba achieves a classification accuracy of 79.54% on CNFOOD-241, surpassing existing state-of-the-art methods and setting a new benchmark for fine-grained food classification.

In summary, the main goal of the paper is to improve fine-grained food classification by introducing a new model architecture and validating it on the CNFOOD-241 dataset.
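To make the idea of combining local and global features through residual learning more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation and does not model a VMamba block. `toy_block`, `stage_with_global_skip`, and the `scales` parameter are hypothetical names chosen for illustration, and the assumption that the "global" path is a skip connection from the stage input to the stage output is an interpretation of the summary above rather than something the text confirms.

```python
import math


def toy_block(features, scale):
    """Hypothetical stand-in for one feature block.

    Each block keeps its own (local) skip connection: the input is added
    back to a nonlinear transformation of itself, y = x + F(x).
    """
    return [f + math.tanh(scale * f) for f in features]


def stage_with_global_skip(features, scales):
    """Run several blocks in sequence, then add the stage input back.

    The final addition is the assumed 'global' residual path: the
    unprocessed input features are merged with the block-processed ones.
    """
    out = features
    for scale in scales:
        out = toy_block(out, scale)
    return [o + f for o, f in zip(out, features)]


if __name__ == "__main__":
    x = [0.5, -1.0, 2.0]  # arbitrary example features
    print(stage_with_global_skip(x, scales=[0.3, 0.7]))
```

With every `scale` set to zero the blocks reduce to the identity, so the stage simply returns twice its input. This is a quick way to check that both skip paths are wired as intended.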