Res-VMamba: Fine-Grained Food Category Visual Classification Using Selective State Space Models with Deep Residual Learning

Chi-Sheng Chen, Guan-Ying Chen, Dong Zhou, Di Jiang, Dai-Shi Chen

2024-09-07

Abstract: Food classification is the foundation for developing food vision tasks and plays a key role in the burgeoning field of computational nutrition. Because food images require fine-grained classification, recent academic research mainly modifies Convolutional Neural Networks (CNNs) and/or Vision Transformers (ViTs) to perform food category classification. However, to learn fine-grained features, a CNN backbone needs additional structural design, whereas a ViT, which contains the self-attention module, has increased computational complexity. In recent months, a new Sequence State Space (S4) model, through a Selection mechanism and computation with a Scan (S6), colloquially termed Mamba, has demonstrated superior performance and computational efficiency compared to the Transformer architecture. The VMamba model, which incorporates the Mamba mechanism into image tasks (such as classification), currently establishes the state-of-the-art (SOTA) on the ImageNet dataset. In this research, we introduce an academically underestimated food dataset, CNFOOD-241, and pioneer the integration of a residual learning framework within the VMamba model to concurrently harness both the global and local state features inherent in the original VMamba architectural design. The research results show that VMamba surpasses current SOTA models in fine-grained and food classification. The proposed Res-VMamba further improves the classification accuracy to 79.54% without pretrained weights. Our findings show that the proposed methodology establishes a new benchmark for SOTA performance in food recognition on the CNFOOD-241 dataset. The code can be obtained on GitHub: https://github.com/ChiShengChen/ResVMamba.

Computer Vision and Pattern Recognition, Artificial Intelligence

What problem does this paper attempt to address?

The paper aims to address the problem of fine-grained visual classification in food categorization. Specifically:

1. **Existing Challenges**: The biggest challenge in current food classification is the large intra-class variance and small inter-class variance. Even slight differences in ingredients can produce dishes that look alike but belong to different categories (e.g., minced pork fried rice vs. shrimp fried rice).
2. **Proposed Method**: To tackle these challenges, the paper introduces the Res-VMamba model, an improved version of the VMamba model that adds a residual learning framework so that both global and local features in images are used simultaneously. This approach helps to improve the accuracy of fine-grained food classification.
3. **Dataset Contribution**: The paper also introduces a dataset named CNFOOD-241, a large-scale dataset containing 241 types of Chinese food. This dataset features uniform image sizes (600×600 pixels) and an imbalanced distribution among categories, making it a challenging benchmark dataset.
4. **Experimental Results**: Experimental results show that the Res-VMamba model achieved a classification accuracy of 79.54% on the CNFOOD-241 dataset, surpassing existing state-of-the-art methods and establishing new benchmark performance in fine-grained and food classification tasks.

In summary, the main goal of the paper is to enhance the performance of fine-grained food classification by introducing a new model architecture and validating its effectiveness in practical applications.
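The residual learning idea mentioned in the summary can be illustrated with a minimal, library-free sketch. It shows only the generic skip-connection pattern, output = transform(x) + x, in which the untouched input (local detail) is added back to the transformed features (here a toy global summary). The names `residual_block` and `toy_global_mixer` are hypothetical and chosen for illustration; this is not the authors' Res-VMamba architecture, whose actual code lives in the GitHub repository linked in the abstract.

```python
from typing import Callable, List

Vector = List[float]


def residual_block(x: Vector, transform: Callable[[Vector], Vector]) -> Vector:
    """Return transform(x) + x, the skip-connection form of residual learning."""
    fx = transform(x)
    if len(fx) != len(x):
        # The element-wise sum only makes sense when dimensions match.
        raise ValueError("transform must preserve the feature dimension")
    return [a + b for a, b in zip(fx, x)]


def toy_global_mixer(x: Vector) -> Vector:
    """Hypothetical stand-in for a global-context block.

    Replaces every feature with the mean of all features, so the output
    carries only global information and no per-position detail.
    """
    mean = sum(x) / len(x)
    return [mean for _ in x]


# Assumed toy input: three feature values.
features = [1.0, 2.0, 3.0]
out = residual_block(features, toy_global_mixer)
print(out)  # global mean (2.0) added to each original feature
```

Without the skip connection, the toy mixer would return the same value at every position; adding the input back preserves the per-position differences alongside the global summary, which is the intuition behind combining global and local features.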
