Knowledge distillation to effectively attain both region-of-interest and global semantics from an image where multiple objects appear

Seonwhee Jin

2024-07-11

Abstract: Models based on convolutional neural networks (CNNs) and transformers have steadily improved and have been applied to various downstream computer vision tasks. However, in object detection tasks, accurately localizing and classifying the nearly unlimited categories of food in images remains challenging. To address these problems, we first segmented the food as the region-of-interest (ROI) using the segment-anything model (SAM) and masked everything outside the ROI with black pixels. This process reduced the problem to a single classification task, for which annotation and training were much simpler than for object detection. The images in which only the ROI was preserved were fed as inputs to fine-tune various off-the-shelf models, each encoding its own inductive biases. Among them, Data-efficient image Transformers (DeiTs) had the best classification performance. Nonetheless, when foods' shapes and textures were similar, the contextual features of the ROI-only images were not enough for accurate classification. Therefore, we introduced a novel type of combined architecture, RveRNet, which consists of ROI, extra-ROI, and integration modules that allow it to account for both the ROI's context and the global context. RveRNet's F1 score was 10% better than that of the individual models when classifying ambiguous food images. RveRNet performed best when its modules were DeiTs with knowledge distilled from a CNN. We also investigated how architectures can be made robust against input noise caused by permutation and translocation. The results indicated a trade-off between how much of the CNN teacher's knowledge could be distilled into DeiT and DeiT's innate strength. Code is publicly available at: https://github.com/Seonwhee-Genome/RveRNet
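The masking step described in the abstract can be illustrated with a minimal sketch. It assumes a binary segmentation mask is already available (for example, one produced by SAM; no SAM call is made here). The function name `split_roi` is hypothetical, and using the complementary image as the input to an extra-ROI branch is an assumption for illustration, not something the abstract specifies.

```python
import numpy as np


def split_roi(image: np.ndarray, mask: np.ndarray):
    """Return (roi_only, extra_roi) copies of an H x W x C image.

    roi_only keeps the pixels where mask is True and blackens the rest;
    extra_roi is the complement (assumed here, not taken from the paper).
    """
    if image.shape[:2] != mask.shape:
        raise ValueError("mask must match the image height and width")
    keep = mask.astype(bool)[..., None]  # add a channel axis for broadcasting
    roi_only = np.where(keep, image, 0).astype(image.dtype)
    extra_roi = np.where(keep, 0, image).astype(image.dtype)
    return roi_only, extra_roi


# Toy 4x4 RGB image with a 2x2 region of interest in the centre.
image = np.full((4, 4, 3), 200, dtype=np.uint8)
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True
roi_only, extra_roi = split_roi(image, mask)
```

Because the two outputs are complementary, adding them pixel by pixel recovers the original image, which is a quick sanity check on the mask.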

Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning

What problem does this paper attempt to address?

This paper explores how to extract both region-of-interest (ROI) and global semantics from images that contain multiple objects, with the goal of improving food recognition. The authors first use the Segment-Anything Model (SAM) to segment the food and mask the non-ROI regions, reducing the problem to a single classification task. They then propose a new architecture, RveRNet, which consists of an ROI module, an extra-ROI module, and an integration module, so that it can take both the ROI and the global context into account. Experiments show that RveRNet built from Data-efficient image Transformers (DeiT) with knowledge distillation from a CNN performs best on ambiguous food categories. The paper also studies the robustness of different model architectures to input noise and finds a trade-off between how much of the CNN teacher's knowledge is distilled into DeiT and DeiT's innate strength. The code is publicly available on GitHub.
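As a rough illustration of CNN-to-DeiT knowledge distillation, the sketch below implements the hard-label distillation objective from the original DeiT work: the class-token head is trained on the ground-truth label, and the distillation-token head is trained on the teacher's predicted label, with equal weights. Whether RveRNet uses this exact variant is not stated in the text above, so treat it as an assumption. The function names are hypothetical, and plain NumPy arrays stand in for model outputs.

```python
import numpy as np


def cross_entropy(logits: np.ndarray, targets: np.ndarray) -> float:
    """Mean cross-entropy of integer targets under softmax(logits)."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())


def hard_distillation_loss(cls_logits, dist_logits, teacher_logits, labels):
    """Equal-weight sum of the label loss and the teacher-imitation loss."""
    teacher_labels = teacher_logits.argmax(axis=1)  # teacher's hard decisions
    label_term = cross_entropy(cls_logits, labels)
    teacher_term = cross_entropy(dist_logits, teacher_labels)
    return 0.5 * label_term + 0.5 * teacher_term


# Toy batch: 2 images, 3 food classes (all values are made up).
labels = np.array([0, 2])
teacher_logits = np.array([[4.0, 0.0, 0.0], [0.0, 0.0, 4.0]])
uniform = np.zeros((2, 3))
agreeing = np.array([[5.0, 0.0, 0.0], [0.0, 0.0, 5.0]])

loss_uniform = hard_distillation_loss(uniform, uniform, teacher_logits, labels)
loss_agreeing = hard_distillation_loss(agreeing, agreeing, teacher_logits, labels)
```

With uniform student logits the loss equals the natural log of the number of classes, and it drops as the student heads move toward the labels and the teacher's decisions.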

