Knowledge distillation to effectively attain both region-of-interest and global semantics from an image where multiple objects appear

Seonwhee Jin
2024-07-11
Abstract: Models based on convolutional neural networks (CNNs) and transformers have steadily improved and have been applied to various downstream computer vision tasks. However, in object detection, accurately localizing and classifying a nearly unlimited number of food categories in images remains challenging. To address this problem, we first segmented the food as the region of interest (ROI) using the Segment Anything Model (SAM) and masked everything outside the ROI with black pixels. This step simplified the problem to a single classification task, for which annotation and training are much simpler than for object detection. The images in which only the ROI was preserved were used as inputs to fine-tune various off-the-shelf models, each encoding its own inductive biases. Among them, Data-efficient image Transformers (DeiTs) had the best classification performance. Nonetheless, when foods had similar shapes and textures, the contextual features of the ROI-only images were not sufficient for accurate classification. We therefore introduced a novel combined architecture, RveRNet, which consists of ROI, extra-ROI, and integration modules that allow it to account for both the ROI's context and the global context. RveRNet's F1 score was 10% better than that of the individual models when classifying ambiguous food images. RveRNet performed best when its modules were DeiTs with knowledge distilled from a CNN. We also investigated how architectures can be made robust against input noise caused by permutation and translocation. The results indicated a trade-off between how much of the CNN teacher's knowledge could be distilled into DeiT and DeiT's innate strength. Code is publicly available at https://github.com/Seonwhee-Genome/RveRNet
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
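The masking step described in the abstract can be illustrated with a short NumPy sketch. It assumes a boolean segmentation mask is already available (for example, one produced by SAM) and does not call SAM itself. The function name `split_by_mask` is hypothetical, and returning the complementary extra-ROI image alongside the ROI-only image is an assumption about what the extra-ROI module receives, not something the abstract states.

```python
import numpy as np


def split_by_mask(image: np.ndarray, mask: np.ndarray):
    """Return (roi_only, extra_roi) copies of an H x W x C image.

    mask is an H x W boolean array that is True inside the ROI.
    Pixels outside the kept region are set to 0, i.e. black.
    """
    if image.shape[:2] != mask.shape:
        raise ValueError("mask must match the image height and width")
    keep = mask.astype(bool)[..., None]  # add a channel axis for broadcasting
    # ROI-only image: keep the food region, black out everything else.
    roi_only = np.where(keep, image, 0).astype(image.dtype)
    # Assumed extra-ROI image: the complement, with the ROI blacked out.
    extra_roi = np.where(keep, 0, image).astype(image.dtype)
    return roi_only, extra_roi
```

Because the two outputs are complementary, adding them pixel by pixel reproduces the original image, which is a convenient sanity check when preparing a dataset this way.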
What problem does this paper attempt to address?
This paper addresses food recognition in images that contain multiple objects, where a model must capture both the region of interest (ROI) and the global semantics of the scene. The authors first use the Segment Anything Model (SAM) to segment the food and mask the non-ROI regions, which reduces the problem from object detection to a single classification task. They then propose a new architecture, RveRNet, which consists of an ROI module, an extra-ROI module, and an integration module, so that it can account for the ROI and the global context together. Experiments show that RveRNet performs best on ambiguous food categories when its modules are Data-efficient image Transformers (DeiTs) trained with knowledge distillation from a CNN. The paper also studies the robustness of different architectures to input noise caused by permutation and translocation, and finds a trade-off between how much of the CNN teacher's knowledge is distilled into DeiT and DeiT's innate strength. The code is publicly available on GitHub.
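As a rough illustration of what distilling a CNN teacher into DeiT can look like, the sketch below implements the hard-label distillation objective from the original DeiT work: the class-token head is trained on the ground-truth labels, the distillation-token head is trained on the teacher's predicted labels, and the two cross-entropy terms are averaged. Whether RveRNet uses exactly this form is an assumption; the function names and the equal 0.5 weighting are illustrative, not taken from the paper's code.

```python
import numpy as np


def cross_entropy(logits: np.ndarray, targets: np.ndarray) -> float:
    """Mean cross-entropy of integer targets under softmax(logits)."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())


def hard_distillation_loss(cls_logits, dist_logits, teacher_logits, labels):
    """Average of the label loss and the teacher-agreement loss.

    cls_logits:     student class-token outputs, shape (N, K)
    dist_logits:    student distillation-token outputs, shape (N, K)
    teacher_logits: CNN teacher outputs, shape (N, K)
    labels:         ground-truth class indices, shape (N,)
    """
    teacher_labels = teacher_logits.argmax(axis=1)  # teacher's hard decisions
    label_term = cross_entropy(cls_logits, labels)
    teacher_term = cross_entropy(dist_logits, teacher_labels)
    return 0.5 * label_term + 0.5 * teacher_term
```

In this formulation the teacher influences the student only through the second term, so the weighting between the two terms is one place where a trade-off like the one reported above could show up.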