Indoor scene recognition from images under visual corruptions

Willams de Lima Costa, Raul Ismayilov, Nicola Strisciuglio, Estefania Talavera Martinez
2024-08-23
Abstract: The classification of indoor scenes is a critical component in various applications, such as intelligent robotics for assistive living. While deep learning has significantly advanced this field, models often suffer from reduced performance due to image corruption. This paper presents an innovative approach to indoor scene recognition that leverages multimodal data fusion, integrating caption-based semantic features with visual data to enhance both accuracy and robustness against corruption. We examine two multimodal networks that synergize visual features from CNN models with semantic captions via a Graph Convolutional Network (GCN). Our study shows that this fusion markedly improves model performance, with notable gains in Top-1 accuracy when evaluated against a corrupted subset of the Places365 dataset. Moreover, while standalone visual models displayed high accuracy on uncorrupted images, their performance deteriorated significantly with increased corruption severity. Conversely, the multimodal models demonstrated improved accuracy in clean conditions and substantial robustness to a range of image corruptions. These results highlight the efficacy of incorporating high-level contextual information through captions, suggesting a promising direction for enhancing the resilience of classification systems.
Computer Vision and Pattern Recognition
### What problem does this paper attempt to solve?

This paper addresses the performance degradation of **indoor scene recognition under visual corruptions**. Although deep learning has made remarkable progress in indoor scene recognition, the performance of existing models often drops significantly on images affected by corruptions such as blur, noise, and compression artifacts.

#### Core problems of the paper

1. **Limitations of existing methods**: Most existing indoor scene recognition methods rely on high-quality input images and perform poorly on visually corrupted images.
2. **Requirements of practical application scenarios**: In the real world, especially in applications such as intelligent robotics for assistive living, images may be subject to various types of corruption, so a method that maintains high performance under these conditions is required.

#### Proposed solutions

To address these problems, the paper proposes a **multimodal data fusion** method that improves accuracy and robustness by combining caption-based semantic features with visual data:

- **High-level description extraction**: An image captioning model generates a text description of the scene (the high-level description), which is processed by a Graph Convolutional Network (GCN).
- **Low-level description extraction**: Convolutional Neural Networks (CNNs) extract visual features (the low-level description) from the image.
- **Multimodal fusion**: The two descriptions are combined to improve robustness against different types of visual corruption.
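
The fusion idea above can be illustrated with a small, dependency-free sketch. It is not the authors' architecture: the single GCN layer, the mean pooling over caption-graph nodes, the concatenation-based fusion, and every name and number below (`gcn_layer`, `fuse_and_score`, the toy graph and weights) are assumptions made for illustration only.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def normalize_adjacency(adj):
    """Return D^-1/2 (A + I) D^-1/2, the symmetric normalization commonly used in GCNs."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]

def gcn_layer(adj_norm, node_feats, weight):
    """One graph convolution: ReLU(A_norm @ H @ W)."""
    out = matmul(matmul(adj_norm, node_feats), weight)
    return [[max(0.0, v) for v in row] for row in out]

def mean_pool(node_feats):
    """Average node embeddings into a single caption-level vector."""
    n = len(node_feats)
    return [sum(col) / n for col in zip(*node_feats)]

def fuse_and_score(visual_feat, caption_feat, classifier_weight):
    """Concatenate both modalities (assumed fusion) and apply a linear classifier."""
    fused = visual_feat + caption_feat  # list concatenation
    return matmul([fused], classifier_weight)[0]

# Toy caption graph with three word nodes connected in a chain (hypothetical).
adj = [[0.0, 1.0, 0.0],
       [1.0, 0.0, 1.0],
       [0.0, 1.0, 0.0]]
node_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # stand-in word embeddings
gcn_weight = [[1.0, 0.0], [0.0, 1.0]]              # identity, for readability

adj_norm = normalize_adjacency(adj)
caption_feat = mean_pool(gcn_layer(adj_norm, node_feats, gcn_weight))

visual_feat = [0.2, 0.9, 0.1]                      # stand-in for CNN features
classifier_weight = [[0.5, -0.5],                  # 5 fused inputs -> 2 scene classes
                     [0.1, 0.3],
                     [-0.2, 0.4],
                     [0.7, 0.0],
                     [0.0, 0.6]]
logits = fuse_and_score(visual_feat, caption_feat, classifier_weight)
predicted_class = max(range(len(logits)), key=lambda i: logits[i])
```

In a real system the stand-in vectors would come from a pretrained CNN and a captioning model, and the weights would be learned; the sketch only shows how the two descriptions meet in a single classifier input.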
#### Experimental verification

To verify the effectiveness of this method, the authors construct a new benchmark dataset, **Places148-corrupted**, which contains 148 indoor scene classes and applies 15 common types of visual corruption, each at 5 severity levels. The experimental results show that, compared with models using only visual features, the multimodal fusion models perform better on both clean and corrupted images and are notably more robust, especially on severely corrupted images.

#### Main contributions

1. **Propose a multimodal method that combines text and visual features** to improve the robustness of indoor scene recognition under visual corruptions.
2. **Introduce and publish the Places148-corrupted dataset**, providing a new benchmark for studying indoor scene recognition under visual corruptions.
3. **Provide baseline results** as a reference for future research.

Through these contributions, the paper offers a new approach to indoor scene recognition that shows potential for handling real-world image corruptions.
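
The evaluation protocol described above (one corruption type applied at five severity levels, with Top-1 accuracy reported per level) might be sketched as follows. This is an assumption-laden illustration, not the benchmark's code: the Gaussian-noise sigmas in `SEVERITY_SIGMA`, the brightness-threshold classifier, and the tiny 4x4 "images" are all hypothetical.

```python
import random

# Hypothetical noise strength per severity level; the real benchmark's
# parameters are not given in this summary.
SEVERITY_SIGMA = {1: 0.04, 2: 0.08, 3: 0.12, 4: 0.18, 5: 0.26}

def gaussian_noise(image, severity, rng):
    """Add zero-mean Gaussian noise to an image of floats in [0, 1], then clip."""
    sigma = SEVERITY_SIGMA[severity]
    return [[min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in row]
            for row in image]

def top1_accuracy(predictions, labels):
    """Fraction of samples whose predicted class equals the ground-truth label."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)

def evaluate_by_severity(images, labels, classify, corrupt, seed=0):
    """Return {severity: Top-1 accuracy} for one corruption type."""
    rng = random.Random(seed)
    results = {}
    for severity in sorted(SEVERITY_SIGMA):
        preds = [classify(corrupt(img, severity, rng)) for img in images]
        results[severity] = top1_accuracy(preds, labels)
    return results

def brightness_classifier(image):
    """Toy stand-in for a scene classifier: class 1 if the image is bright."""
    pixels = [p for row in image for p in row]
    return 1 if sum(pixels) / len(pixels) > 0.5 else 0

# Two toy grayscale images: a dark one (class 0) and a bright one (class 1).
images = [[[0.1] * 4 for _ in range(4)], [[0.9] * 4 for _ in range(4)]]
labels = [0, 1]

results = evaluate_by_severity(images, labels, brightness_classifier, gaussian_noise)
noisy = gaussian_noise(images[0], 5, random.Random(1))
```

Repeating `evaluate_by_severity` for each corruption type would give the full corruption-by-severity accuracy grid that a benchmark of this kind reports.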