Abstract: The classification of indoor scenes is a critical component in various applications, such as intelligent robotics for assistive living. While deep learning has significantly advanced this field, models often suffer from reduced performance due to image corruption. This paper presents an innovative approach to indoor scene recognition that leverages multimodal data fusion, integrating caption-based semantic features with visual data to enhance both accuracy and robustness against corruption. We examine two multimodal networks that synergize visual features from CNN models with semantic captions via a Graph Convolutional Network (GCN). Our study shows that this fusion markedly improves model performance, with notable gains in Top-1 accuracy when evaluated against a corrupted subset of the Places365 dataset. Moreover, while standalone visual models displayed high accuracy on uncorrupted images, their performance deteriorated significantly with increased corruption severity. Conversely, the multimodal models demonstrated improved accuracy in clean conditions and substantial robustness to a range of image corruptions. These results highlight the efficacy of incorporating high-level contextual information through captions, suggesting a promising direction for enhancing the resilience of classification systems.

What problem does this paper attempt to address?

### What problems does this paper attempt to solve?

This paper addresses the performance degradation of **indoor scene recognition under visual corruptions**. Although deep learning has made remarkable progress in indoor scene recognition, the performance of existing models often drops significantly on images corrupted by blur, noise, or compression distortion.

#### Core problems of the paper

1. **Limitations of existing methods**: Most existing indoor scene recognition methods rely on high-quality input images and perform poorly on images with visual corruptions.
2. **Requirements of practical application scenarios**: In the real world, especially in applications such as intelligent robot-assisted living, images may be subject to various types of corruption, so a method that maintains high performance under these conditions is required.

#### Proposed solutions

To address these problems, the paper proposes a **multimodal data fusion** method, which improves the accuracy and robustness of the model by combining caption-based semantic features with visual data. It consists of the following components:

- **High-level description extraction**: Use image captioning to generate a text description of the scene (the high-level description) and process it with a Graph Convolutional Network (GCN).
- **Low-level description extraction**: Use Convolutional Neural Networks (CNNs) to extract visual features (the low-level description) from images.
- **Multimodal fusion**: Combine the two descriptions to improve the robustness of the model against different types of visual corruption.
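The three components above can be illustrated with a minimal NumPy sketch. It is not the paper's architecture: the summary does not say how the caption graph is built, how many GCN layers are used, or how the two feature vectors are combined. The sketch assumes a single standard graph-convolution layer with symmetric adjacency normalization, mean pooling over caption-word nodes, and simple concatenation followed by a linear classifier. All function names, dimensions, and weights are hypothetical.

```python
import numpy as np

def normalize_adjacency(adj):
    """Symmetric normalization with self-loops: D^{-1/2} (A + I) D^{-1/2}."""
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(a_norm, h, w):
    """One graph-convolution layer: ReLU(A_norm @ H @ W)."""
    return np.maximum(a_norm @ h @ w, 0.0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def fuse_and_classify(visual_feat, word_feats, adj, w_gcn, w_cls, b_cls):
    """Concatenate a CNN feature vector with a pooled GCN caption feature."""
    node_feats = gcn_layer(normalize_adjacency(adj), word_feats, w_gcn)
    caption_feat = node_feats.mean(axis=0)          # assumed pooling: mean over words
    fused = np.concatenate([visual_feat, caption_feat])  # assumed fusion: concatenation
    return softmax(w_cls @ fused + b_cls)

# Toy inputs: random stand-ins for CNN features and caption word embeddings.
rng = np.random.default_rng(0)
n_words, word_dim, gcn_dim, vis_dim, n_classes = 4, 8, 6, 16, 148
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)  # hypothetical chain graph over caption words
probs = fuse_and_classify(
    visual_feat=rng.normal(size=vis_dim),
    word_feats=rng.normal(size=(n_words, word_dim)),
    adj=adj,
    w_gcn=rng.normal(size=(word_dim, gcn_dim)),
    w_cls=rng.normal(size=(n_classes, vis_dim + gcn_dim)) * 0.1,
    b_cls=np.zeros(n_classes),
)
predicted_class = int(np.argmax(probs))
```

The intuition behind this kind of design is that the caption branch carries scene semantics that do not depend directly on pixel-level detail, so the fused vector may degrade more gracefully than the visual feature alone when the image is corrupted.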
#### Experimental verification

To verify the effectiveness of this method, the authors construct a new benchmark dataset, **Places148-corrupted**, which contains 148 classes of indoor scenes and introduces 15 common types of visual corruption, each at 5 severity levels. The experimental results show that, compared with the model using only visual features, the multimodal fusion model performs better on both clean and corrupted images and is especially robust on severely corrupted images.

#### Main contributions

1. **Propose a multimodal method that combines text and visual features** to improve the robustness of indoor scene recognition under visual corruptions.
2. **Introduce and publish the Places148-corrupted dataset**, providing a new benchmark for studying indoor scene recognition under visual corruptions.
3. **Provide baseline results** for future research reference.

Through these contributions, the paper offers a new approach to indoor scene recognition, with particular potential for handling image corruptions in the real world.
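The evaluation protocol described under "Experimental verification" (a corruption applied at five severity levels, scored by Top-1 accuracy) can be sketched roughly as follows. This is an illustration only: it covers a single corruption type (additive Gaussian noise) rather than the 15 types in the benchmark, the per-severity noise levels in `GAUSSIAN_SIGMA` are made-up values, and `dummy_predict` is a placeholder standing in for a real classifier.

```python
import numpy as np

# Hypothetical noise standard deviations for severity levels 1..5 (not from the paper).
GAUSSIAN_SIGMA = {1: 0.04, 2: 0.08, 3: 0.12, 4: 0.18, 5: 0.26}

def gaussian_noise(image, severity, rng):
    """Add Gaussian noise to a float image in [0, 1] and clip back to [0, 1]."""
    if severity not in GAUSSIAN_SIGMA:
        raise ValueError(f"severity must be one of {sorted(GAUSSIAN_SIGMA)}")
    noisy = image + rng.normal(scale=GAUSSIAN_SIGMA[severity], size=image.shape)
    return np.clip(noisy, 0.0, 1.0)

def top1_accuracy(scores, labels):
    """Fraction of samples whose highest-scoring class equals the label."""
    return float(np.mean(np.argmax(scores, axis=1) == labels))

def evaluate_by_severity(predict_fn, images, labels, rng):
    """Return {severity: Top-1 accuracy} for one corruption type."""
    results = {}
    for severity in sorted(GAUSSIAN_SIGMA):
        corrupted = np.stack([gaussian_noise(img, severity, rng) for img in images])
        results[severity] = top1_accuracy(predict_fn(corrupted), labels)
    return results

# Toy data: random "images" and labels, purely to exercise the loop.
rng = np.random.default_rng(0)
images = rng.random(size=(10, 8, 8, 3))
labels = rng.integers(0, 2, size=10)

def dummy_predict(batch):
    # Placeholder two-class "model": class scores derived from mean brightness.
    brightness = batch.mean(axis=(1, 2, 3))
    return np.stack([1.0 - brightness, brightness], axis=1)

accuracy_per_severity = evaluate_by_severity(dummy_predict, images, labels, rng)
```

Running the same loop for a visual-only model and a fused model, over every corruption type, would give the kind of per-severity comparison the summary describes.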

Indoor scene recognition from images under visual corruptions

Robust Scene Inference under Noise-Blur Dual Corruptions

Look and Listen: A Multi-modality Late Fusion Approach to Scene Classification for Autonomous Machines

Indoor Scene Recognition Mechanism Based on Direction-Driven Convolutional Neural Networks

Optimizing Spatial Relationships in GCN to Improve the Classification Accuracy of Remote Sensing Images

Indoor scene recognition through object detection

Corrupted Point Cloud Classification Through Deep Learning with Local Feature Descriptor

Indoor Scene Recognition: An Attention-Based Approach Using Feature Selection-Based Transfer Learning and Deep Liquid State Machine

Semantic-aware scene recognition

A Survey on the Robustness of Computer Vision Models against Common Corruptions

Assessing and Enhancing Robustness of Deep Learning Models with Corruption Emulation in Digital Pathology

Locally Supervised Deep Hybrid Model for Scene Recognition

CMFN: Cross-Modal Fusion Network for Irregular Scene Text Recognition

Benchmarking the Robustness of Spatial-Temporal Models Against Corruptions

A Robust Indoor Scene Recognition Method based on Sparse Representation

Scene Classification in Indoor Environments for Robots using Context Based Word Embeddings

InstaIndoor and multi-modal deep learning for indoor scene recognition

Indoor scene recognition by a mobile robot through adaptive object detection

Self-Selection Salient Region-Based Scene Recognition Using Slight-Weight Convolutional Neural Network

Exploiting Object-based and Segmentation-based Semantic Features for Deep Learning-based Indoor Scene Classification

A New Lightweight Hybrid Graph Convolutional Neural Network -- CNN Scheme for Scene Classification using Object Detection Inference