Open Scene Understanding: Grounded Situation Recognition Meets Segment Anything for Helping People with Visual Impairments

Ruiping Liu, Jiaming Zhang, Kunyu Peng, Junwei Zheng, Ke Cao, Yufan Chen, Kailun Yang, Rainer Stiefelhagen

2023-07-15

Abstract: Grounded Situation Recognition (GSR) is capable of recognizing and interpreting visual scenes in a contextually intuitive way, yielding salient activities (verbs) and the involved entities (roles) depicted in images. In this work, we focus on the application of GSR in assisting people with visual impairments (PVI). However, precise localization information of detected objects is often required to navigate their surroundings confidently and make informed decisions. For the first time, we propose an Open Scene Understanding (OpenSU) system that aims to generate pixel-wise dense segmentation masks of involved entities instead of bounding boxes. Specifically, we build our OpenSU system on top of GSR by additionally adopting an efficient Segment Anything Model (SAM). Furthermore, to enhance the feature extraction and interaction between the encoder-decoder structure, we construct our OpenSU system using a solid pure transformer backbone to improve the performance of GSR. In order to accelerate the convergence, we replace all the activation functions within the GSR decoders with GELU, thereby reducing the training duration. In quantitative analysis, our model achieves state-of-the-art performance on the SWiG dataset. Moreover, through field testing on dedicated assistive technology datasets and application demonstrations, the proposed OpenSU system can be used to enhance scene understanding and facilitate the independent mobility of people with visual impairments. Our code will be available at https://github.com/RuipingL/OpenSU.
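For readers unfamiliar with GELU, the activation mentioned in the abstract, the following is a minimal sketch of its exact form, GELU(x) = x · Φ(x), where Φ is the standard normal CDF. The abstract does not say which GELU variant is used or which activation it replaces; the exact (erf-based) form and the ReLU comparison here are assumptions made only for illustration.

```python
import math


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def relu(x: float) -> float:
    """ReLU, used here only as an assumed point of comparison."""
    return max(0.0, x)


# Print both activations side by side for a few inputs.
for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  gelu={gelu(x):.4f}")
```

Unlike ReLU, GELU is smooth and returns small negative values for small negative inputs, approaching the identity for large positive inputs.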

Computer Vision and Pattern Recognition, Human-Computer Interaction, Robotics, Image and Video Processing

What problem does this paper attempt to address?

The paper addresses the difficulty that people with visual impairments (PVI) face in understanding their surroundings. It proposes a system called "Open Scene Understanding" (OpenSU), which combines Grounded Situation Recognition (GSR) with the Segment Anything Model (SAM) to help visually impaired individuals perceive and understand their environment. The core contributions of the paper are:

1. **Designing an Open Scene Understanding System**: The system recognizes the activity in a scene, the entities involved, and their roles, and it also generates pixel-level segmentation masks that locate those entities more precisely than bounding boxes. Precise localization matters for visually impaired individuals because it helps them navigate their daily lives more independently.
2. **Achieving Real-Time Open Scene Understanding**: The researchers propose a state-of-the-art GSR model that combines the Swin Transformer and CoFormer architecture and uses the GELU activation function to shorten training time. In addition, two different SAM variants are applied to assistive technology for visually impaired individuals for the first time.
3. **Experimental Validation**: Experiments and field tests validate the effectiveness and efficiency of the OpenSU system. Compared to existing methods, the system improves on multiple evaluation metrics, including outperforming the previous best model on the SWiG dataset.

In summary, by proposing a scene understanding method that combines GSR and SAM, the paper provides a more accurate and practical means of environmental perception for visually impaired individuals.
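The data flow described above (a verb, its roles, and a pixel-level mask per grounded entity) can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: the names `GroundedRole`, `Situation`, `box_fill_segmenter`, and `ground_with_masks` are invented here, the example scene is made up, and the sketch assumes that GSR bounding boxes are handed to the segmenter as prompts, which the summary does not spell out. The stand-in segmenter simply fills the box; a real system would call a SAM variant at that point.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels, x1/y1 exclusive
Mask = List[List[bool]]          # height x width binary mask


@dataclass
class GroundedRole:
    role: str            # semantic role, e.g. "agent"
    noun: str            # entity filling the role
    box: Optional[Box]   # None if the entity has no box (assumed possible)


@dataclass
class Situation:
    verb: str
    roles: List[GroundedRole]


def box_fill_segmenter(height: int, width: int, box: Box) -> Mask:
    """Hypothetical stand-in for a box-prompted segmenter: fills the box."""
    x0, y0, x1, y1 = box
    return [[x0 <= x < x1 and y0 <= y < y1 for x in range(width)]
            for y in range(height)]


def ground_with_masks(
    situation: Situation,
    height: int,
    width: int,
    segment: Callable[[int, int, Box], Mask] = box_fill_segmenter,
) -> Dict[str, Mask]:
    """Return one mask per role that has a bounding box."""
    return {
        r.role: segment(height, width, r.box)
        for r in situation.roles
        if r.box is not None
    }


# Made-up 4x4 example scene.
situation = Situation(
    verb="riding",
    roles=[
        GroundedRole("agent", "woman", (1, 0, 3, 2)),
        GroundedRole("vehicle", "bicycle", (0, 2, 4, 4)),
        GroundedRole("place", "street", None),  # no box, so no mask
    ],
)
masks = ground_with_masks(situation, height=4, width=4)
for role, mask in masks.items():
    area = sum(cell for row in mask for cell in row)
    print(f"{situation.verb}/{role}: {area} mask pixels")
```

Passing the segmenter in as a function keeps the GSR output structure independent of which segmentation model produces the masks, which mirrors the summary's point that more than one SAM variant can be plugged in.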
