Video Question Answering for People with Visual Impairments Using an Egocentric 360-Degree Camera

Inpyo Song, Minjun Joo, Joonhyung Kwon, Jangwon Lee

2024-05-30

Abstract: This paper addresses the daily challenges encountered by visually impaired individuals, such as limited access to information, navigation difficulties, and barriers to social interaction. To alleviate these challenges, we introduce a novel visual question answering dataset. Our dataset offers two significant advancements over previous datasets. First, it features videos captured using a 360-degree egocentric wearable camera, enabling observation of the entire surroundings and departing from the static, image-centric nature of prior datasets. Second, unlike datasets centered on a single challenge, ours addresses multiple real-life obstacles simultaneously through a visual question answering framework. We validate our dataset using various state-of-the-art VideoQA methods and diverse metrics. Results indicate that, while progress has been made, performance remains below the level that AI-powered assistive services for visually impaired individuals require. Additionally, our evaluation highlights the distinctive features of the proposed dataset, notably the ego-motion in videos captured with 360-degree cameras across varied scenarios.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper primarily addresses the challenges faced by Visually Impaired Persons (VIPs) in their daily lives, such as limited access to information, navigation difficulties, and social interaction barriers. It proposes a new video question-answering dataset called VIEW-QA (Visually Impaired Equipped with Wearable 360-degree camera Question Answering). The paper aims to solve the following core issues by constructing this dataset:

1. **Improving the quality of life for visually impaired persons**: By developing a dataset based on a 360-degree panoramic wearable camera, it helps visually impaired individuals better understand their surroundings, thereby enhancing their quality of life and independence.
2. **Covering various daily challenges**: Unlike previous datasets that focus on a single task, the VIEW-QA dataset is designed to simultaneously address multiple real-world challenges faced by visually impaired persons, including social interaction, environmental perception, object recognition, navigation, and safety issues.
3. **Utilizing dynamic visual input**: Compared to existing datasets that rely on static images, VIEW-QA uses video format, which can capture more dynamic and complex scene changes, making it more aligned with the actual needs of visually impaired persons.
4. **Promoting the development of AI-assisted technologies**: By introducing a dataset that includes multi-faceted questions and answer annotations, it provides resources for developing AI systems capable of effectively interpreting complex visual scenes and providing timely, relevant information to visually impaired persons.

In summary, this research aims to advance AI-assisted technologies by constructing the VIEW-QA dataset, thereby better supporting visually impaired persons in overcoming various challenges in their daily lives.

Grounded Question-Answering in Long Egocentric Videos

Video Question Answering for Surveillance

Video Question Answering: Datasets, Algorithms and Challenges

TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering

Video Question Answering Via Grounded Cross-Attention Network Learning

Video Question Answering via Knowledge-based Progressive Spatial-Temporal Attention Network

Data augmentation techniques for the Video Question Answering task

Embodied Question Answering in Photorealistic Environments With Point Cloud Perception

Equivariant and Invariant Grounding for Video Question Answering

Methodology to Assess Quality, Presence, Empathy, Attitude, and Attention in 360-degree Videos for Immersive Communications

Depth and Video Segmentation Based Visual Attention for Embodied Question Answering

Exo2EgoDVC: Dense Video Captioning of Egocentric Procedural Activities Using Web Instructional Videos

Eyes on the Road: State-of-the-Art Video Question Answering Models Assessment for Traffic Monitoring Tasks

Semantic-aware Dynamic Retrospective-Prospective Reasoning for Event-level Video Question Answering

Spatially Visual Perception for End-to-End Robotic Learning

Grounded Multi-Hop VideoQA in Long-Form Egocentric Videos

Harnessing Representative Spatial-Temporal Information for Video Question Answering

AVQA: A Dataset for Audio-Visual Question Answering on Videos

Explore until Confident: Efficient Exploration for Embodied Question Answering
