Abstract: Surveillance monitoring involves many tasks, such as object detection, person identification, and activity and action recognition. Integrating this variety of surveillance tasks through a multimodal interactive system would benefit real-life deployment and support human operators. We first introduce the Surveillance Video Question Answering (SVideoQA) dataset, the first of its kind. The SVideoQA dataset addresses multi-camera surveillance monitoring through the multimodal context of Video Question Answering (VideoQA). This paper then proposes a deep learning model that approaches the VideoQA task on the SVideoQA dataset by capturing memory-driven relationships between the appearance and motion aspects of the video features. At each level of relational reasoning, the attentive parts of the motion and appearance context are identified and forwarded through frame-level and clip-level relational reasoning modules. The respective memories are also updated and forwarded to the memory-relation module, which finally predicts the answer word. The proposed memory-driven multilevel relational reasoning is made compatible with the surveillance monitoring task by incorporating a multi-camera relation module, which captures and reasons over the relationships among the video feeds across multiple cameras. Experimental results show that the proposed memory-driven multilevel relational reasoning performs significantly better on the open-ended VideoQA task than other state-of-the-art systems. The proposed method achieves accuracies of 57\% and 57.6\% on the single-camera and multi-camera tasks of the SVideoQA dataset, respectively.
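
The multi-camera relation idea in the abstract can be illustrated with a small sketch. The abstract does not give the module's equations, so the code below is an assumption: it uses a generic relation-network-style formulation in which a pairwise function g is applied to every ordered pair of per-camera feature vectors together with a question embedding, the results are summed, and a readout f scores candidate answer words. All names (`relate`, `w_g`, `w_f`), the sizes, and the random weights are hypothetical and are not taken from the paper.

```python
import math
import random


def matvec(vec, mat):
    """Multiply a row vector of length m by an m x n matrix (nested lists)."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]


def relu(vec):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in vec]


def relate(camera_feats, question, w_g, w_f):
    """Sum g(o_i, o_j, q) over ordered pairs of distinct cameras, then apply f.

    Assumption: g is a single linear layer with ReLU and f is a single
    linear layer; the paper may use a different design.
    """
    pooled = [0.0] * len(w_g[0])
    for i, feat_a in enumerate(camera_feats):
        for j, feat_b in enumerate(camera_feats):
            if i == j:
                continue
            # Concatenate both camera vectors with the question embedding.
            g_out = relu(matvec(feat_a + feat_b + question, w_g))
            pooled = [p + g for p, g in zip(pooled, g_out)]
    return matvec(pooled, w_f)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Small random weight matrix, standing in for learned parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
D, Q, H, V = 4, 3, 8, 5  # hypothetical feature, question, hidden, vocabulary sizes
cams = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(3)]  # three camera feeds
question = [rng.uniform(-1, 1) for _ in range(Q)]
w_g = random_matrix(2 * D + Q, H, rng)
w_f = random_matrix(H, V, rng)

probs = softmax(relate(cams, question, w_g, w_f))
print(len(probs), round(sum(probs), 6))
```

Because the pairwise outputs are summed over all ordered pairs, the result in this sketch does not depend on the order in which the camera feeds are listed, which is one reason a relation-style aggregation is a plausible fit for a variable set of cameras.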

Video question answering via traffic knowledge database and question classification

Eyes on the Road: State-of-the-Art Video Question Answering Models Assessment for Traffic Monitoring Tasks

Video Question Answering: a Survey of Models and Datasets

Automatic Traffic Real-Time Analysis System Based on Video

Video Question Answering via Knowledge-based Progressive Spatial-Temporal Attention Network

Video Question Answering Via Gradually Refined Attention over Appearance and Motion

Video Question Answering: Datasets, Algorithms and Challenges

Video Question Answering Via Grounded Cross-Attention Network Learning

Multichannel Attention Refinement for Video Question Answering

Video Question Answering Via Multi-Granularity Temporal Attention Network Learning

Question-Aware Tube-Switch Network for Video Question Answering

Unifying the Video and Question Attentions for Open-Ended Video Question Answering

ActivityNet-QA: A Dataset for Understanding Complex Web Videos Via Question Answering

Memory Augmented Deep Recurrent Neural Network for Video Question Answering

Video Question Answering for Surveillance

Video Question Answering via Attribute-Augmented Attention Network Learning

Instance-sequence reasoning for video question answering

Frame Augmented Alternating Attention Network for Video Question Answering

Visual Causal Scene Refinement for Video Question Answering

End-to-End Video Question Answering with Frame Scoring Mechanisms and Adaptive Sampling

Transformer-Empowered Invariant Grounding for Video Question Answering