Abstract: We propose a framework for jointly parsing video and text to understand events and answer user queries. Our framework produces a parse graph that represents the compositional structures of spatial information (objects and scenes), temporal information (actions and events) and causal information (causalities between events and fluents) in the video and text. The knowledge representation of our framework is based on a spatial-temporal-causal And-Or graph (S/T/C-AOG), which jointly models possible hierarchical compositions of objects, scenes and events as well as their interactions and mutual contexts, and specifies the prior probability distribution of the parse graphs. We present a probabilistic generative model for joint parsing that captures the relations between the input video and text, their corresponding parse graphs and the joint parse graph. Based on the probabilistic model, we propose a joint parsing system consisting of three modules: video parsing, text parsing and joint inference. Video parsing and text parsing produce two parse graphs from the input video and text respectively. The joint inference module produces a joint parse graph by performing matching, deduction and revision on the video and text parse graphs. The proposed framework has three objectives. First, we aim at deep semantic parsing of video and text that goes beyond traditional bag-of-words approaches. Second, we perform parsing and reasoning across the spatial, temporal and causal dimensions based on the joint S/T/C-AOG representation. Third, we show that deep joint parsing facilitates subsequent applications such as generating narrative text descriptions and answering who, what, when, where and why queries. We empirically evaluated our system by comparison against ground truth and by the accuracy of query answering, and obtained satisfactory results.
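
The matching step of joint inference described in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the `Node` and `ParseGraph` types, the `match` and `joint_parse` functions, and the example labels are all hypothetical, matching is reduced to exact equality of label and kind, and the deduction and revision steps as well as the probabilistic model are omitted entirely.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    """Hypothetical parse-graph node; not taken from the paper."""
    label: str  # e.g. "person", "drink"
    kind: str   # "spatial", "temporal" or "causal"


@dataclass
class ParseGraph:
    """Toy parse graph: a set of nodes and a set of (parent, child) edges."""
    nodes: set = field(default_factory=set)
    edges: set = field(default_factory=set)


def match(video: ParseGraph, text: ParseGraph) -> set:
    """Return nodes found in both graphs (assumption: exact label/kind equality)."""
    return video.nodes & text.nodes


def joint_parse(video: ParseGraph, text: ParseGraph) -> ParseGraph:
    """Merge both graphs; equal nodes collapse into one because sets deduplicate."""
    return ParseGraph(video.nodes | text.nodes, video.edges | text.edges)


# Illustrative inputs: the video shows a person drinking from a cup,
# the text says the person drinks because they are thirsty.
person = Node("person", "spatial")
cup = Node("cup", "spatial")
drink = Node("drink", "temporal")
thirsty = Node("thirsty", "causal")

video = ParseGraph({person, cup, drink}, {(drink, person), (drink, cup)})
text = ParseGraph({person, drink, thirsty}, {(thirsty, drink)})

matched = match(video, text)
joint = joint_parse(video, text)
```

In this sketch the joint graph keeps the cup, which only the video provides, and the thirsty fluent, which only the text provides. That is the intuition behind combining the two sources, although the system described above would score such merges under its probabilistic model rather than by set union.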

SAVE: A Framework for Semantic Annotation of Visual Events

Creating Personalized Video Summaries Via Semantic Event Detection

Learning Spatiotemporal Relationships with a Unified Framework for Video Object Segmentation

Semantic Annotation for Complex Video Street Views Based on 2D–3D Multi-Feature Fusion and Aggregated Boosting Decision Forests

SAVE: Segment Audio-Visual Easy way using Segment Anything Model

Crowd Sensing Based Semantic Annotation of Surveillance Videos.

Event-Driven Semantic Concept Discovery by Exploiting Weakly Tagged Internet Images

Visual Semantic Role Labeling for Video Understanding

High-level semantic video annotation based on 3D scene structure analysis

Video Structural Description: A Semantic Based Model for Representing and Organizing Video Surveillance Big Data

Enhancing Video Event Recognition Using Automatically Constructed Semantic-Visual Knowledge Base.

Joint Video and Text Parsing for Understanding Events and Answering Queries

SAVE: Sensor anomaly visualization engine

Video Data Mining: Semantic Indexing and Event Detection from the Association Perspective

Semantic based representing and organizing surveillance big data using video structural description technology

Video structural description technology for the new generation video surveillance systems

A Representative-Based Framework For Parsing And Summarizing Events In Surveillance Videos

Semantic event detection via multimodal data mining

Visual Semantic Multimedia Event Model for Complex Event Detection in Video Streams

Learning and Parsing Video Events with Goal and Intent Prediction

Discovery of Shared Semantic Spaces for Multi-Scene Video Query and Summarization