Abstract: Developing statistical models for understanding actions and events in video data is a core interest of the vision community. Inspired by the hypothesis that a biological vision system has two visual streams, namely the ventral stream (spatial information) and the dorsal stream (temporal information), we present models that integrate information from both channels. Learned from observed video examples, these models can help us understand the underlying structure of the high-dimensional video space and provide computational schemes for computer vision tasks such as action detection and event recognition. Specifically, this thesis studies the following three types of spatial-temporal models:
1. Spatial-temporal deformable template model. Designed for short-term actions, a spatial-temporal deformable template model is composed of a sequence of image templates, each of which consists of shape and motion primitives. The model is "deformable" in two respects: 1) spatial deformations are achieved by locally perturbing primitives in position and orientation; 2) temporal deformations are achieved by a time-warping algorithm.
2. Animated templates model. Designed for both short-term and long-term actions, the animated templates model is built upon a hierarchical AND-OR templates model, in which an object is usually composed of several constituent parts (e.g., a person is composed of head, body, arms and feet). The original AND-OR templates model captures shape (spatial) information only; we incorporate temporal information by 1) adding short-term motion features and 2) encoding long-term motion information in a hidden Markov model.
3. Spatial-temporal contextual model. This model is designed for representing spatial-temporal relationships between multiple objects within a video scene. The relations in our model include: 1) spatial relations between subjects and semantic objects; 2) spatial relations between multiple objects in one frame; 3) temporal relations between several frames; 4) relations accounting for spatial-temporal interactions between multiple objects.
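
To make the temporal deformation of the first model more concrete, the following is a minimal sketch of classic dynamic time warping (DTW). The abstract does not say which time-warping algorithm the thesis uses, so DTW is an assumption here. The function name `dtw_distance`, the scalar per-frame features, and the absolute-difference cost are also hypothetical and chosen only for illustration.

```python
import math


def dtw_distance(a, b, cost=lambda x, y: abs(x - y)):
    """Cumulative cost of the best monotonic alignment of sequences a and b.

    Assumption: each element stands in for a per-frame template score or
    feature; the thesis may use a different cost and warping scheme.
    """
    n, m = len(a), len(b)
    # D[i][j] is the minimal cumulative cost of aligning a[:i] with b[:j].
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = cost(a[i - 1], b[j - 1])
            # Extend the cheapest of: advance in a only, in b only, or in both.
            D[i][j] = c + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]


# A sequence and a copy with one repeated frame align at zero cost.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

Under these assumptions, a template sequence played at a different speed can still match an observed clip, because repeated or skipped frames are absorbed by the alignment rather than penalised as mismatches.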

Learning spatial-temporal models for understanding actions and events in video
