Learning spatial-temporal models for understanding actions and events in video

Song-Chun Zhu, Zhenyu Yao
2011-01-01
Abstract: Developing statistical models for understanding actions and events in video data is a core interest of the vision community. Inspired by the hypothesis that a biological vision system has two visual streams, namely the ventral stream (spatial information) and the dorsal stream (temporal information), we present models that integrate information from both channels. Learned from observed video examples, these models can help us understand the underlying structure of the high-dimensional video space and provide computational schemes for computer vision tasks such as action detection and event recognition. Specifically, this thesis studies the following three types of spatial-temporal models:

1. Spatial-temporal deformable template model. Designed specifically for short-term actions, this model is composed of a sequence of image templates, each of which consists of shape and motion primitives. The model is "deformable" in two senses: 1) spatial deformations are achieved by locally perturbing primitives in position and orientation; 2) temporal deformations are achieved by a time-warping algorithm.

2. Animated templates model. Designed for both short-term and long-term actions, the animated templates model is built upon a hierarchical AND-OR templates model, in which an object is usually composed of several constituent parts (e.g. a person is composed of head, body, arms and feet). The original AND-OR templates model captures shape (spatial) information only; we incorporate temporal information by 1) adding short-term motion features and 2) allowing long-term motion information to be encoded in a hidden Markov model.

3. Spatial-temporal contextual model. This model is designed for representing spatial-temporal relationships between multiple objects within a video scene. The relations in our model include: 1) spatial relations between subjects and semantic objects; 2) spatial relations between multiple objects in one frame; 3) temporal relations between several frames; 4) relations accounting for spatial-temporal interactions between multiple objects.
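The abstract does not say which time-warping algorithm the first model uses for temporal deformation. As a rough illustration of the general idea only, the sketch below assumes classic dynamic time warping (DTW) over per-frame feature vectors with Euclidean distance. The function name `dtw_cost` and the toy sequences are hypothetical and are not taken from the thesis.

```python
import math


def dtw_cost(seq_a, seq_b):
    """Minimal cumulative cost of aligning two sequences of feature vectors.

    Assumption: classic DTW with Euclidean frame distance; the thesis may
    use a different warping scheme.
    """
    n, m = len(seq_a), len(seq_b)
    # cost[i][j] = best cost of aligning the first i frames of seq_a
    # with the first j frames of seq_b.
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(seq_a[i - 1], seq_b[j - 1])
            # A frame may match one-to-one, or either sequence may "stall",
            # which lets the alignment stretch or compress time.
            cost[i][j] = d + min(cost[i - 1][j - 1],
                                 cost[i - 1][j],
                                 cost[i][j - 1])
    return cost[n][m]


# Toy example: the second sequence repeats one frame (a slower action).
fast = [(0.0,), (1.0,), (2.0,)]
slow = [(0.0,), (1.0,), (1.0,), (2.0,)]
warped_cost = dtw_cost(fast, slow)
```

Because the warping lets one frame of `fast` align with both repeated frames of `slow`, the two sequences match at zero cost even though their lengths differ. A template that tolerates this kind of mismatch is what "deformable" in time means here.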