Abstract: Natural scenes contain a wide range of textured motion phenomena that are characterized by the movement of large numbers of particle and wave elements, such as falling snow, wavy water, and dancing grass. In this paper, we present a generative model for representing these motion patterns and study a Markov chain Monte Carlo algorithm for inferring the generative representation from observed video sequences. Our generative model consists of three components. The first is a photometric model, which represents an image as a linear superposition of image bases selected from a generic and overcomplete dictionary. The dictionary contains Gabor and LoG bases for point/particle elements and Fourier bases for wave elements. These bases compete to explain the input images and transfer them to a token (base) representation with an O(10^2)-fold dimension reduction. The second component is a geometric model, which groups spatially adjacent tokens (bases) and their motion trajectories into a number of moving elements, called "motons." A moton is a deformable template in time-space representing a moving element, such as a falling snowflake or a flying bird. The third component is a dynamic model, which characterizes the motion of particles, waves, and their interactions. For example, the motion of particle objects floating in a river, such as leaves and balls, should be coupled with the motion of waves. The trajectories of these moving elements are represented by coupled Markov chains. The dynamic model also includes probabilistic representations for the birth/death (source/sink) of the motons. We adopt a stochastic gradient algorithm for learning and inference. Given an input video sequence, the algorithm iterates two steps: 1) computing the motons and their trajectories by a number of reversible Markov chain jumps, and 2) learning the parameters that govern the geometric deformations and motion dynamics. Novel video sequences are synthesized from the learned models and, by editing the model parameters, we demonstrate the controllability of the generative model.
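To make the photometric component more concrete, the following is a minimal sketch of how bases from an overcomplete dictionary can "compete" to explain a signal as a linear superposition. It is not the paper's algorithm: it uses plain greedy matching pursuit on a 1D signal with a small Gabor-only dictionary, and the function names (`gabor_atom`, `build_dictionary`, `matching_pursuit`) and all parameter values are assumptions chosen for illustration.

```python
import numpy as np

def gabor_atom(n, center, freq, sigma=4.0):
    """Unit-norm 1D Gabor atom: a Gaussian window times a cosine (illustrative)."""
    t = np.arange(n) - center
    atom = np.exp(-t**2 / (2 * sigma**2)) * np.cos(2 * np.pi * freq * t)
    return atom / np.linalg.norm(atom)

def build_dictionary(n, freqs=(0.05, 0.1, 0.2)):
    """Overcomplete dictionary: one atom per (center, frequency) pair."""
    atoms = [gabor_atom(n, c, f) for c in range(n) for f in freqs]
    return np.stack(atoms, axis=1)  # shape (n, n * len(freqs))

def matching_pursuit(signal, dictionary, n_tokens):
    """Greedy selection: at each step the atom that best matches the residual wins."""
    residual = signal.astype(float).copy()
    tokens = []  # list of (atom index, coefficient) pairs
    for _ in range(n_tokens):
        scores = dictionary.T @ residual       # inner product with every atom
        k = int(np.argmax(np.abs(scores)))     # winning atom
        tokens.append((k, float(scores[k])))
        residual -= scores[k] * dictionary[:, k]  # remove what the winner explains
    return tokens, residual

# Hypothetical usage: a 64-sample signal built from two atoms,
# re-encoded as two (index, coefficient) tokens.
n = 64
D = build_dictionary(n)
signal = 2.0 * D[:, 30] - 1.5 * D[:, 120]
tokens, residual = matching_pursuit(signal, D, n_tokens=2)
```

Each token stores only an atom index and a coefficient, which is the sense in which a token representation is much more compact than the raw samples. The abstract's dictionary also includes LoG and Fourier bases and operates on images, which this sketch omits.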

Pursuing Atomic Video Words By Information Projection

Analysis and synthesis of textured motion: particles and waves.

Video Primal Sketch: A Unified Middle-Level Representation for Video

Analyzing The Language of Visual Tokens

Analysis and Synthesis of Textured Motion: Particle, Wave and Cartoon Sketch

Splatter a Video: Video Gaussian Representation for Versatile Processing

Modeling Textured Motion: Particle, Wave and Sketch.

Compositional Video Generation as Flow Equalization

Towards Smooth Video Composition

Video Probabilistic Diffusion Models in Projected Latent Space

Spatially Coherent Interpretations of Videos Using Pattern Theory

VideoTetris: Towards Compositional Text-to-Video Generation

Turning Text and Imagery into Captivating Visual Video

Pattern Theory for Representation and Inference of Semantic Structures in Videos

Visual Word Proximity and Linguistics for Semantic Video Indexing and Near-Duplicate Retrieval

VideoFactory: Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation

Generating descriptive visual words and visual phrases for large-scale image applications

Generative visual common sense: Testing analysis-by-synthesis on Mondrian-style image.

Dynamical Textures Modeling Via Joint Video Dictionary Learning

Modeling Complex Motion: Photometric, Geometric, Dynamic, and Topological Aspects