A hierarchical and contextual model for learning and recognizing highly variant visual categories

Song Chun Zhu,Jacob Matthew Porway
2010-01-01
Abstract:In this dissertation we present a hierarchical and contextual model for representing image patterns (manmade objects and aerial images) that are highly variant from instance to instance. These types of patterns are difficult to model because objects within the same class may have very different photometric and geometric properties and/or compositions of parts, e.g. teapots may have very different colors, shapes, and locations of their spouts and handles. We hypothesize that these varied visual patterns can be captured by using a novel representation that arranges common primitives of the patterns in a probabilistic hierarchy, thus compactly capturing possible compositional variations, and then enforces contextual constraints on the appearances of the parts, thus modeling the conditional photometric and geometric relationships of the object parts. We combine a Stochastic Context Free Grammar (SCFG), which captures the long-range compositional variations of a pattern, with a Markov Random Field (MRF), which captures the short-range constraints between neighboring pattern primitives, to create our model. We also present a minimax entropy framework for automatically learning which contextual constraints are most relevant for modeling a type of pattern and estimating their parameters. Finally, we present a novel Markov Chain Monte Carlo (MCMC) algorithm called Clustering Cooperative and Competitive Constraints (C 4) for efficiently performing Bayesian inference with our model. C4 is a method for minimizing energy functions defined on graphs that we will use to combine bottom-up and top-down information to find the best interpretation of an image. We show experiments on learning models of a number of manmade object categories and of aerial images and demonstrate that our algorithms automatically learn models that accurately capture the statistical nature of the patterns we are modeling. We also show that our model can be used for inference in new images, allowing it to identify objects in challenging scenarios.
What problem does this paper attempt to address?