Abstract: In this paper, we address the tasks of detecting, segmenting, parsing, and matching deformable objects. We use a novel probabilistic object model that we call a hierarchical deformable template (HDT). The HDT represents the object by state variables defined over a hierarchy (typically with five levels). The hierarchy is built recursively by composing elementary structures to form more complex structures. A probability distribution (a parameterized exponential model) is defined over the hierarchy to quantify the variability in shape and appearance of the object at multiple scales. To perform inference, that is, to estimate the most probable states of the hierarchy for an input image, we use a bottom-up algorithm called compositional inference. This algorithm is an approximate version of dynamic programming in which approximations (e.g., pruning) are made to ensure that the algorithm is fast while maintaining high performance. We adapt the structure-perceptron algorithm to estimate the parameters of the HDT in a discriminative manner, simultaneously estimating the appearance and shape parameters. More precisely, we specify an exponential distribution for the HDT using a dictionary of potentials, which capture the appearance and shape cues. This dictionary can be large, so the potentials do not need to be handcrafted. Instead, structure-perceptron assigns weights to the potentials so that less important potentials receive small weights (this is like a "soft" form of feature selection). Finally, we provide an experimental evaluation of HDTs on different visual tasks, including detection, segmentation, matching (alignment), and parsing. We show that HDTs achieve state-of-the-art performance on these tasks when evaluated on data sets with ground truth and when compared to alternative algorithms, which are typically specialized to each task.
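The weight-learning step described in the abstract can be illustrated with a minimal sketch of an averaged structured perceptron. This is not the authors' implementation. It assumes a toy problem with two candidate outputs, and it replaces compositional inference with brute-force enumeration of the candidates. The names `phi`, `predict`, `structure_perceptron`, `DATA`, and `CANDIDATES` are hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def dot(w: Vector, f: Vector) -> float:
    """Inner product between the weights and a vector of potentials."""
    return sum(a * b for a, b in zip(w, f))


def phi(x: float, y: int) -> Vector:
    """Toy dictionary of three potentials (an assumption, not the HDT's).

    The third potential is the same for every output y, so it cannot
    help discriminate between outputs.
    """
    return [x * y, float(y), 1.0]


def predict(w: Vector, x: float, candidates: Sequence[int],
            phi_fn: Callable[[float, int], Vector]) -> int:
    """Exhaustive argmax over candidates (stand-in for real inference)."""
    return max(candidates, key=lambda y: dot(w, phi_fn(x, y)))


def structure_perceptron(data: Sequence[Tuple[float, int]],
                         candidates: Sequence[int],
                         phi_fn: Callable[[float, int], Vector],
                         dim: int, epochs: int = 10) -> Vector:
    """Averaged structured perceptron: update only on mistakes."""
    w = [0.0] * dim
    w_sum = [0.0] * dim
    steps = 0
    for _ in range(epochs):
        for x, y_true in data:
            y_hat = predict(w, x, candidates, phi_fn)
            if y_hat != y_true:
                f_true = phi_fn(x, y_true)
                f_hat = phi_fn(x, y_hat)
                # Move weights toward the true output's potentials and
                # away from the wrongly predicted output's potentials.
                w = [wi + ft - fh for wi, ft, fh in zip(w, f_true, f_hat)]
            w_sum = [s + wi for s, wi in zip(w_sum, w)]
            steps += 1
    return [s / steps for s in w_sum]


DATA = [(2.0, 1), (1.5, 1), (-1.0, 0), (-2.0, 0)]
CANDIDATES = (0, 1)
weights = structure_perceptron(DATA, CANDIDATES, phi, dim=3)
print(weights)
```

In this toy setting the third potential takes the same value for every candidate output, so its weight is never updated and stays at zero. That is a simplified picture of the "soft" feature selection mentioned in the abstract. In the HDT the argmax would instead be computed by compositional inference over the hierarchy.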

Learning hierarchical poselets for human parsing

Discriminative Hierarchical Part-Based Models for Human Parsing and Action Recognition

Hierarchical Human Parsing with Typed Part-Relation Reasoning

Hierarchical Human Semantic Parsing With Comprehensive Part-Relation Modeling

Kinematic Skeleton Graph Augmented Network for Human Parsing

Differentiable Multi-Granularity Human Parsing

Deep Hierarchical Human Semantic Parsing

Differentiable Multi-Granularity Human Representation Learning for Instance-Aware Human Semantic Parsing

Learning Visual Symbols for Parsing Human Poses in Images

Human Parsing Using Stochastic And-or Grammars and Rich Appearances

Finer-Net: Cascaded Human Parsing with Hierarchical Granularity

Single-stage Multi-human Parsing via Point Sets and Center-based Offsets

A Deep Structure for Human Pose Estimation

A Hierarchical Model for Human Action Recognition from Body-Parts

Hierarchical Generation Of Human Pose With Part-Based Layer Representation

Learning a hierarchical deformable template for rapid deformable object parsing

Learning Pose Grammar for Monocular 3D Pose Estimation

From Simple to Complex Scenes: Learning Robust Feature Representations for Accurate Human Parsing

Attributed Grammars for Joint Estimation of Human Attributes, Part and Pose

Look into Person: Joint Body Parsing & Pose Estimation Network and A New Benchmark

Human Parsing via Shape Boltzmann Machine Networks