Abstract: This dissertation presents a composite template model, named the And-Or graph, for representing objects with large structural variability. Intuitively, an And-node represents a decomposition of a graphical structure and expands to a set of Or-nodes with associated relations; an Or-node serves as a switch variable pointing to alternative And-nodes. A traversal from the root node of the And-Or graph, called a parse graph, produces a configuration of terminal nodes (sub-templates) subject to the soft and hard relations inherited from their ancestor nodes. The And-Or graph representation can generate a large set of constrained configurations with a relatively small number of graph nodes, and thus accounts for great structural variation. The And-Or graph model is tested on tasks such as modeling and sketching human faces and clothes. A hierarchical compositional model of human faces is built as a three-layer And-Or graph. Faces are represented hierarchically: the first layer treats each face as a whole; the second layer refines the local facial parts jointly as a set of individual templates; the third layer further divides the face into 16 zones and models detailed facial features such as eye corners, marks, or wrinkles. Transitions between the layers are realized by measuring the minimum description length (MDL) given the complexity of an input face image. Diverse face representations are formed by drawing from dictionaries of global faces, parts, and skin detail features. A sketch captures the most informative part of a face in a much more concise and potentially more robust representation. However, generating good facial sketches is extremely challenging because of the rich facial details and large structural variations, especially in high-resolution images. The representational power of our generative model is demonstrated by reconstructing high-resolution face images and generating cartoon facial sketches. Our model is useful for a wide variety of applications, including recognition, non-photorealistic rendering, super-resolution, and low-bit-rate face coding.
Cloth modeling and recognition is an important and challenging problem in both vision and graphics, with applications such as dressed human recognition and tracking, human sketching, and portraiture. We built an And-Or graph model to represent different clothes configurations, such as T-shirts and jackets. In a supervised learning phase, we ask an artist to draw sketches of a set of dressed people, and we decompose the sketches into categories of cloth and body components: collars, shoulders, cuffs, hands, pants, shoes, etc. Each component has a number of distinct sub-templates (sub-graphs). An algorithm that integrates bottom-up proposals and top-down information is proposed to infer the composite clothes template efficiently from the image.
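
The And-Or structure described in the abstract can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: the class names (`Terminal`, `AndNode`, `OrNode`), the toy face graph, and the uniform choice at each Or-node are all assumptions made for illustration, and the soft and hard relations between parts are omitted entirely.

```python
import random
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Terminal:
    """A leaf sub-template, e.g. one particular eye or collar sketch."""
    name: str


@dataclass
class AndNode:
    """Decomposition: every child must appear in the configuration."""
    name: str
    children: List["Node"] = field(default_factory=list)


@dataclass
class OrNode:
    """Switch variable: exactly one alternative is selected."""
    name: str
    alternatives: List["Node"] = field(default_factory=list)


Node = Union[Terminal, AndNode, OrNode]


def count_configurations(node: Node) -> int:
    """Number of distinct terminal configurations the graph can generate."""
    if isinstance(node, Terminal):
        return 1
    if isinstance(node, AndNode):
        total = 1
        for child in node.children:  # And: choices multiply
            total *= count_configurations(child)
        return total
    # Or: choices add up across alternatives
    return sum(count_configurations(alt) for alt in node.alternatives)


def sample_parse(node: Node, rng: random.Random) -> List[str]:
    """Terminal names of one parse, choosing each Or-branch uniformly
    (a hypothetical choice rule, used here only to keep the sketch short)."""
    if isinstance(node, Terminal):
        return [node.name]
    if isinstance(node, AndNode):
        leaves: List[str] = []
        for child in node.children:
            leaves.extend(sample_parse(child, rng))
        return leaves
    return sample_parse(rng.choice(node.alternatives), rng)


# Hypothetical toy graph: a face is eyes AND nose AND mouth,
# and each part is an OR over a few sub-templates.
eyes = OrNode("eyes", [Terminal(f"eyes_{k}") for k in "abc"])
nose = OrNode("nose", [Terminal(f"nose_{k}") for k in "ab"])
mouth = OrNode("mouth", [Terminal(f"mouth_{k}") for k in "abcd"])
face = AndNode("face", [eyes, nose, mouth])

print(count_configurations(face))             # 3 * 2 * 4 = 24
print(sample_parse(face, random.Random(0)))   # one eyes, one nose, one mouth
```

In this toy graph, 13 nodes (one And-node, three Or-nodes, nine terminals) encode 24 configurations, which is the sense in which a small number of graph nodes can account for a large set of structural variations.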

Towards a Unified Compositional Model for Visual Pattern Modeling

Visualizing and Understanding Neural Models in NLP

Learning to Infer Unseen Single-/Multi-Attribute-Object Compositions with Graph Networks

Rates for Inductive Learning of Compositional Models

Learning to Infer Unseen Attribute-Object Compositions

Compositional diversity in visual concept learning

Interpretable Composition Attribution Enhancement for Visio-linguistic Compositional Understanding

Exploring the Effect of Primitives for Compositional Generalization in Vision-and-Language

Compositional Generalization by Learning Analytical Expressions

Learning Unseen Concepts Via Hierarchical Decomposition and Composition

MMCOMPOSITION: Revisiting the Compositionality of Pre-trained Vision-Language Models

A hierarchical compositional model for representation and sketching of high-resolution human images

A hierarchical compositional model for face representation and sketching

Compositional Zero-shot Learning Via Progressive Language-based Observations

Flexible Compositional Learning of Structured Visual Concepts

Compositional Structure Learning for Action Understanding

What makes Models Compositional? A Theoretical View: With Supplement

Learning to Compose Representations of Different Encoder Layers towards Improving Compositional Generalization

IterComp: Iterative Composition-Aware Feedback Learning from Model Gallery for Text-to-Image Generation

Semantic Part Segmentation using Compositional Model combining Shape and Appearance