Abstract: Modeling objects is one of the core problems in computer vision. A good object model can be applied to multiple visual recognition tasks. In this thesis, we use compositional relations to model the objects and parts of articulated objects, which is very challenging due to the large variability across subtypes, viewpoints, and poses. Intuitively, objects are composed of parts, and each part is in turn composed of subparts. By recursively applying the compositional relations, we obtain a hierarchical structure of the object, which is called a compositional model. Compared to popular black-box deep learning systems, our model has the advantages of explainability and adaptability to more complex scenarios. There are three challenges for the compositional model: 1) how to learn the parts and subparts, 2) how to learn the compositional structure between parts and subparts, and 3) how to allow efficient learning and inference. In this thesis, we focus on three related projects that use the concept of compositions and parts.

In the first project, we study the problem of semantic part segmentation for animals. A compositional model is used to represent the boundaries of objects and parts. Given part-level boundary annotations, a mixture of models is learned to deal with various viewpoints and poses. We also incorporate edge, appearance, and semantic part cues into the compositional model. In addition, an algorithm with linear complexity is developed for efficient inference. The evaluation on horse and cow images from PASCAL VOC demonstrates the effectiveness of our method.

In the second project, we provide a novel unsupervised method to learn semantic parts of objects from the internal states of CNNs. Our hypothesis is that semantic parts are represented by populations of neurons rather than by single filters. We propose a simple clustering technique to extract part representations, which we call visual concepts. We show qualitatively that visual concepts are semantically coherent, in that they represent most semantic parts of the object, and visually coherent, in that their corresponding image patches appear very similar. Next, we treat the visual concepts as part detectors and quantitatively evaluate their performance in detecting semantic parts. The experiments on the PASCAL3D+ dataset and our newly annotated VehicleSemanticPart dataset support the hypothesis that CNNs have internal representations of object parts encoded by populations of neurons.

In the third project, we use the visual concepts to detect semantic parts on partially occluded objects. We consider a scenario where the model is trained on non-occluded images but tested on occluded images. The motivation is that there are infinitely many occlusion patterns in the real world, so models should be inherently robust and adaptive to occlusion instead of learning the occlusion patterns in the training data. Our approach uses a simple voting scheme, based on log-likelihood ratio tests and spatial constraints, to combine the evidence from local cues, which are derived from visual concepts. Experiments show that our algorithm outperforms several competitors in semantic part detection, especially in the presence of occlusion.
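To make the idea of extracting visual concepts by clustering more concrete, here is a minimal sketch. The abstract only says "a simple clustering technique", so everything specific below is an assumption for illustration: the use of k-means, the L2 normalization of feature vectors, the farthest-first initialization, and the function name `extract_visual_concepts`. Plain Python lists stand in for feature vectors that would, in practice, be read from an intermediate CNN layer.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (assumed preprocessing step)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_center(vec, centers):
    """Index of the cluster center closest to `vec`."""
    return min(range(len(centers)), key=lambda k: squared_distance(vec, centers[k]))


def extract_visual_concepts(features, num_concepts, num_iters=20, seed=0):
    """Hypothetical helper: cluster feature vectors with plain k-means.

    Each returned center plays the role of one "visual concept";
    `labels[i]` is the concept assigned to `features[i]`.
    """
    rng = random.Random(seed)
    feats = [l2_normalize(f) for f in features]

    # Farthest-first initialization (an assumption): start from a random
    # vector, then repeatedly add the vector farthest from all chosen centers.
    centers = [list(rng.choice(feats))]
    while len(centers) < num_concepts:
        farthest = max(feats, key=lambda f: min(squared_distance(f, c) for c in centers))
        centers.append(list(farthest))

    for _ in range(num_iters):
        labels = [nearest_center(f, centers) for f in feats]
        for k in range(num_concepts):
            members = [f for f, lab in zip(feats, labels) if lab == k]
            if members:  # keep the previous center if a cluster is empty
                centers[k] = [sum(col) / len(members) for col in zip(*members)]

    labels = [nearest_center(f, centers) for f in feats]
    return centers, labels
```

Under these assumptions, the image patches assigned to the same center could then be inspected side by side to judge whether a cluster is visually and semantically coherent, which is the kind of qualitative check the abstract describes.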

SCAN: Learning Hierarchical Compositional Visual Concepts

Compositional diversity in visual concept learning

Learning Unseen Concepts Via Hierarchical Decomposition and Composition

Pix2Code: Learning to Compose Neural Visual Concepts as Programs

ZeroC: A Neuro-Symbolic Model for Zero-shot Concept Recognition and Acquisition at Inference Time

Bridging Visual and Textual Semantics: Towards Consistency for Unbiased Scene Graph Generation

A Study of Compositional Generalization in Neural Models

Visual Concept Learning: Combining Machine Vision and Bayesian Generalization on Concept Hierarchies

Sample-Efficient Learning of Novel Visual Concepts

A Hippocampal–Entorhinal System Inspired Model for Visual Concept Representation

Learning and generalization of compositional representations of visual scenes

Flexible Compositional Learning of Structured Visual Concepts

SegDiscover: Visual Concept Discovery via Unsupervised Semantic Segmentation

Constellation: Learning relational abstractions over objects for compositional imagination

Visual Superordinate Abstraction for Robust Concept Learning

Hierarchical Visual Primitive Experts for Compositional Zero-Shot Learning

FALCON: Fast Visual Concept Learning by Integrating Images, Linguistic descriptions, and Conceptual Relations

Modeling Objects and Parts by Compositional Relations

Disentangling Visual Priors: Unsupervised Learning of Scene Interpretations with Compositional Autoencoder

Hierarchical Prompt Learning for Compositional Zero-Shot Recognition