Abstract:A typical scene category, e.g., street and beach, contains an enormous number (e.g., in the order of 104 to 105) of distinct scene configurations that are composed of objects and regions of varying shapes in different layouts. A well-known representation that can effectively address such complexity is the family of compositional models; however, learning the structures of the hierarchical compositional models remains a challenging task in vision. The objective of this paper is to present an efficient method for learning such models from a set of scene configurations. We start with an over-complete representation called Hierarchical Space Tiling (HST), which quantizes the huge and continuous scene configuration space in an And-Or tree (AOT). This hierarchical AOT can generate a combinatorial number of configurations (in the order of 1031) through a small dictionary of elements. Then we estimate the HST/AOT model through a learning-by-parsing strategy, which iteratively updates the HST/AOT parameters while constructing the optimal parse trees for each training configuration. Finally we prune out the branches with zero or low probability to obtain a much smaller HST/AOT. The HST quantization allows us to transfer the challenging structure-learning problem to a tractable parameter-learning problem. We evaluate the representation in three aspects. (i) Coding efficiency. We show the learned representation can approximate valid configurations with less errors using smaller number of primitives than other popular representations. (ii) Semantic power of learning. The learned representation is less ambiguous in parsing configuration and has semantically meaningful inner concepts. It captures both the diversity and the frequency (prior) of the scene configurations. (iii) Scene classification. The model is not only fully generative but also yields discriminative scene classification performance which outperforms the state-of-the-art methods.

Learning reconfigurable scene representation by tangram model

A Reconfigurable Tangram Model for Scene Representation and Categorization

Hierarchical space tiling for scene modeling

Learning from the Tangram to Solve Mini Visual Tasks

Learning from the Tangram to Solve Mini Visual Tasks.

Reconfigurable models for scene recognition

Learning Hierarchical Space Tiling for Scene Modeling, Parsing and Attribute Tagging

Compositional scene modeling with global object-centric representations

Holistic 3D Scene Parsing and Reconstruction from a Single RGB Image

Learning to Compose Dynamic Tree Structures for Visual Contexts

Reasoning in Different Directions: Triplet Learning for Scene Graph Generation

Scene Dynamics: Counterfactual Critic Multi-Agent Training for Scene Graph Generation.

Holistic 3 D Indoor Scene Parsing and Reconstruction from a Single RGB Image

Towards a Unified Compositional Model for Visual Pattern Modeling

Iterative Reorganization with Weak Spatial Constraints: Solving Arbitrary Jigsaw Puzzles for Unsupervised Representation Learning

Model2Scene: Learning 3D Scene Representation via Contrastive Language-CAD Models Pre-training

Multiview Scene Graph

Single-Image 3D Scene Parsing Using Geometric Commonsense

Learning Spatial Common Sense with Geometry-Aware Recurrent Networks

Single-View 3D Scene Reconstruction and Parsing by Attribute Grammar

Open-Vocabulary Octree-Graph for 3D Scene Understanding