Abstract: It is a common experience for human vision to perceive the full 3D shape and scene from a single 2D image, with the occluded parts "filled in" by prior visual knowledge. Thus, computing the 3D structures of all the objects in a scene from a single image is a fundamental problem in computer vision. In this thesis, we propose a bottom-up/top-down Bayesian inference framework to compute the 3D structures of objects in the scene from a single image. The framework integrates the visual tasks involved (segmentation, perceptual grouping, object detection and recognition, and 3D reconstruction) in a principled way and incorporates prior visual knowledge into the inference. The output of the inference framework is a hierarchical "parsing graph" with the scene label at the top (the root), objects with 3D structures and their parts at intermediate nodes, and image pixels at the bottom. The number of layers in this parsing graph is determined by the types of objects or visual patterns. The nodes in the parsing graph correspond to visual patterns represented by probabilistic models. The parsing graph has both top-down connections and horizontal spatial connections, which correspond, respectively, to generative models and to spatial relations modeled by Markov random fields (MRFs). Formulated in a Bayesian framework, the inference algorithm computes the parsing graph from the input image by maximizing a posterior probability. In this optimization process, we integrate two popular computing paradigms in computer vision: generative methods and discriminative methods. The former formulate the posterior probability to be maximized in terms of generative models for images, defined by likelihood functions and priors. The latter compute discriminative proposals using bottom-up tests to drive the maximization through the solution space. Thus, the inference algorithm achieves both speed and consistency.
We also investigate three mechanisms for constructing the parsing graph efficiently, chosen according to the properties of the visual patterns being computed: a bottom-up construction mechanism, a top-down construction mechanism, and a bottom-up/top-down construction mechanism.
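The interplay between generative and discriminative methods described above can be illustrated with a toy sketch. It is not the thesis's algorithm: the "image" is a hypothetical 1D signal, the hypothesis W is a single boundary index, and all function names, the Gaussian likelihood, and the center-preferring prior are assumptions chosen for illustration. A cheap bottom-up test (local gradient magnitude) proposes a few candidate boundaries, and the generative posterior p(W | I), proportional to p(I | W) p(W), is evaluated only on those candidates.

```python
import math


def make_signal(n=200, boundary=120):
    """Hypothetical 1D 'image': a step from 0 to 2 plus small deterministic ripple."""
    return [(0.0 if i < boundary else 2.0) + 0.1 * math.sin(1.7 * i)
            for i in range(n)]


def log_posterior(signal, k, sigma=0.5, prior_scale=50.0):
    """Unnormalized log p(W | I) for the hypothesis W = 'one boundary at index k'."""
    left, right = signal[:k], signal[k:]
    mu_left = sum(left) / len(left)
    mu_right = sum(right) / len(right)
    # Generative likelihood (assumed): each sample is its segment mean
    # plus Gaussian noise with standard deviation sigma.
    sse = (sum((x - mu_left) ** 2 for x in left)
           + sum((x - mu_right) ** 2 for x in right))
    log_likelihood = -sse / (2 * sigma ** 2)
    # Prior (assumed): weak preference for boundaries near the center.
    log_prior = -abs(k - len(signal) / 2) / prior_scale
    return log_likelihood + log_prior


def bottom_up_proposals(signal, top_k=5):
    """Discriminative bottom-up test: rank boundaries by local gradient magnitude."""
    scored = [(abs(signal[k] - signal[k - 1]), k) for k in range(1, len(signal))]
    scored.sort(reverse=True)
    return [k for _, k in scored[:top_k]]


def infer(signal, top_k=5):
    """Top-down step: score only the proposed hypotheses with the posterior."""
    candidates = bottom_up_proposals(signal, top_k)
    return max(candidates, key=lambda k: log_posterior(signal, k))


def infer_exhaustive(signal):
    """Baseline: evaluate the posterior at every possible boundary."""
    return max(range(1, len(signal)), key=lambda k: log_posterior(signal, k))


if __name__ == "__main__":
    signal = make_signal()
    print("proposal-driven estimate:", infer(signal))
    print("exhaustive estimate:", infer_exhaustive(signal))
```

In this sketch the proposal-driven search evaluates the posterior five times instead of 199, which is the sense in which bottom-up proposals buy speed, while the final choice is still made by the generative model, which is the sense in which the result stays consistent with the posterior. The full framework searches over parsing graphs rather than a single index, so this should be read only as an analogy.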

Hierarchical 3D Perception from a Single Image

Discriminative Hierarchical Part-Based Models for Human Parsing and Action Recognition.

Computing three-dimensional scene from a single image by bottom-up/top-down Bayesian inference

Seeing "what" Through "why": Evidence from Probing the Causal Structure of Hierarchical Motion.

Learning 3D object-centric representation through prediction

Single-Image 3D Scene Parsing Using Geometric Commonsense

A hierarchical and contextual model for learning and recognizing highly variant visual categories

A Hierarchial Model for Visual Perception

Bayesian Reconstruction of 3d Shapes and Scenes from A Single Image

On Support Relations Inference and Scene Hierarchy Graph Construction from Point Cloud in Clustered Environments

Holistic 3D Scene Parsing and Reconstruction from a Single RGB Image

Single Image 3D Interpreter Network

Single-View 3D Scene Reconstruction and Parsing by Attribute Grammar

Single Image 3D Object Estimation with Primitive Graph Networks

3D Interpreter Networks for Viewer-Centered Wireframe Modeling

Graphics Capsule: Learning Hierarchical 3D Face Representations from 2D Images

Understanding 3D Object Interaction from a Single Image

Three-Dimensional Structure Measurement And Optimization Method Of Indoor Scene Based On Single Image

CHORUS: Learning Canonicalized 3D Human-Object Spatial Relations from Unbounded Synthesized Images

The More You See in 2D, the More You Perceive in 3D

Exploring Hierarchical Spatial Layout Cues for 3D Point Cloud based Scene Graph Prediction