Abstract: Human vision routinely perceives the full 3D shape of a scene from a single 2D image, with occluded parts "filled in" by prior visual knowledge. Computing the 3D structures of all the objects in a scene from a single image is therefore a fundamental problem in computer vision. In this thesis, we propose a bottom-up/top-down Bayesian inference framework that computes these 3D structures from a single image. The framework integrates the visual tasks involved (segmentation, perceptual grouping, object detection and recognition, and 3D reconstruction) in a principled way and incorporates prior visual knowledge into the inference. Its output is a hierarchical "parsing graph" with the scene label at the top (the root), objects with 3D structures and their parts at the intermediate nodes, and image pixels at the bottom. The number of layers in the parsing graph is determined by the types of objects or visual patterns. Each node corresponds to a visual pattern represented by a probabilistic model. The parsing graph has both top-down connections and horizontal spatial connections, which correspond to generative models and to spatial relations modeled by a Markov Random Field (MRF), respectively. Formulated in a Bayesian framework, the inference algorithm computes the parsing graph from the input image by maximizing a posterior probability. In this optimization, we integrate two popular computing paradigms in computer vision: generative methods and discriminative methods. The former define the posterior probability to be maximized in terms of generative models for images, given by likelihood functions and priors. The latter compute discriminative proposals from bottom-up tests, which drive the maximization through the solution space. Thus, the inference algorithm achieves both speed and consistency.
We also investigate three mechanisms for constructing the parsing graph efficiently, chosen according to the properties of the visual patterns being computed: a bottom-up construction mechanism, a top-down construction mechanism, and a combined bottom-up/top-down construction mechanism.
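To make the parsing graph described above more concrete, the following is a minimal sketch of one possible in-memory representation. It is an illustration only, not the data structure used in the thesis: the names `Node`, `ParsingGraph`, `spatial_edges`, and `build_example`, as well as the example labels and relations, are all hypothetical. Top-down connections are stored as parent-to-child links, and horizontal spatial connections as labeled pairs; the probabilistic models attached to nodes and edges are omitted.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """One visual pattern: a scene, an object, a part, or a pixel region."""
    label: str
    # Top-down links to the patterns this node decomposes into.
    children: List["Node"] = field(default_factory=list)

    def add_child(self, child: "Node") -> "Node":
        self.children.append(child)
        return child


@dataclass
class ParsingGraph:
    """Hypothetical container: a root node plus horizontal spatial relations."""
    root: Node
    # Horizontal links stored as (label_a, relation, label_b).
    # The abstract models such relations with an MRF; here they are only recorded.
    spatial_edges: List[Tuple[str, str, str]] = field(default_factory=list)

    def depth(self, node: Optional[Node] = None) -> int:
        """Number of layers on the longest path from `node` down to a leaf."""
        node = node if node is not None else self.root
        if not node.children:
            return 1
        return 1 + max(self.depth(child) for child in node.children)

    def leaves(self, node: Optional[Node] = None) -> List[str]:
        """Labels of the bottom-layer nodes (pixel regions), left to right."""
        node = node if node is not None else self.root
        if not node.children:
            return [node.label]
        labels: List[str] = []
        for child in node.children:
            labels.extend(self.leaves(child))
        return labels


def build_example() -> ParsingGraph:
    """Build a toy graph: a scene with a box (two parts) and a wall (no parts)."""
    scene = Node("scene:indoor")
    box = scene.add_child(Node("object:box"))
    wall = scene.add_child(Node("object:wall"))
    top = box.add_child(Node("part:top_face"))
    side = box.add_child(Node("part:side_face"))
    top.add_child(Node("pixels:region_0"))
    side.add_child(Node("pixels:region_1"))
    wall.add_child(Node("pixels:region_2"))

    graph = ParsingGraph(root=scene)
    graph.spatial_edges.append(("part:top_face", "adjacent_to", "part:side_face"))
    graph.spatial_edges.append(("object:box", "in_front_of", "object:wall"))
    return graph
```

In this toy graph the box branch has four layers (scene, object, part, pixels) while the wall branch has three, which mirrors the statement that the number of layers depends on the type of object or visual pattern.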

Computing three-dimensional scene from a single image by bottom-up/top-down Bayesian inference