Abstract: A visual scene can be recursively parsed into parts and sub-parts. We present a causal model of scene parsing that can synthesize and identify parse trees, predict perceptual complexity, and pass a (limited) Turing test. The model describes the causal process of generating a scene as recursively splitting it with vertical and horizontal cuts, modulated by two parameters: (a) Splitting Factor (SF): a large SF favors splitting a part into more sub-parts, generating a wide and shallow tree; (b) Part Similarity (PS): a large PS favors splitting a part evenly into sub-parts. Given a scene, the model infers the best parse tree by evaluating the number of parts and their similarities at each partition. The model motivates three human experiments that go beyond RT/accuracy measurements. (1) "Just cut it". Participants freely and recursively cut a blank scene into 6 rectangles. From these human-generated images, we estimated human SF and PS priors, which left the model with no free parameters. (2) "Complexity comparison". Some scenes are immediately perceived as more complex than others. Scene complexity can be quantified as "information content", determined by the probability of generating that scene (a more probable scene carries less information). Participants ranked the complexities of 20 scenes via paired comparisons. The model ranked the same images by computing their information contents. The results revealed a strong correlation between human and model rankings (r² = 0.85). (3) Turing test. Each scene can be interpreted by multiple parse trees. Participants viewed a scene and one parse tree, then reported whether the tree was generated by a human or a machine. Two baseline models were introduced: (a) the causal model with non-informative SF and PS priors; (b) a model that samples a tree uniformly from the valid ones. Only the causal model with human priors passed the Turing test. These results demonstrate how human scene parsing can be formalized with a causal model.

Meeting abstract presented at VSS 2018
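The generative process and the "information content" measure described in the abstract can be illustrated with a small sketch. The abstract does not specify the distributions involved, so everything below is an assumption made for illustration: the scene lives on an integer grid, a split into k sub-parts is weighted by `sf ** k`, uneven splits are penalized by `exp(-ps * variance)`, and the names `compositions`, `node_actions`, and `sample_tree` are hypothetical. The sketch also scores a single parse tree; the information content of a scene would require summing the probabilities of all trees that produce it.

```python
import itertools
import math
import random


def compositions(length, k):
    """Yield every way to cut an integer length into k positive integer parts."""
    for cuts in itertools.combinations(range(1, length), k - 1):
        bounds = (0,) + cuts + (length,)
        yield tuple(b - a for a, b in zip(bounds, bounds[1:]))


def node_actions(width, height, sf, ps, k_max=3):
    """Return {action: probability} for one rectangular part.

    An action is ("stop",) or (axis, parts), where axis "v" cuts the width
    with vertical cuts and axis "h" cuts the height with horizontal cuts.
    Assumed weights (not from the abstract): sf ** k favors more sub-parts,
    exp(-ps * variance of part sizes) favors even splits.
    """
    weights = {("stop",): sf}  # stopping counts as keeping k = 1 part
    for axis, length in (("v", width), ("h", height)):
        for k in range(2, min(k_max, length) + 1):
            for parts in compositions(length, k):
                mean = length / k
                var = sum((p - mean) ** 2 for p in parts) / k
                weights[(axis, parts)] = sf ** k * math.exp(-ps * var)
    total = sum(weights.values())
    return {action: w / total for action, w in weights.items()}


def sample_tree(width, height, sf, ps, depth=2, rng=random):
    """Sample a parse tree top-down; return (tree, log2 probability of the tree)."""
    if depth == 0:
        return ("leaf", width, height), 0.0
    actions = node_actions(width, height, sf, ps)
    choices = list(actions)
    action = rng.choices(choices, weights=[actions[a] for a in choices])[0]
    log_p = math.log2(actions[action])
    if action == ("stop",):
        return ("leaf", width, height), log_p
    axis, parts = action
    children = []
    for part in parts:
        w, h = (part, height) if axis == "v" else (width, part)
        child, child_log_p = sample_tree(w, h, sf, ps, depth - 1, rng)
        children.append(child)
        log_p += child_log_p  # choices at different nodes are independent
    return (axis, children), log_p


if __name__ == "__main__":
    tree, log_p = sample_tree(6, 6, sf=2.0, ps=1.0, rng=random.Random(0))
    print(tree)
    print("information content (bits):", -log_p)
```

Under these assumptions, a tree built from likely choices (few, even cuts at a moderate SF) has a small value of `-log_p`, while a tree built from unlikely choices has a large one, which mirrors the abstract's statement that a more probable scene carries less information.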

A Causal Model of Recursive Scene Parsing in Human Perception
