Abstract: Computer vision tasks typically involve describing what is present in an image (e.g. classification, detection, segmentation, and captioning). We study a visual common sense task that requires understanding what is not present. Specifically, given an image (e.g. of a living room) and the name of an object ("cushion"), a vision system is asked to predict semantically meaningful regions (masks or bounding boxes) in the image where that object could be placed or is likely to be placed by humans (e.g. on the sofa). We call this task Semantic Placement (SP) and believe that such common-sense visual understanding is critical for assistive robots (tidying a house) and AR devices (automatically rendering an object in the user's space). Studying the invisible is hard. Datasets for image description are typically constructed by curating relevant images and asking humans to annotate the contents of the image; neither of those two steps is straightforward for objects not present in the image. We overcome this challenge by operating in the opposite direction: we start with an image of an object in context from the web, and then remove that object from the image via inpainting. This automated pipeline converts unstructured web data into a dataset comprising pairs of images with/without the object. Using this pipeline, we collect a novel dataset with ${\sim}1.3$M images across $9$ object categories, and train an SP prediction model called CLIP-UNet. CLIP-UNet outperforms existing VLMs and baselines that combine semantic priors with object detectors on real-world and simulated images. In our user studies, the SP masks predicted by CLIP-UNet are favored $43.7\%$ and $31.3\%$ of the time over the $4$ SP baselines on real and simulated images, respectively. In addition, we demonstrate that leveraging SP mask predictions from CLIP-UNet enables downstream applications such as building tidying robots in indoor environments.

Learning Common Sense Through Visual Abstraction Supplementary Material

Visually Grounded Commonsense Knowledge Acquisition

CommonsenseVIS: Visualizing and Understanding Commonsense Reasoning Capabilities of Natural Language Models

Building a commonsense knowledge base for context-awareness inference

Know2Look: Commonsense Knowledge for Visual Search

Improving Commonsense in Vision-Language Models via Knowledge Graph Riddles

A Manual Experiment On Commonsense Knowledge Acquisition From Web Corpora

Axiomatic Explanations for Visual Search, Retrieval, and Similarity Learning

VCD: Knowledge Base Guided Visual Commonsense Discovery in Images

Refined Commonsense Knowledge from Large-Scale Web Contents

Commonsense Learning: An Indispensable Path towards Human-centric Multimedia

Seeing the Unseen: Visual Common Sense for Semantic Placement

Things not Written in Text: Exploring Spatial Commonsense from Visual Signals

What do Models Learn From Training on More Than Text? Measuring Visual Commonsense Knowledge

Learning Visual Commonsense for Robust Scene Graph Generation

A framework for quantifying individual and collective common sense

Commonsense Scene Semantics for Cognitive Robotics: Towards Grounding Embodied Visuo-Locomotive Interactions

Video2Commonsense: Generating Commonsense Descriptions to Enrich Video Captioning

What Really is Commonsense Knowledge?

Reading Books is Great, But Not if You Are Driving! Visually Grounded Reasoning about Defeasible Commonsense Norms

CAT: A Contextualized Conceptualization and Instantiation Framework for Commonsense Reasoning