Abstract: Humans leverage compositionality to efficiently learn new concepts, understanding how familiar parts can combine to form novel objects. In contrast, popular computer vision models struggle to make the same types of inferences, requiring more data and generalizing less flexibly than people do. Here, we study these distinctively human abilities across a range of different types of visual composition, examining how people classify and generate "alien figures" with rich relational structure. We also develop a Bayesian program induction model that searches for the best programs for generating the candidate visual figures, using a large program space containing different compositional mechanisms and abstractions. In few-shot classification tasks, we find that people and the program induction model can make a range of meaningful compositional generalizations, with the model providing a strong account of the experimental data as well as interpretable parameters that reveal human assumptions about the factors invariant to category membership (here, rotation and changes in part attachment). In few-shot generation tasks, both people and the models are able to construct compelling novel examples, with people behaving in additional structured ways beyond the model's capabilities, e.g., making choices that complete a set or reconfiguring existing parts in highly novel ways. To capture these additional behavioral patterns, we develop an alternative model based on neuro-symbolic program induction: this model also composes new concepts from existing parts yet, distinctively, uses neural network modules to capture residual statistical structure. Together, our behavioral and computational findings show how people and models can produce a rich variety of compositional behavior when classifying and generating visual objects.

Commonsense Knowledge Aware Concept Selection For Diverse and Informative Visual Storytelling

Knowledgeable Storyteller: A Commonsense-Driven Generative Model for Visual Storytelling

Knowledge-Enriched Visual Storytelling

SCO-VIST: Social Interaction Commonsense Knowledge-based Visual Storytelling

AutoStory: Generating Diverse Storytelling Images with Minimal Human Effort

Informative Visual Storytelling with Cross-modal Rules

A Knowledge-Enhanced Pretraining Model for Commonsense Story Generation

Retrieval, Selection and Writing: A Three-Stage Knowledge Grounded Storytelling Model

Context-aware Visual Storytelling with Visual Prefix Tuning and Contrastive Learning

DIVE: Towards Descriptive and Diverse Visual Commonsense Generation

Neural Storyboard Artist: Visualizing Stories with Coherent Image Sequences

CoVis: A Collaborative Framework for Fine-grained Graphic Visual Understanding

Keep it Consistent: Topic-Aware Storytelling from an Image Stream via Iterative Multi-agent Communication

Storytelling from an Image Stream Using Scene Graphs

TARN-VIST: Topic Aware Reinforcement Network for Visual Storytelling

Compositional diversity in visual concept learning

Concept Conductor: Orchestrating Multiple Personalized Concepts in Text-to-Image Synthesis

Visually Grounded Commonsense Knowledge Acquisition

Improving Commonsense in Vision-Language Models via Knowledge Graph Riddles

Diverse and Informative Dialogue Generation with Context-Specific Commonsense Knowledge Awareness