Abstract: Our understanding of the visual world is centered around various concept axes, characterizing different aspects of visual entities. While different concept axes can be easily specified by language, e.g. color, the exact visual nuances along each axis often exceed the limitations of linguistic articulations, e.g. a particular style of painting. In this work, our goal is to learn a language-informed visual concept representation, by simply distilling large pre-trained vision-language models. Specifically, we train a set of concept encoders to encode the information pertinent to a set of language-informed concept axes, with an objective of reproducing the input image through a pre-trained Text-to-Image (T2I) model. To encourage better disentanglement of different concept encoders, we anchor the concept embeddings to a set of text embeddings obtained from a pre-trained Visual Question Answering (VQA) model. At inference time, the model extracts concept embeddings along various axes from new test images, which can be remixed to generate images with novel compositions of visual concepts. With a lightweight test-time finetuning procedure, it can also generalize to novel concepts unseen at training.

What problem does this paper attempt to address?

The problem this paper attempts to address is how to extract visual concepts along language-specified concept axes (such as category, color, and material) from images and recombine these concepts to generate new images. Specifically, the authors aim to develop a framework that learns a language-guided visual concept representation by distilling pre-trained vision-language models. This representation should have the following characteristics:

1. **Fine-grained visual details**: Unlike traditional text-to-image generation, the method aims to capture subtle visual differences that general vocabulary cannot describe.
2. **Shared concept structure**: The same concept is shared across different image instances (e.g., "red" applies to both "red apple" and "red dress"), and concepts can be recombined to generate new images.
3. **Decoupled concept axes**: The concept axes are decoupled from one another, so that modifying a single axis does not affect the others.

To achieve this goal, the authors designed a method that trains a set of concept encoders to extract axis-specific concept embeddings from images, and uses a pre-trained text-to-image generation model and a pre-trained visual question answering model to optimize these embeddings. The specific steps include:

- **Concept embedding extraction**: Train concept encoders to extract information related to specific concept axes from images.
- **Input image reconstruction**: Ensure that the extracted concept embeddings can reconstruct the original input image through a pre-trained text-to-image generation model.
- **Text anchoring**: Query a pre-trained visual question answering model (such as BLIP-2) to obtain text embeddings for each concept axis, and use them as anchors that encourage the concept embeddings to stay decoupled.

Additionally, the method can adapt to concepts unseen during training through a lightweight test-time fine-tuning process, which improves the model's generalization ability.

In summary, the goal of this paper is to develop an efficient and flexible method for extracting visual concepts from images and recombining them to generate new images with novel combinations.
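
The steps above (reconstruction, text anchoring, and recombination) can be sketched in plain Python. This is a toy illustration under stated assumptions, not the authors' implementation: embeddings are short lists of floats, mean squared error stands in for whatever reconstruction and anchoring losses the real system uses, and the names `total_loss`, `remix`, and `anchor_weight` are hypothetical. The concept encoders, the frozen text-to-image model, and the VQA model are not modeled here; their outputs are simply passed in as vectors.

```python
from typing import Dict, Iterable, List

Embedding = List[float]

# Example concept axes taken from the summary text.
AXES = ["category", "color", "material"]


def mse(a: Embedding, b: Embedding) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def total_loss(
    recon_pred: Embedding,
    recon_target: Embedding,
    concept_embs: Dict[str, Embedding],
    text_anchors: Dict[str, Embedding],
    anchor_weight: float = 0.1,  # hypothetical weighting, not from the source
) -> float:
    """Reconstruction term plus a weighted anchoring term summed over axes.

    recon_pred / recon_target stand in for the output and target of the
    frozen text-to-image model (assumption: compared with MSE).
    text_anchors stand in for text embeddings obtained from the VQA model.
    """
    recon = mse(recon_pred, recon_target)
    anchor = sum(mse(concept_embs[ax], text_anchors[ax]) for ax in AXES)
    return recon + anchor_weight * anchor


def remix(
    base: Dict[str, Embedding],
    donor: Dict[str, Embedding],
    axes_from_donor: Iterable[str],
) -> Dict[str, Embedding]:
    """Copy the base image's concept embeddings, swapping selected axes
    for the donor image's embeddings on those axes."""
    chosen = set(axes_from_donor)
    unknown = chosen - set(base)
    if unknown:
        raise KeyError(f"unknown concept axes: {sorted(unknown)}")
    return {
        ax: list(donor[ax]) if ax in chosen else list(emb)
        for ax, emb in base.items()
    }
```

In this sketch, calling `remix(apple, dress, ["color"])` would keep the category and material embeddings of the first image and take only the color embedding from the second; in the described method, the remixed embeddings would then condition the pre-trained text-to-image model to generate the new composition.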

Language-Informed Visual Concept Learning
