Language-Informed Visual Concept Learning

Sharon Lee, Yunzhi Zhang, Shangzhe Wu, Jiajun Wu
2024-04-03
Abstract: Our understanding of the visual world is centered around various concept axes, characterizing different aspects of visual entities. While different concept axes can be easily specified by language, e.g. color, the exact visual nuances along each axis often exceed the limitations of linguistic articulations, e.g. a particular style of painting. In this work, our goal is to learn a language-informed visual concept representation, by simply distilling large pre-trained vision-language models. Specifically, we train a set of concept encoders to encode the information pertinent to a set of language-informed concept axes, with an objective of reproducing the input image through a pre-trained Text-to-Image (T2I) model. To encourage better disentanglement of different concept encoders, we anchor the concept embeddings to a set of text embeddings obtained from a pre-trained Visual Question Answering (VQA) model. At inference time, the model extracts concept embeddings along various axes from new test images, which can be remixed to generate images with novel compositions of visual concepts. With a lightweight test-time finetuning procedure, it can also generalize to novel concepts unseen at training.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The problem this paper attempts to address is how to extract visual concepts related to language-specified concept axes (such as category, color, material, etc.) from images and recombine these concepts to generate new images. Specifically, the authors aim to develop a framework that learns a language-guided visual concept representation by distilling pre-trained vision-language models. This representation should have the following characteristics:

1. **Fine-grained visual details**: Unlike traditional text-to-image generation, this method aims to capture more subtle visual differences, not just general vocabulary descriptions.
2. **Shared concept structure**: Share the same concepts across different image instances (e.g., "red" can apply to "red apple" and "red dress") and be able to recombine these concepts to generate new images.
3. **Decoupled concept axes**: Ensure the decoupling between each concept axis so that modifying a single concept axis does not affect other axes.

To achieve this goal, the authors designed a method that trains a set of concept encoders to extract specific concept embeddings from images and uses pre-trained text-to-image generation models and visual question answering models to optimize these embeddings. The specific steps include:

- **Concept embedding extraction**: Train concept encoders to extract information related to specific concept axes from images.
- **Reconstruct input images**: Ensure that the extracted concept embeddings can reconstruct the original input images through pre-trained text-to-image generation models.
- **Text anchoring**: Query pre-trained visual question answering models (such as BLIP-2) to obtain text embeddings related to specific concept axes as anchors to optimize the decoupling of concept embeddings.

Additionally, this method can adapt to unseen concepts during testing through a lightweight fine-tuning process, thereby enhancing the model's generalization ability.
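The training steps and the recombination idea described above can be illustrated with a toy sketch. This is not the authors' implementation: the linear `encoders`, the frozen linear `decoder` standing in for the pre-trained text-to-image model, the random `anchors` standing in for VQA-derived text embeddings, the mean-squared-error form of both loss terms, and the `anchor_weight` value are all assumptions made only to keep the example small and runnable.

```python
import numpy as np

AXES = ("category", "color", "material")  # concept axes named in the summary
IMG_DIM, EMB_DIM = 16, 8                  # toy sizes, chosen arbitrarily

rng = np.random.default_rng(0)

# One toy linear "concept encoder" per axis (hypothetical stand-in for a learned network).
encoders = {axis: rng.normal(scale=0.1, size=(IMG_DIM, EMB_DIM)) for axis in AXES}

# Frozen toy "generator" (hypothetical stand-in for the pre-trained T2I model).
decoder = rng.normal(scale=0.1, size=(len(AXES) * EMB_DIM, IMG_DIM))


def encode(image):
    """Return one embedding per concept axis for a flattened image vector."""
    return {axis: image @ weights for axis, weights in encoders.items()}


def generate(concepts):
    """Map per-axis embeddings back to image space via the frozen toy generator."""
    stacked = np.concatenate([concepts[axis] for axis in AXES])
    return stacked @ decoder


def training_loss(image, anchors, anchor_weight=0.1):
    """Reconstruction term plus text-anchor term (both MSE here, by assumption)."""
    concepts = encode(image)
    recon = np.mean((generate(concepts) - image) ** 2)
    anchor = np.mean([np.mean((concepts[a] - anchors[a]) ** 2) for a in AXES])
    return recon + anchor_weight * anchor


def remix(image_a, image_b, swap_axis):
    """Generate from image_a's concepts, with one axis taken from image_b."""
    concepts = encode(image_a)
    concepts[swap_axis] = encode(image_b)[swap_axis]
    return generate(concepts)


# Toy usage: random vectors stand in for images and for text-embedding anchors.
image_a = rng.normal(size=IMG_DIM)
image_b = rng.normal(size=IMG_DIM)
anchors = {axis: rng.normal(size=EMB_DIM) for axis in AXES}

loss = training_loss(image_a, anchors)
mixed = remix(image_a, image_b, "color")
```

In this sketch only the encoders would be updated during training, while the generator stays frozen, mirroring the distillation setup the summary describes. Swapping a single axis in `remix` corresponds to the goal that changing one concept axis leaves the others untouched.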
In summary, the goal of this paper is to develop an efficient and flexible method for extracting and recombining visual concepts from images to generate new images with novel combinations.