Large Language Models are Good Prompt Learners for Low-Shot Image Classification

Zhaoheng Zheng, Jingmin Wei, Xuefeng Hu, Haidong Zhu, Ram Nevatia

2024-04-03

Abstract: Low-shot image classification, where training images are limited or inaccessible, has benefited from recent progress on pre-trained vision-language (VL) models with strong generalizability, e.g., CLIP. Prompt learning methods built with VL models generate text features from the class names that only have confined class-specific information. Large Language Models (LLMs), with their vast encyclopedic knowledge, emerge as the complement. Thus, in this paper, we discuss the integration of LLMs to enhance pre-trained VL models, specifically on low-shot classification. However, the domain gap between language and vision blocks the direct application of LLMs. Thus, we propose LLaMP, Large Language Models as Prompt learners, that produces adaptive prompts for the CLIP text encoder, establishing it as the connecting bridge. Experiments show that, compared with other state-of-the-art prompt learning methods, LLaMP yields better performance on both zero-shot generalization and few-shot image classification, over a spectrum of 11 datasets. Code will be made available at: https://github.com/zhaohengz/LLaMP

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses how the knowledge stored in large language models (LLMs) can be used to improve low-shot image classification, where training images are limited or unavailable. Existing prompt learning methods build on pre-trained vision-language models such as CLIP and derive text features from class names alone. Class names carry little class-specific information, so these methods struggle to distinguish fine-grained categories. The paper introduces LLaMP (Large Language Models as Prompt learners), in which an LLM generates adaptive prompts for the CLIP text encoder and thereby supplies richer category-specific information. The authors report that LLaMP outperforms other state-of-the-art prompt learning methods on both zero-shot generalization and few-shot image classification across 11 datasets.
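To make the role of text features concrete, the sketch below shows the general CLIP-style classification step that prompt learning methods feed into: an image feature is compared with one text feature per class by cosine similarity, and the scaled similarities are turned into probabilities. It is a minimal illustration and not the LLaMP implementation. The function names, the toy 3-dimensional features, and the temperature of 0.01 are all assumptions made for this example; in a real system the features would come from the CLIP image and text encoders, with the text side driven by the learned prompts.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_logits(image_feat, text_feats, temperature=0.01):
    """Cosine similarity between one image feature and each class text feature,
    divided by a temperature (0.01 is an assumed value for this sketch)."""
    img = l2_normalize(image_feat)
    logits = []
    for text_feat in text_feats:
        txt = l2_normalize(text_feat)
        logits.append(sum(a * b for a, b in zip(img, txt)) / temperature)
    return logits


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(image_feat, text_feats, class_names, temperature=0.01):
    """Return the most likely class name and the full probability list."""
    probs = softmax(cosine_logits(image_feat, text_feats, temperature))
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs


if __name__ == "__main__":
    # Hypothetical stand-in features; real ones would come from the encoders.
    class_names = ["sparrow", "finch"]
    text_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    image_feat = [0.9, 0.1, 0.0]
    label, probs = classify(image_feat, text_feats, class_names)
    print(label, probs)
```

Under this view, a prompt learner changes only how `text_feats` are produced. The more class-specific information the text features carry, the easier it is to separate fine-grained classes whose names alone are close.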
