Abstract: Few-shot image classification aims to generalize to unseen categories from only a few training samples. Transfer learning is a prominent approach to this task: it first learns a backbone on the base classes and then trains a classifier on the new classes using the previously learned knowledge. The convolutional neural network (CNN) is typically the preferred backbone. However, when samples are limited, the features extracted by a CNN become less representative, which degrades few-shot image classification performance. Recently, pre-trained large-scale vision-language models such as CLIP have shown notable potential as backbones for zero-shot or few-shot transfer to a range of downstream tasks through prompting. To fully explore the few-shot image classification performance of vision-language models, we propose CoCoOpter, a novel "pre-training + prompt tuning + fine-tuning" paradigm based on CLIP. CoCoOpter alleviates overfitting and preserves generalizability to unseen categories. Specifically, it learns an input-specific neural network that relieves overfitting by shifting attention from a specific category to each individual input sample. Then, to connect the visual and textual signals, it feeds the learned visual representations into automatic prompt tuning in the middle of the pre-trained CLIP, which enables learning input-specific prompt vectors. Moreover, two learnable lightweight neural networks are added at the end of CLIP to guide information propagation between different classes by fine-tuning both the visual and textual features. We perform extensive experiments on 11 image classification datasets. The results show that CoCoOpter generalizes better to unseen classes and achieves superior few-shot classification performance with a straightforward design.
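
The pipeline described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. It assumes that the input-specific network shifts every learnable context vector additively, and that the two lightweight networks are residual adapters blended with a ratio `alpha`. The abstract states neither detail. `toy_text_encoder` is a hypothetical stand-in for CLIP's frozen text encoder, and all function and parameter names are illustrative.

```python
import math
import random


def linear(x, w, b):
    """Dense layer: w has one row per output unit."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bj for row, bj in zip(w, b)]


def init_linear(rng, n_in, n_out, scale=0.1):
    """Small random weights and zero biases (illustrative initialisation)."""
    w = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out


def mlp(x, params):
    """Two-layer bottleneck MLP with a ReLU in between."""
    (w1, b1), (w2, b2) = params
    hidden = [max(0.0, v) for v in linear(x, w1, b1)]
    return linear(hidden, w2, b2)


def l2_normalize(x):
    norm = math.sqrt(sum(v * v for v in x)) or 1.0
    return [v / norm for v in x]


def conditioned_prompt(context, image_feat, meta_net):
    """Assumption: the input-specific network outputs one shift vector
    that is added to every learnable context vector."""
    shift = mlp(image_feat, meta_net)
    return [[c + s for c, s in zip(token, shift)] for token in context]


def residual_adapter(feat, adapter, alpha):
    """Assumption: the lightweight network is blended with the original
    feature through a residual ratio alpha."""
    adapted = mlp(feat, adapter)
    return [alpha * a + (1.0 - alpha) * f for a, f in zip(adapted, feat)]


def toy_text_encoder(prompt, class_embedding):
    """Hypothetical stand-in for CLIP's text encoder: mean over tokens."""
    tokens = prompt + [class_embedding]
    dim = len(class_embedding)
    return [sum(t[i] for t in tokens) / len(tokens) for i in range(dim)]


def classify(image_feat, class_embeddings, context, meta_net,
             vis_adapter, txt_adapter, alpha=0.2, temperature=0.07):
    """Return class probabilities from cosine similarity of adapted features."""
    prompt = conditioned_prompt(context, image_feat, meta_net)
    v = l2_normalize(residual_adapter(image_feat, vis_adapter, alpha))
    logits = []
    for emb in class_embeddings:
        t = toy_text_encoder(prompt, emb)
        t = l2_normalize(residual_adapter(t, txt_adapter, alpha))
        logits.append(sum(a * b for a, b in zip(v, t)) / temperature)
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

In this sketch only `context`, `meta_net`, `vis_adapter`, and `txt_adapter` would be trained, while the CLIP encoders that produce `image_feat` and consume the prompt stay frozen. That is one plausible reading of the "pre-training + prompt tuning + fine-tuning" paradigm.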

Premier-TACO is a Few-Shot Policy Learner: Pretraining Multitask Representation via Temporal Action-Driven Contrastive Loss

TACO: Temporal Latent Action-Driven Contrastive Loss for Visual Reinforcement Learning

TACO: Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action

Learning to Drive by Watching YouTube Videos: Action-Conditioned Contrastive Policy Pretraining

Multi-Modal Few-Shot Temporal Action Detection

Learning Visual Robotic Control Efficiently with Contrastive Pre-training and Data Augmentation

Vision Models Can Be Efficiently Specialized via Few-Shot Task-Aware Compression

Trajectory-aligned Space-time Tokens for Few-shot Action Recognition

Task-aware prototype refinement for improved few-shot learning

Task Cooperation for Semi-Supervised Few-Shot Learning

Task-specific alignment and multiple-level transformer for few-shot action recognition

Intra-task Mutual Attention based Vision Transformer for Few-Shot Learning

Few-shot Action Recognition with Prototype-centered Attentive Learning

Task-Aware Dual-Representation Network for Few-Shot Action Recognition

TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment

Collaboration of Pre-trained Models Makes Better Few-shot Learner

CoCoOpter: Pre-train, prompt, and fine-tune the vision-language model for few-shot image classification

CtrlFormer: Learning Transferable State Representation for Visual Control via Transformer

Temporal Action Proposal Generation Via Multi-Task Feature Learning

Few-Shot Scene Classification of Optical Remote Sensing Images Leveraging Calibrated Pretext Tasks