Matching Multi-Scale Feature Sets in Vision Transformer for Few-Shot Classification

Mingchen Song, Fengqin Yao, Guoqiang Zhong, Zhong Ji, Xiaowei Zhang
DOI: https://doi.org/10.1109/tcsvt.2024.3435003
IF: 5.859
2024-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: Recently, Transformer-based few-shot classification methods have been widely explored. However, they leverage feature information at only a single scale, which yields weak feature representations that cannot fully capture the rich information in a limited number of images of diverse objects at different scales, even objects belonging to the same category. To mitigate this issue, we propose a multi-scale feature set matching scheme in vision Transformer for few-shot classification, named FSViT, which can extract sufficiently discriminative features from the small number of labeled support examples. Concretely, we establish a patch-based multi-scale feature representation on top of the feature extractors of FSViT, introducing an attention-aware grid pooling operation that merges adjacent patches at various scales to obtain multi-scale feature sets. Moreover, we devise a multi-scale patch matching metric that aggregates the similarity measured over the multi-scale feature sets for few-shot classification. Extensive experiments demonstrate the effectiveness of the proposed FSViT in both 1-shot and 5-shot scenarios on standard single-domain and cross-domain few-shot classification; in particular, it improves the state-of-the-art recognition accuracy by 1.27% and 1.33% on average on the Mini-ImageNet and CIFAR-FS datasets, respectively. The code of FSViT is available at https://github.com/codeshop715/FSViT.
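To make the two ideas in the abstract concrete, the following is a minimal sketch, not the FSViT implementation. It assumes patch tokens laid out on a square grid, uses a weighted average as a stand-in for the attention-aware grid pooling, and uses a best-match cosine score averaged over scales as a stand-in for the multi-scale patch matching metric. All function names, the `weights` inputs, and the choice of scales are hypothetical; the authors' repository holds the actual method.

```python
import math

def grid_pool(tokens, weights, side, cell):
    """Merge each cell x cell block of a side x side patch grid into one vector.

    Assumption: `weights` are positive per-patch scores standing in for
    attention; the paper's actual pooling rule may differ.
    """
    dim = len(tokens[0])
    pooled = []
    for r0 in range(0, side, cell):
        for c0 in range(0, side, cell):
            # Row-major indices of the patches that fall inside this cell.
            idx = [r * side + c
                   for r in range(r0, min(r0 + cell, side))
                   for c in range(c0, min(c0 + cell, side))]
            total = sum(weights[i] for i in idx)
            pooled.append([sum(weights[i] * tokens[i][d] for i in idx) / total
                           for d in range(dim)])
    return pooled

def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def set_similarity(query_set, support_set):
    """Mean, over query vectors, of the best cosine match in the support set.

    Assumption: a simple set-to-set score chosen for illustration only.
    """
    best = [max(cosine(q, s) for s in support_set) for q in query_set]
    return sum(best) / len(best)

def multi_scale_score(query, support, q_weights, s_weights, side, cells=(1, 2)):
    """Average the set similarity over several pooling scales."""
    scores = []
    for cell in cells:
        q_set = grid_pool(query, q_weights, side, cell)
        s_set = grid_pool(support, s_weights, side, cell)
        scores.append(set_similarity(q_set, s_set))
    return sum(scores) / len(scores)
```

In a few-shot episode, such a score would be computed between a query image and each class's support images, and the query assigned to the highest-scoring class. Averaging across scales is one plausible aggregation; the abstract does not specify which one FSViT uses.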