Better Integrating Vision and Semantics for Improving Few-shot Classification

Zhuoling Li,Yong Wang
DOI: https://doi.org/10.1145/3581783.3613819
2023-01-01
Abstract:Some recent methods address few-shot classification by integrating visual and semantic prototypes. However, they usually ignore the difference in feature structure between the visual and semantic modalities, which leads to limited performance improvements. In this paper, we propose a novel method, called bimodal integrator (BMI), to better integrate visual and semantic prototypes. In BMI, we first construct a latent space for each modality via a variational autoencoder, and then align the semantic latent space to the visual latent space. Through this semantics-to-vision alignment, the semantic modality is mapped to the visual latent space and has the same feature structure as the visual modality. As a result, the visual and semantic prototypes can be better integrated. In addition, based on the multivariate Gaussian distribution and the prompt engineering, a data augmentation scheme is designed to ensure the accuracy of modality alignment during the training process. Experimental results demonstrate that BMI significantly improves few-shot classification, making simple baselines outperform the most advanced methods on miniImageNet and tieredImageNet datasets.
What problem does this paper attempt to address?