VSA: Adaptive Visual and Semantic Guided Attention on Few-Shot Learning

Jin Chai, Yisheng Chen, Weinan Shen, Tong Zhang, C. L. Philip Chen
DOI: https://doi.org/10.1007/978-3-031-20497-5_23
2022-01-01
Abstract: Training models with only a few samples often leads to overfitting and poor generalization. Moreover, identifying new classes from small samples has always been challenging. However, studies have shown that humans can use prior knowledge such as vision and semantics to learn new categories from a small number of samples. We propose a bimodal attention mechanism (VSA) based on vision and semantics to make better use of this prior knowledge, as humans do. VSA adaptively combines information from the visual and semantic modalities to guide visual feature extraction, that is, to decide which features should receive more attention during feature extraction. As a result, a new category becomes more discriminative even if only one sample exists. Extensive experiments on miniImageNet, CIFAR-FS, and CUB demonstrate that our bimodal attention mechanism is effective and achieves state-of-the-art results on the CUB dataset.
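The abstract does not specify the architecture, so the following is only a minimal sketch of one way an adaptive visual and semantic attention could reweight feature channels; it is not the authors' implementation. The function name `bimodal_attention`, the weight matrices `w_vis`, `w_sem`, `w_mix`, the sigmoid gating, and the scalar mixing coefficient `lam` are all assumptions introduced for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vec):
    # Plain matrix-vector product on nested lists.
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def bimodal_attention(feature_map, semantic_vec, w_vis, w_sem, w_mix):
    """Hypothetical channel attention mixing a visual and a semantic branch.

    feature_map:  list of C channels, each a flat list of spatial activations
    semantic_vec: list of D values (e.g. a class-name word embedding)
    w_vis: C x C weights, w_sem: C x D weights, w_mix: C + D weights
    (all assumed to be learned; none of these names come from the paper)
    """
    # Global average pooling gives one descriptor value per channel.
    pooled = [sum(ch) / len(ch) for ch in feature_map]
    # Each modality proposes its own per-channel attention in (0, 1).
    vis_att = [sigmoid(z) for z in matvec(w_vis, pooled)]
    sem_att = [sigmoid(z) for z in matvec(w_sem, semantic_vec)]
    # Adaptive scalar coefficient deciding how much to trust each modality.
    lam = sigmoid(sum(w * x for w, x in zip(w_mix, pooled + semantic_vec)))
    att = [lam * v + (1.0 - lam) * s for v, s in zip(vis_att, sem_att)]
    # Reweight every channel of the visual feature map.
    reweighted = [[a * x for x in ch] for a, ch in zip(att, feature_map)]
    return reweighted, att, lam
```

In this sketch a value of `lam` near 1 lets the visual branch dominate, while a value near 0 defers to the semantic prior, which is one plausible reading of "adaptively combine" when only a single support image is available.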