Abstract: Automatic species recognition is important for the conservation and management of biodiversity. However, because closely related species are visually similar, it is difficult to distinguish them from images alone. In addition, traditional species-recognition models are limited by dataset size and generalize poorly. Visual-language models such as Contrastive Language-Image Pretraining (CLIP), trained on large-scale datasets, learn excellent visual representations and have demonstrated promising few-shot transfer ability in a variety of few-shot species-recognition tasks. However, limited by the data on which CLIP was trained, its performance is poor when it is used directly for few-shot species recognition. To improve this performance, we propose a few-shot species-recognition method that incorporates geolocation information. First, we use CLIP's feature extraction capability to extract image features and text features. Second, we construct a geographic feature extraction module that converts structured geographic location information into geographic feature representations, providing additional contextual information. Third, we construct a multimodal feature fusion module in which geographic features interact deeply with image features, and enhanced image features are obtained through a residual connection. Finally, we compute the similarity between the enhanced image features and the text features to obtain the species-recognition result. Extensive experiments on the iNaturalist 2021 dataset show that the proposed method significantly improves CLIP's few-shot species-recognition performance. With ViT-L/14 and 16 training samples per species (16-shot), our method improves on Linear probe CLIP by 6.22% (mammals), 13.77% (reptiles), and 16.82% (amphibians). Our work provides strong evidence for integrating geolocation information into species-recognition models based on visual-language models.
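
The four steps in the abstract (feature extraction, geographic encoding, residual fusion, similarity scoring) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the random vectors stand in for CLIP image and text features, the sinusoidal `encode_location` encoding, the projection matrix `w_geo`, the mixing weight `alpha`, and the `temperature` value are all assumptions made for illustration, and the simple additive fusion here replaces whatever interaction the paper's fusion module actually uses.

```python
import numpy as np

def encode_location(lat_deg, lon_deg):
    """Map latitude/longitude (degrees) to a small periodic feature vector.
    Assumed encoding; the abstract does not specify one."""
    lat, lon = np.deg2rad(lat_deg), np.deg2rad(lon_deg)
    return np.array([np.sin(lat), np.cos(lat), np.sin(lon), np.cos(lon)])

def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def fuse(image_feat, geo_feat, w_geo, alpha=0.5):
    """Residual fusion: image feature plus a projected geographic feature.
    w_geo and alpha are hypothetical parameters for this sketch."""
    geo_proj = np.tanh(w_geo @ geo_feat)  # project 4-d geo vector to image dim
    return l2_normalize(image_feat + alpha * geo_proj)

def classify(image_feat, text_feats, geo_feat, w_geo, alpha=0.5, temperature=0.01):
    """Softmax over cosine similarities between the enhanced image feature
    and one text feature per species."""
    enhanced = fuse(image_feat, geo_feat, w_geo, alpha)
    logits = l2_normalize(text_feats) @ enhanced / temperature
    logits = logits - logits.max()  # subtract max for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()

# Placeholder features; in practice these would come from CLIP's encoders.
rng = np.random.default_rng(0)
dim, n_species = 8, 3
image_feat = l2_normalize(rng.normal(size=dim))
text_feats = rng.normal(size=(n_species, dim))
w_geo = 0.1 * rng.normal(size=(dim, 4))

probs = classify(image_feat, text_feats, encode_location(35.0, 139.0), w_geo)
print("species probabilities:", probs, "predicted index:", int(probs.argmax()))
```

Setting `alpha=0` in this sketch recovers the unmodified image feature, which is the point of a residual connection: the geographic signal adjusts the CLIP representation rather than replacing it.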

Visual-Language Collaborative Representation Network for Broad-Domain Few-Shot Image Classification

FLEX-CLIP: Feature-Level GEneration Network Enhanced CLIP for X-shot Cross-modal Retrieval

iCLIP: Bridging Image Classification and Contrastive Language-Image Pre-Training for Visual Recognition

DiffCLIP: Few-shot Language-driven Multimodal Classifier

Few-Shot Common-Object Reasoning Using Common-Centric Localization Network

Improving the Generalization of Visual Classification Models Across IoT Cameras via Cross-modal Inference and Fusion

Learning transferable cross-modality representations for few-shot hyperspectral and LiDAR collaborative classification

ReCLIP: Refine Contrastive Language Image Pre-Training with Source Free Domain Adaptation

ResCLIP: Residual Attention for Training-free Dense Vision-language Inference

Cross-Domain Few-Shot Classification based on Lightweight Res2Net and Flexible GNN

Hybrid Feature Collaborative Reconstruction Network for Few-Shot Fine-Grained Image Classification

Saliency-Guided Mutual Learning Network for Few-shot Fine-grained Visual Recognition

Multi-branch Collaborative Learning Network for 3D Visual Grounding

CLIP-Driven Few-Shot Species-Recognition Method for Integrating Geographic Information

Not All Instances Contribute Equally: Instance-Adaptive Class Representation Learning for Few-Shot Visual Recognition

RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition

A Progressive Framework of Vision-language Knowledge Distillation and Alignment for Multilingual Scene

Category Relevance Redirection Network for Few-Shot Classification

Few-Shot Fine-Grained Image Classification via Multi-Frequency Neighborhood and Double-Cross Modulation

VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts