Abstract: Fine-grained representation is fundamental to deep-learning-based species classification, and cross-modal contrastive learning is an effective way to obtain it. The diversity of species, coupled with the inherent contextual ambiguity of natural language, is a primary challenge for cross-modal representation alignment on conservation area image data. Integrating cross-modal retrieval tasks with generation tasks supports representation alignment grounded in contextual understanding. During contrastive learning, however, a pair of encoders inevitably learns not only the differences in the data itself but also the differences caused by encoder fluctuations. The latter lead to convergence shortcuts, which degrade representation quality and cause the shared feature space to reflect the similarity relationships between samples in the original dataset inaccurately. To achieve fine-grained cross-modal representation alignment, we first propose a residual attention network that enhances consistency during momentum updates in the cross-modal encoders. Building on this, we propose momentum encoding from a multi-task perspective as a bridge for cross-modal information, which improves cross-modal mutual information and representation quality and optimizes the distribution of feature points within the shared cross-modal semantic space. By acquiring momentum encoding queues for cross-modal semantic understanding through multi-tasking, we align ambiguous natural language representations around the invariant image features of factual information, alleviating contextual ambiguity and enhancing model robustness.
Experimental validation shows that the proposed multi-task cross-modal momentum encoders outperform similar models on standardized image classification tasks and image–text cross-modal retrieval tasks on public datasets by up to 8% on the leaderboard, demonstrating the effectiveness of the proposed method. Qualitative experiments on our self-built conservation area image–text paired dataset show that the method accurately performs cross-modal retrieval and generation tasks among 8142 species, demonstrating its effectiveness on fine-grained conservation area image–text datasets.
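As a rough illustration of two mechanisms the abstract names, momentum updates of an encoder and a queue of momentum-encoded features used as contrastive negatives, the following pure-Python sketch may help. It is not the paper's implementation: the residual attention network and the generation task are not shown, and the function names, momentum coefficient, temperature, queue size, and toy feature values are all assumptions chosen for illustration.

```python
import math
from collections import deque


def momentum_update(online, momentum, m=0.999):
    """EMA update (assumed form): momentum <- m * momentum + (1 - m) * online."""
    return [m * k + (1.0 - m) * q for q, k in zip(online, momentum)]


def l2_normalize(v):
    """Scale a vector to unit length (a zero vector is left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(query, positive_key, negative_keys, temperature=0.07):
    """InfoNCE loss: cross-entropy of the positive pair against queued negatives."""
    q = l2_normalize(query)
    logits = [dot(q, l2_normalize(positive_key)) / temperature]
    logits += [dot(q, l2_normalize(k)) / temperature for k in negative_keys]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum - logits[0]


# Fixed-size queue of momentum-encoded text features (hypothetical size of 4).
text_queue = deque(maxlen=4)
for feat in ([0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]):
    text_queue.append(feat)

image_feat = [1.0, 0.1]    # toy output of the online image encoder
matched_text = [0.9, 0.2]  # toy output of the momentum text encoder for the paired caption
loss = info_nce(image_feat, matched_text, list(text_queue))
text_queue.append(matched_text)  # enqueue the newest key; the oldest is dropped once full
```

In this sketch the momentum encoder changes slowly, so the keys stored in the queue stay roughly consistent with one another across training steps. That consistency is the property the abstract attributes to momentum updates.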

Contrastive ground-level image and remote sensing pre-training improves representation learning for natural world imagery

Multimodal Contrastive Training for Visual Representation Learning

Align Yourself: Self-supervised Pre-training for Fine-grained Recognition via Saliency Alignment

Contrastive Learning Based on Multiscale Hard Features for Remote-Sensing Image Scene Classification

Contrastive Object-level Pre-training with Spatial Noise Curriculum Learning

Multi-Label Guided Soft Contrastive Learning for Efficient Earth Observation Pretraining

CAVL: Learning Contrastive and Adaptive Representations of Vision and Language

Improving Contrastive Learning on Visually Homogeneous Mars Rover Images

Multi-task contrastive learning for change detection in remote sensing images

What Makes for Good Views for Contrastive Learning?

Vision-Language Pre-Training with Triple Contrastive Learning

P4Contrast: Contrastive Learning with Pairs of Point-Pixel Pairs for RGB-D Scene Understanding

Contrastive Learning for Urban Land Cover Classification With Multimodal Siamese Network

RC3: Regularized Contrastive Cross-lingual Cross-modal Pre-training

Fine-Grained Cross-Modal Semantic Consistency in Natural Conservation Image Data from a Multi-Task Perspective

CSP: Self-Supervised Contrastive Spatial Pre-Training for Geospatial-Visual Representations

Non-Contrastive Learning Meets Language-Image Pre-Training

Multimodal Visual-Tactile Representation Learning through Self-Supervised Contrastive Pre-Training

HAPiCLR: heuristic attention pixel-level contrastive loss representation learning for self-supervised pretraining