Abstract: Capitalizing on vast amounts of image-text data, large-scale vision-language pre-training has demonstrated remarkable zero-shot capabilities and has been utilized in several applications. However, models trained on general, everyday web-crawled data often exhibit sub-optimal performance in specialized domains, likely due to domain shift. Recent works have tackled this problem for some domains (e.g., healthcare) by constructing domain-specialized image-text data. However, constructing a dedicated large-scale image-text dataset for the sustainable area of agriculture and livestock remains an open research problem. Further, this domain requires fine-grained feature learning due to the subtle nature of the downstream tasks (e.g., nutrient deficiency detection, livestock breed classification). To address this, we present AgriCLIP, a vision-language foundational model dedicated to the domain of agriculture and livestock. First, we propose a large-scale dataset, named ALive, that leverages a customized prompt generation strategy to overcome the scarcity of expert annotations. Our ALive dataset covers crops, livestock, and fishery, with around 600,000 image-text pairs. Second, we propose a training pipeline that integrates both contrastive and self-supervised learning to learn both global semantic and local fine-grained domain-specialized features. Experiments on a diverse set of 20 downstream tasks demonstrate the effectiveness of the AgriCLIP framework, achieving an absolute gain of 7.8% in average zero-shot classification accuracy over the standard CLIP adaptation via the domain-specialized ALive dataset. Our ALive dataset and code are available on GitHub at https://github.com/umair1221/AgriCLIP/tree/main.
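To make the zero-shot classification metric in the abstract concrete, here is a minimal sketch of how a CLIP-style model typically assigns a label: the image embedding is compared with the embedding of one text prompt per class, and the most similar class wins. This is not the authors' code. The function names, the class names, and the 3-dimensional toy vectors are all hypothetical and stand in for real encoder outputs.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_classify(image_emb, class_text_embs):
    """Return the best-matching class name and all cosine similarities.

    class_text_embs maps a class name to the embedding of its text prompt.
    """
    image_unit = l2_normalize(image_emb)
    scores = {}
    for name, text_emb in class_text_embs.items():
        text_unit = l2_normalize(text_emb)
        scores[name] = sum(a * b for a, b in zip(image_unit, text_unit))
    best = max(scores, key=scores.get)
    return best, scores


# Toy embeddings invented for illustration only (not model outputs).
toy_text_embs = {
    "healthy leaf": [0.9, 0.1, 0.0],
    "nitrogen-deficient leaf": [0.1, 0.9, 0.2],
}
toy_image_emb = [0.2, 0.8, 0.1]

predicted, similarities = zero_shot_classify(toy_image_emb, toy_text_embs)
print(predicted, similarities)
```

Because no classifier head is trained on the downstream labels, accuracy under this scheme depends entirely on how well the image and text embedding spaces are aligned, which is the property that domain-specific pre-training aims to improve.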

What problem does this paper attempt to address?

The main problem this paper addresses is the poor performance of existing vision-language pre-training models (such as CLIP) on tasks in agriculture and livestock. These models are usually pre-trained on general image-text data crawled from the web, so when they are applied to specialized domains such as agriculture and livestock, their performance often suffers from domain shift.

### Main problems:

1. **Lack of domain-specific image-text data**: Agriculture and livestock lack comprehensive image-text data sources. Most existing data sets are limited to narrow tasks (such as disease classification) and contain only images and task-specific information (such as class names), which limits their use in vision-language pre-training.
2. **Need for fine-grained feature learning**: Many downstream tasks in agriculture and livestock require learning subtle visual features. For example, identifying plant diseases requires distinguishing small color changes on leaves or subtle differences in rust spots. Contrastive learning alone may not be sufficient to capture these details.

### Solutions:

To solve the above problems, the authors propose the AgriCLIP framework, which has two main parts:

1. **Construct a large-scale image-text data set, ALive**:
   - The authors collected 25 classification data sets covering crops, livestock, and fisheries, and built the large-scale ALive data set, which contains approximately 600,000 image-text pairs.
   - A customized prompt generation strategy combines metadata and category information and uses GPT-4 to generate diverse descriptive texts that enrich the context of each image.
2. **Design a specialized training pipeline**:
   - **Semantic feature learning**: Further pre-train the visual and text encoders of CLIP on the ALive data set with contrastive learning to capture global semantic features.
   - **Fine-grained feature learning**: Adopt the DINO self-supervised learning method to strengthen the visual encoder's ability to learn local fine-grained features.
   - **Cross-modal alignment**: Align the visual encoder with the text encoder so that the model can perform zero-shot classification.

### Experimental results:

The experimental results show that the zero-shot classification performance of AgriCLIP on 20 downstream tasks is better than that of both the standard CLIP and the CLIP version further pre-trained on the ALive data set. Specifically, the abstract reports an absolute gain of 7.8% in average zero-shot classification accuracy over the standard CLIP adapted with the ALive data set, demonstrating the framework's effectiveness in agriculture and livestock. With these improvements, AgriCLIP can better handle vision-language tasks in agriculture and livestock, especially where fine-grained feature learning is required.
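
The two learning objectives named in the training pipeline above can be sketched in a few lines. The following is a minimal pure-Python illustration of a symmetric CLIP-style contrastive (InfoNCE) loss and a DINO-style self-distillation loss for a single pair of views. It is not the authors' implementation: the function names are hypothetical, the temperature values are commonly used defaults chosen here as assumptions, and the inputs are toy numbers rather than encoder outputs.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def softmax(logits, temperature=1.0):
    """Softmax with temperature; lower temperature gives a sharper distribution."""
    scaled = [x / temperature for x in logits]
    return [math.exp(v) for v in log_softmax(scaled)]


def clip_contrastive_loss(sim, temperature=0.07):
    """Symmetric InfoNCE loss over an N x N image-text similarity matrix.

    sim[i][j] is the cosine similarity of image i and text j; matched
    pairs sit on the diagonal. The temperature is an assumed default.
    """
    n = len(sim)
    logits = [[s / temperature for s in row] for row in sim]
    # Image-to-text direction: row i should select column i.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: column j should select row j.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (i2t + t2i)


def dino_loss(student_logits, teacher_logits, center,
              t_student=0.1, t_teacher=0.04):
    """Cross-entropy between the centered, sharpened teacher distribution
    and the student distribution. Temperatures are assumed defaults."""
    centered = [t - c for t, c in zip(teacher_logits, center)]
    teacher_probs = softmax(centered, t_teacher)
    student_logp = log_softmax([s / t_student for s in student_logits])
    return -sum(p * lp for p, lp in zip(teacher_probs, student_logp))


# Toy inputs, invented for illustration only.
aligned = [[1.0, 0.0], [0.0, 1.0]]    # matched pairs are most similar
shuffled = [[0.0, 1.0], [1.0, 0.0]]   # matched pairs are least similar
loss_aligned = clip_contrastive_loss(aligned)
loss_shuffled = clip_contrastive_loss(shuffled)

teacher = [2.0, 0.0, 0.0]
center = [0.0, 0.0, 0.0]
loss_agree = dino_loss([2.0, 0.0, 0.0], teacher, center)
loss_disagree = dino_loss([0.0, 2.0, 0.0], teacher, center)
print(loss_aligned, loss_shuffled, loss_agree, loss_disagree)
```

The contrastive term rewards image-text pairs that agree globally, while the self-distillation term needs no text at all and trains the visual encoder to produce consistent outputs across views of the same image, which is why the summary associates it with local fine-grained features.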

AgriCLIP: Adapting CLIP for Agriculture and Livestock via Domain-Specialized Cross-Model Alignment

RemoteCLIP: A Vision Language Foundation Model for Remote Sensing

ReCLIP: Refine Contrastive Language Image Pre-Training with Source Free Domain Adaptation

AgroGPT: Efficient Agricultural Vision-Language Model with Expert Tuning

WildCLIP: Scene and Animal Attribute Retrieval from Camera Trap Data with Domain-Adapted Vision-Language Models

DiffCLIP: Few-shot Language-driven Multimodal Classifier

Alpha-CLIP: A CLIP Model Focusing on Wherever You Want

UMG-CLIP: A Unified Multi-Granularity Vision Generalist for Open-World Understanding

EyeCLIP: A visual-language foundation model for multi-modal ophthalmic image analysis

BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs

FiGCLIP: Fine-Grained CLIP Adaptation via Densely Annotated Videos

MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training

ResCLIP: Residual Attention for Training-free Dense Vision-language Inference

Improving CLIP Training with Language Rewrites

From Scarcity to Efficiency: Improving CLIP Training via Visual-enriched Captions

GrowCLIP: Data-aware Automatic Model Growing for Large-scale Contrastive Language-Image Pre-training

CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction

CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement