Abstract: Significant progress has been made in advancing large multimodal conversational models (LMMs), capitalizing on vast repositories of image-text data available online. Despite this progress, these models often encounter substantial domain gaps, hindering their ability to engage in complex conversations across new domains. Recent efforts have aimed to mitigate this issue, albeit relying on domain-specific image-text data to curate instruction-tuning data. However, many domains, such as agriculture, lack such vision-language data. In this work, we propose an approach to construct instruction-tuning data that harnesses vision-only data for the agriculture domain. We utilize diverse agricultural datasets spanning multiple domains, curate class-specific information, and employ large language models (LLMs) to construct an expert-tuning set, resulting in a 70k expert-tuning dataset called AgroInstruct. Subsequently, we expert-tuned and created AgroGPT, an efficient LMM that can hold complex agriculture-related conversations and provide useful insights. We also develop AgroEvals for evaluation and compare AgroGPT's performance with large open and closed-source models. AgroGPT excels at identifying fine-grained agricultural concepts, can act as an agriculture expert, and provides helpful information for multimodal agriculture questions. The code, datasets, and models are available at https://github.com/awaisrauf/agroGPT.

What problem does this paper attempt to address?

This paper addresses the lack of high-quality image-text data in agriculture, which causes existing large multimodal models (LMMs) to perform poorly in that domain. Specifically:

1. **Existing problems**:
   - Although existing LMMs perform well on general-domain tasks, a significant domain gap in specialized fields such as agriculture prevents them from accurately identifying agricultural concepts and answering complex agriculture-related questions.
   - Agriculture lacks sufficient image-text pair data, which makes it difficult to build datasets for instruction tuning.
2. **Solutions**:
   - The paper proposes a method for generating expert-level instruction-tuning data from vision-only agricultural datasets, in three steps:
     1. **Data synthesis**: Extract information from image datasets across multiple agricultural domains and use state-of-the-art language models to generate rich, context-based image descriptions.
     2. **Complex dialogue generation**: Use large language models (LLMs) together with image attributes, information from external agricultural resources, and in-context examples to generate complex multi-round dialogues.
     3. **Simple question-answer pair generation**: Generate simple question-answer pairs from the attributes of the image datasets to strengthen the model's ability to recognize specific elements.
3. **Contributions**:
   - Constructed **AgroInstruct**, a 70k expert-tuning dataset covering complex multi-round dialogues, simple question-answer pairs, and image descriptions.
   - Developed **AgroGPT**, an efficient multimodal conversational model for agriculture that can hold complex dialogues about agricultural images and provide useful insights.
   - Proposed **AgroEvals**, a visual question-answering framework for evaluating model performance in agriculture.

Through these methods, the paper aims to bridge the multimodal data gap in agriculture and to improve the performance and practicality of multimodal models in that domain.
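To make the data-construction steps more concrete, the following is a minimal Python sketch of how class labels from a vision-only dataset might be turned into simple question-answer pairs and into a text prompt for LLM-based dialogue generation. It is an illustration under stated assumptions, not the authors' implementation: the `ImageRecord` schema, the question templates, the prompt wording, and the function names `make_simple_qa` and `build_dialogue_prompt` are all hypothetical, and the step that actually sends the prompt to an LLM is left out.

```python
import random
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ImageRecord:
    """One labeled image from a vision-only dataset (hypothetical schema)."""
    path: str
    attributes: Dict[str, str]  # e.g. {"crop": "tomato", "disease": "late blight"}


# Hypothetical question templates, keyed by attribute name.
QUESTION_TEMPLATES: Dict[str, List[str]] = {
    "crop": ["What crop is shown in this image?", "Which plant is this?"],
    "disease": ["What disease affects the plant in this image?"],
}


def make_simple_qa(record: ImageRecord, rng: random.Random) -> List[Dict[str, str]]:
    """Turn each attribute that has a template into one question-answer pair."""
    pairs = []
    for name, value in record.attributes.items():
        templates = QUESTION_TEMPLATES.get(name)
        if not templates:
            continue  # no template for this attribute, so skip it
        pairs.append(
            {"image": record.path, "question": rng.choice(templates), "answer": value}
        )
    return pairs


def build_dialogue_prompt(
    record: ImageRecord, background: str, examples: List[str]
) -> str:
    """Assemble a text-only prompt from attributes, external notes, and examples."""
    attribute_lines = "\n".join(f"- {k}: {v}" for k, v in record.attributes.items())
    example_block = "\n\n".join(examples)
    return (
        "You are an agriculture expert. Write a multi-round conversation "
        "about an image with the following attributes.\n"
        f"Attributes:\n{attribute_lines}\n\n"
        f"Background:\n{background}\n\n"
        f"Example conversations:\n{example_block}\n"
    )


# Usage with placeholder data (the path and texts are made up for illustration).
record = ImageRecord("images/leaf_001.jpg", {"crop": "tomato", "disease": "late blight"})
qa_pairs = make_simple_qa(record, random.Random(0))
prompt = build_dialogue_prompt(record, "Placeholder background text.", ["Q: ... A: ..."])
```

The key point of the sketch is that the language model never needs to see the image: class labels and curated text stand in for the missing captions, which is what lets a vision-only dataset yield instruction-tuning conversations.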

AgroGPT: Efficient Agricultural Vision-Language Model with Expert Tuning
