AgroGPT: Efficient Agricultural Vision-Language Model with Expert Tuning

Muhammad Awais, Ali Husain Salem Abdulla Alharthi, Amandeep Kumar, Hisham Cholakkal, Rao Muhammad Anwer
2024-10-11
Abstract: Significant progress has been made in advancing large multimodal conversational models (LMMs), capitalizing on vast repositories of image-text data available online. Despite this progress, these models often encounter substantial domain gaps, hindering their ability to engage in complex conversations across new domains. Recent efforts have aimed to mitigate this issue, albeit relying on domain-specific image-text data to curate instruction-tuning data. However, many domains, such as agriculture, lack such vision-language data. In this work, we propose an approach to construct instruction-tuning data that harnesses vision-only data for the agriculture domain. We utilize diverse agricultural datasets spanning multiple domains, curate class-specific information, and employ large language models (LLMs) to construct an expert-tuning set, resulting in a 70k expert-tuning dataset called AgroInstruct. Subsequently, we expert-tuned and created AgroGPT, an efficient LMM that can hold complex agriculture-related conversations and provide useful insights. We also develop AgroEvals for evaluation and compare AgroGPT's performance with large open and closed-source models. AgroGPT excels at identifying fine-grained agricultural concepts, can act as an agriculture expert, and provides helpful information for multimodal agriculture questions. The code, datasets, and models are available at https://github.com/awaisrauf/agroGPT.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the lack of high-quality image-text data in agriculture, which causes existing large multimodal models (LMMs) to perform poorly in that domain. Specifically:

1. **Existing problems**:
   - Although existing LMMs perform well on general-domain tasks, they show a significant domain gap in specialized fields such as agriculture, so they cannot accurately identify agricultural concepts or answer complex agriculture-related questions.
   - Agriculture lacks sufficient image-text pair data, which makes it difficult to build datasets for instruction tuning.
2. **Solutions**:
   - The paper proposes a method for generating expert-level instruction-tuning data from vision-only agricultural datasets. The method has the following steps:
     1. **Data synthesis**: Extract information from image datasets spanning multiple agricultural domains and use state-of-the-art language models to generate rich, context-based image descriptions.
     2. **Complex dialogue generation**: Use large language models (LLMs), together with image attributes, information from external agricultural resources, and in-context examples, to generate complex multi-round dialogues.
     3. **Simple question-answer pair generation**: Generate simple question-answer pairs from the attributes of each image dataset to strengthen the model's ability to recognize specific elements.
3. **Contributions**:
   - Constructed **AgroInstruct**, a 70k expert-tuning dataset covering complex multi-round dialogues, simple question-answer pairs, and image descriptions.
   - Developed **AgroGPT**, an efficient multimodal conversational model for agriculture that can hold complex dialogues about agricultural images and provide useful insights.
   - Proposed **AgroEvals**, a visual question-answering framework for evaluating model performance in the agricultural domain.

Through these methods, the paper aims to bridge the multimodal data gap in agriculture and improve the performance and practicality of multimodal models in that domain.
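
To make the data-construction idea more concrete, the following is a minimal Python sketch of how labelled images from a vision-only dataset might be turned into simple question-answer pairs and into a text prompt for LLM-based dialogue generation. It is not the authors' implementation: the `ImageRecord` schema, the question templates, the prompt wording, and the function names are all assumptions made for illustration, and no actual LLM call is included.

```python
import json
import random
from dataclasses import dataclass


@dataclass
class ImageRecord:
    """One labelled image from a vision-only dataset (hypothetical schema)."""
    image_path: str
    attributes: dict  # e.g. {"crop": "tomato", "disease": "early blight"}


# Hypothetical question templates, keyed by attribute name.
QUESTION_TEMPLATES = {
    "crop": ["What crop is shown in this image?", "Which plant is this?"],
    "disease": ["What disease is affecting this plant?",
                "Which disease is visible here?"],
}


def make_simple_qa(record, rng):
    """Turn each known attribute of an image into one question-answer pair."""
    pairs = []
    for name, value in record.attributes.items():
        templates = QUESTION_TEMPLATES.get(name)
        if not templates:
            continue  # no template for this attribute, so skip it
        pairs.append({
            "image": record.image_path,
            "question": rng.choice(templates),
            "answer": f"The {name} is {value}.",
        })
    return pairs


def build_dialogue_prompt(record, class_info, in_context_example):
    """Assemble a text-only prompt asking an LLM for a multi-round dialogue.

    The LLM never sees the image; it only receives the attributes,
    class-specific background text, and an example conversation.
    """
    attribute_text = "; ".join(f"{k}: {v}" for k, v in record.attributes.items())
    return (
        "You are an agriculture expert looking at an image.\n"
        f"Image attributes: {attribute_text}\n"
        f"Background information: {class_info}\n"
        f"Example conversation:\n{in_context_example}\n"
        "Write a multi-round question-answer conversation about the image."
    )


# Usage with placeholder inputs (all values here are illustrative).
rng = random.Random(0)
record = ImageRecord("images/0001.jpg",
                     {"crop": "tomato", "disease": "early blight"})
qa_pairs = make_simple_qa(record, rng)
prompt = build_dialogue_prompt(
    record,
    class_info="(class-specific notes curated from agricultural resources)",
    in_context_example="Q: (example question)\nA: (example answer)",
)
print(json.dumps(qa_pairs, indent=2))
print(prompt)
```

The sketch keeps the two generation paths separate on purpose: template-based pairs need no language model and target recognition of specific attributes, while the prompt builder only prepares the text that would be sent to whichever LLM is used for the complex dialogues.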