Citrus Diseases and Pests Image-Text Retrieval Based on Multi-Modal Transformer

Yaoguang Wei, Xingyu Liu, Dong An, Jincun Liu
DOI: https://doi.org/10.1109/hdis60872.2023.10499461
2023-01-01
Abstract: Current agricultural information technologies mostly rely on single-modal data and lack connections between images and text. For instance, image-based disease recognition requires first identifying the disease in the image and then retrieving the relevant knowledge, which makes information retrieval less efficient and more error-prone. In this paper, we apply cross-modal retrieval to agricultural information processing and propose a deep learning framework for cross-modal image-text retrieval of citrus diseases and pests. Specifically, we use a CNN backbone to extract global spatial features of the image for visual embedding, instead of the Faster R-CNN used by mainstream methods. We fine-tune RoBERTa to encode the language information of a sentence and train new embedding representations for keywords in citrus-domain sentences. We then use a multi-modal Transformer to learn cross-modal attention over data from the different modalities, and drive the model to learn the similarity between input samples through a matching loss and a label loss. We conducted experiments and ablation studies on a self-collected cross-modal retrieval dataset of citrus diseases and pests and compared our method with mainstream methods. The results show that our method achieves satisfactory results on both retrieval tasks, image-to-text and text-to-image.
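The two retrieval tasks named in the abstract can be illustrated with a minimal sketch. It assumes that some model (for example, the multi-modal Transformer described above) has already produced a similarity score for every image-text pair. The score matrix `SIM`, the function names, and the Recall@K metric are illustrative assumptions, not details taken from the paper.

```python
from typing import List, Sequence

def rank_candidates(scores: Sequence[float]) -> List[int]:
    """Return candidate indices sorted from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)

def image_to_text(sim: Sequence[Sequence[float]], image_idx: int, k: int) -> List[int]:
    """Top-k text indices for one query image (one row of the matrix)."""
    return rank_candidates(sim[image_idx])[:k]

def text_to_image(sim: Sequence[Sequence[float]], text_idx: int, k: int) -> List[int]:
    """Top-k image indices for one query text (one column of the matrix)."""
    column = [row[text_idx] for row in sim]
    return rank_candidates(column)[:k]

def recall_at_k(sim: Sequence[Sequence[float]], k: int) -> float:
    """Fraction of images whose paired text (same index) is in the top k.

    Assumes image i is paired with text i; this pairing is hypothetical.
    """
    hits = sum(1 for i in range(len(sim)) if i in image_to_text(sim, i, k))
    return hits / len(sim)

# Made-up scores: rows are images, columns are texts.
SIM = [
    [0.9, 0.2, 0.1],
    [0.3, 0.4, 0.8],
    [0.2, 0.1, 0.7],
]

if __name__ == "__main__":
    print(image_to_text(SIM, 0, 1))   # best text for image 0
    print(text_to_image(SIM, 2, 2))   # two best images for text 2
    print(recall_at_k(SIM, 1))
```

Both directions read the same score matrix: image-to-text ranks along a row, text-to-image ranks along a column. How the scores themselves are computed is left to the model.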