Arboretum: A Large Multimodal Dataset Enabling AI for Biodiversity

Chih-Hsuan Yang, Benjamin Feuer, Zaki Jubery, Zi K. Deng, Andre Nakkab, Md Zahid Hasan, Shivani Chiranjeevi, Kelly Marshall, Nirmal Baishnab, Asheesh K Singh, Arti Singh, Soumik Sarkar, Nirav Merchant, Chinmay Hegde, Baskar Ganapathysubramanian
2024-06-26
Abstract: We introduce Arboretum, the largest publicly accessible dataset designed to advance AI for biodiversity applications. This dataset, curated from the iNaturalist community science platform and vetted by domain experts to ensure accuracy, includes 134.6 million images, surpassing existing datasets in scale by an order of magnitude. The dataset encompasses image-language paired data for a diverse set of species from birds (Aves), spiders/ticks/mites (Arachnida), insects (Insecta), plants (Plantae), fungus/mushrooms (Fungi), snails (Mollusca), and snakes/lizards (Reptilia), making it a valuable resource for multimodal vision-language AI models for biodiversity assessment and agriculture research. Each image is annotated with scientific names, taxonomic details, and common names, enhancing the robustness of AI model training. We showcase the value of Arboretum by releasing a suite of CLIP models trained using a subset of 40 million captioned images. We introduce several new benchmarks for rigorous assessment, and report zero-shot accuracy along with evaluations across life stages, rare species, confounding species, and various levels of the taxonomic hierarchy. We anticipate that Arboretum will spur the development of AI models that can enable a variety of digital tools, ranging from pest control strategies and crop monitoring to worldwide biodiversity assessment and environmental conservation. These advancements are critical for ensuring food security, preserving ecosystems, and mitigating the impacts of climate change. Arboretum is publicly available, easily accessible, and ready for immediate use. Please see the project website (https://baskargroup.github.io/Arboretum/) for links to our data, models, and code.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper introduces Arboretum, a large multimodal dataset intended to advance artificial intelligence for biodiversity applications. The dataset contains approximately 134.6 million images, about an order of magnitude more than existing datasets. The images are sourced from the iNaturalist platform and vetted by domain experts to ensure accuracy. The dataset provides image-language pairs for species of birds, arachnids, insects, plants, fungi, snails, and reptiles, making it suitable for training multimodal vision-language models that support biodiversity and agricultural research. Each image is annotated with its scientific name, taxonomic details, and common name, which enhances the robustness of model training.

The authors release a suite of CLIP models trained on a subset of 40 million captioned images and propose new benchmarks that report zero-shot accuracy and evaluate performance across life stages, rare species, confounding species, and different levels of the taxonomic hierarchy. The release of Arboretum is expected to facilitate AI models for pest control, crop monitoring, global biodiversity assessment, and environmental conservation. These advancements are crucial for ensuring food security, protecting ecosystems, and mitigating the impacts of climate change. The dataset is open, easily accessible, and available for immediate use, with links to the data, models, and code provided on the project website.

The paper also discusses the challenges that existing AI methods face in biodiversity applications, such as the high cost of creating data, limited coverage, and poor model generalization, and compares Arboretum with other datasets. Additionally, it describes the data collection and preprocessing methods, the new benchmarks, and a new model (ARBOR CLIP) whose performance is evaluated on them.
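To make "accuracy at different levels of the taxonomic hierarchy" concrete, the following is a minimal sketch, not the authors' evaluation code. It assumes each prediction and each ground-truth label is available as a full lineage (a mapping from rank to taxon name). The rank list, the `accuracy_by_rank` helper, and the two-image example are all hypothetical and chosen only for illustration.

```python
from typing import Dict, Sequence

# Ranks from coarse to fine. This particular list is an assumption for the sketch.
RANKS = ["kingdom", "class", "order", "family", "genus", "species"]

Lineage = Dict[str, str]


def accuracy_by_rank(
    predicted: Sequence[Lineage],
    actual: Sequence[Lineage],
    ranks: Sequence[str] = RANKS,
) -> Dict[str, float]:
    """Return, for each rank, the fraction of predictions that match the truth."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("at least one example is required")

    correct = {rank: 0 for rank in ranks}
    for pred, true in zip(predicted, actual):
        for rank in ranks:
            if pred[rank] == true[rank]:
                correct[rank] += 1

    total = len(actual)
    return {rank: correct[rank] / total for rank in ranks}


# Hypothetical example: a monarch and a viceroy, two similar-looking butterflies.
MONARCH = {
    "kingdom": "Animalia", "class": "Insecta", "order": "Lepidoptera",
    "family": "Nymphalidae", "genus": "Danaus", "species": "Danaus plexippus",
}
VICEROY = {
    "kingdom": "Animalia", "class": "Insecta", "order": "Lepidoptera",
    "family": "Nymphalidae", "genus": "Limenitis", "species": "Limenitis archippus",
}

truth = [MONARCH, VICEROY]
preds = [MONARCH, MONARCH]  # the model mistakes the viceroy for a monarch

scores = accuracy_by_rank(preds, truth)
for rank in RANKS:
    print(f"{rank:8s} {scores[rank]:.2f}")
```

In this toy case the mistaken prediction is still correct down to the family level, so accuracy stays at 1.0 from kingdom through family and drops to 0.5 at genus and species. Reporting accuracy per rank in this way separates errors between confounding species from coarser misclassifications.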
In conclusion, the release of the Arboretum dataset provides a large, expert-vetted resource for AI research on biodiversity and is expected to drive the development of related technologies.