Abstract: Biodiversity data are currently being generated at an unprecedented rate from deployed field monitoring sensors (e.g., wildlife and insect cameras, sound recorders, radars), citizen science observations, digitised museum collections, and biodiversity and environmental research. Deep neural networks have made it possible to automatically identify species in multimedia data (e.g., image, sound, radar, DNA) with increasing accuracy and efficiency, a task that would otherwise be impossible for taxonomic experts to perform at the rate and scale at which these data are being generated. Artificial intelligence (AI) models can help understand biodiversity data and automate tasks. At Naturalis Biodiversity Center, we developed several AI species identification models using image or sound recognition for citizen science, collection management and biomonitoring purposes. We present here a pipeline for training large-scale AI species identification models that combines multiple sources of image training data covering the most commonly encountered macro-organisms in Europe. The training pipeline is shown in Fig. 1. First, 45.4 million images covering a total of 133,367 taxonomic names from six different data sources were pooled and mapped to 35.5 million images of 41,014 unique taxa using TaxonMap, a custom-developed software tool. From the pooled data, shared models for eight different species groups of plants, animals and fungi were trained using imbalance-mitigating techniques to increase data efficiency. Subsequently, the shared models were fine-tuned using data from each source to adapt to its species frequency distribution and local taxonomies. For the three insect species groups, specialised models that predict the life stage of the organism were also trained.
As shown in Fig. 2, the resulting species identification algorithm consists of 39 specialised models: one main model and eight species group models, each customised for four European organisations (36 models), plus three life-stage models for the three insect groups. Measured on the same test data, which were not used for training the models, the 2023 large-scale multi-source model (MSM), fine-tuned and customised for Observation.org, showed a significant performance improvement compared to the 2021 model trained only on Observation.org's own data. As shown in Fig. 3, the 2023 model not only includes more taxa, but also identifies species with greater accuracy, especially for the rarer taxa, as measured by average recall. Fig. 4 presents the effect of class imbalance on model performance by showing how accuracy and average recall change with the number of most common taxa included, alongside the number of training samples per taxon (right vertical axis). The analysis was performed on the mollusks species group for Observation.org. The right vertical axis shows the strong class imbalance among the roughly 800 taxa in this species group: the rarest taxon has only ten training images, while the most common taxon has about a thousand. Measured on the 2023 test data, the average accuracy for all taxa (right-most point in the figure) in this species group was 86%, and the average recall was 64%. As rarer taxa are included, average recall drops as expected, while accuracy drops less because it is mostly influenced by common taxa. Fig. 5 shows how the analysis of Fig. 4 can be used to compare different models, in this case the multi-source 2023 arthropod model customised for Norway vs. the 2022 arthropod model trained on only the Norwegian data. With all taxa included, the 2023 model improved accuracy by 5%, and it improved the identification of rarer taxa even more, by about 11%.
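The contrast between accuracy and average recall described for Fig. 4 can be illustrated with a minimal Python sketch. It rests on two assumptions that the abstract does not state: that "average recall" is the unweighted mean of per-taxon recall, and that the curve is built by ranking taxa by number of training images and scoring only test samples from the N most common taxa. The function name, taxon labels and counts are hypothetical, not taken from the authors' pipeline.

```python
from collections import Counter


def accuracy_and_average_recall(y_true, y_pred, train_counts, n_most_common):
    """Score predictions on the n taxa with the most training images.

    Assumption: "average recall" is the unweighted mean of per-taxon recall,
    so every taxon counts equally regardless of how common it is.
    """
    # Rank taxa from most to fewest training images (ties broken by name).
    ranked = sorted(train_counts, key=lambda taxon: (-train_counts[taxon], taxon))
    included = set(ranked[:n_most_common])

    correct = Counter()
    total = Counter()
    for truth, prediction in zip(y_true, y_pred):
        if truth in included:
            total[truth] += 1
            correct[truth] += int(truth == prediction)

    n_samples = sum(total.values())
    if n_samples == 0:
        raise ValueError("no test samples for the selected taxa")

    # Accuracy pools all samples, so common taxa dominate it.
    accuracy = sum(correct.values()) / n_samples
    # Average recall gives each taxon the same weight.
    average_recall = sum(correct[t] / total[t] for t in total) / len(total)
    return accuracy, average_recall


# Hypothetical toy data: three taxa with strongly imbalanced training counts.
TRAIN_COUNTS = {"common": 1000, "medium": 100, "rare": 10}
Y_TRUE = ["common"] * 8 + ["medium"] * 2 + ["rare"] * 2
Y_PRED = ["common"] * 8 + ["medium", "common"] + ["common", "common"]

if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, accuracy_and_average_recall(Y_TRUE, Y_PRED, TRAIN_COUNTS, n))
```

On this toy data, going from one to three included taxa lowers accuracy from 1.0 to 0.75, while average recall falls from 1.0 to 0.5. This is the same qualitative pattern as reported above: average recall is more sensitive to rare taxa than accuracy.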
The large-scale species identification model, with its 39 specialised models, has been deployed as an auto-scaling web service used by seven (in 2024) biodiversity portals in Europe, and has performed about 65 million identifications in the past 12 months (Aug 2023–Aug 2024). It allows citizen scientists and the interested public to identify European flora and fauna through web interfaces and/or interactive mobile apps, which speeds up the collection of citizen science data. Advanced features for this large-scale species identification model are under continuous development. In the 2023 model, we implemented explicit probability calibration of AI identifications, which enables automatic validation. Auto-validation is a feature that suggests which AI identifications carry low risk and can be accepted without expert review. Advanced features to be implemented in the 2024 model include providing prediction probabilities at all taxonomic levels (only species level in the 2023 model) and developing life-stage models for other species groups. Planned advanced features for 2025 include context-aware identification (using location, time and neighbouring species to improve identification), rejecting invalid or unusable input such as selfies, poor-quality images and unknown taxa (Hogeweg 2024), and image search (returning images similar to the input image). We have developed this large-scale multi-source model using citizen science observation data from several European biodiversity portals. The same AI training pipeline can be applied to develop other large-scale, multi-source algorithms: for biodiversity monitoring with sensor input (e.g., insect cameras), for digitised museum collection identification as part of the digitisation and collection management workflow, and for sound recognition models for citizen science and biomonitoring.
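The way calibrated probabilities can support auto-validation, as described above, might be sketched as follows. The abstract does not say which calibration method or acceptance rule is used, so this sketch substitutes temperature scaling (one common calibration technique) and a fixed probability threshold. The function names, taxon labels, temperature and threshold are all hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw model scores into probabilities, softened by a temperature.

    Assumption: temperature scaling stands in for the unspecified calibration
    method; a temperature above 1 makes the model less confident.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    norm = sum(exps)
    return [e / norm for e in exps]


def auto_validate(logits, taxa, temperature, threshold):
    """Return (taxon, probability, accepted) for the top prediction.

    The identification is accepted without expert review only if its
    calibrated probability reaches the (hypothetical) threshold.
    """
    probs = softmax_with_temperature(logits, temperature)
    best = max(range(len(probs)), key=probs.__getitem__)
    return taxa[best], probs[best], probs[best] >= threshold


# Hypothetical example: the same raw scores, with and without calibration.
TAXA = ["taxon_a", "taxon_b", "taxon_c"]
LOGITS = [4.0, 1.0, 0.0]

if __name__ == "__main__":
    print(auto_validate(LOGITS, TAXA, temperature=1.0, threshold=0.9))
    print(auto_validate(LOGITS, TAXA, temperature=2.0, threshold=0.9))
```

In this toy case the uncalibrated top probability passes a 0.9 threshold, whereas after softening with a temperature of 2 it does not and would go to expert review. In practice, the temperature and threshold would be chosen on held-out validated observations.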
