AI Species Identification Using Image and Sound Recognition for Citizen Science, Collection Management and Biomonitoring: From Training Pipeline to Large-Scale Models
Laurens Hogeweg,Ni Yan,Django Brunink,Khadija Ezzaki-Chokri,Wilfred Gerritsen,Rita Pucci,Burooj Ghani,Dan Stowell,Vincent Kalkman
DOI: https://doi.org/10.3897/biss.8.136839
2024-09-12
Biodiversity Information Science and Standards
Abstract:Biodiversity data are currently being generated at an unprecedented rate from deployed field monitoring sensors (e.g., wildlife and insect cameras, sound recorders, radars), citizen science observations, digitised museum collections, and biodiversity- and environmental-generated research. Deep neural networks have made it possible to automatically identify species on multimedia (e.g., image, sound, radar, DNA) with increasing accuracy and efficiency, a task that would otherwise be impossible for taxonomic experts to perform at the rate and scale at which these data are being generated. Artificial intelligence (AI) models can help understand biodiversity data and automate tasks. At Naturalis Biodiversity Center, we developed several AI species identification models using image or sound recognition for citizen science, collection management and biomonitoring purposes. We present here a pipeline for training large-scale AI species identification models combining multiple sources of image training data that cover the most commonly encountered macro-organisms in Europe. The training pipeline is shown in Fig. 1. First, 45.4 million images from a total of 133,367 taxonomic names from six different data sources were pooled and mapped into 35.5 million images of 41,014 unique taxa using a custom-developed software tool TaxonMap. From the pooled data, shared models for eight different plants, animals and fungi species groups were trained using imbalance mitigating techniques to increase data efficiency. Subsequently, the shared models were finetuned using data from each source to adapt to its species frequency distribution and local taxonomies. For the three insect species groups, specialised models that predict the life stage of the organism were also trained. As shown in Fig. 2, the resulting species identification algorithm consists of 39 specialised models, which includes one main model and eight species group models, each customised for four European organisations, plus three life-stage models for the three insect groups. Measured on the same test data, which have not been used for training the models, the 2023 large-scale multi-source model (MSM), fine-tuned and customised for Observation.org, showed significant performance improvement compared to the 2021 model trained with only their own data. As shown in Fig. 3, t he 2023 model not only includes more taxa, but also identifies species with greater accuracy, especially for the rarer taxa, as measured by average recall. Fig. 4 presents the effect of class imbalance on model performance by showing the relationship between the number of training samples (right vertical axis) and the accuracy and average recall including the number of most common taxa. Analysis is performed in the mollusks species group for Observation.org. The right vertical axis shows the strong class imbalance in data out of the around 800 taxa in this species group, with the rarest taxon having only ten training images and the most common taxon having about a thousand training images. Measured on the 2023 test data, the average accuracy for all taxa (right-most point in the figure) in this species group was 86%, with the average recall being 64%. By including rarer taxa, average recall drops as expected, while accuracy drops less, as accuracy is mostly influenced by common taxa. Fig. 5 shows how the analysis of Fig. 4 can be used to compare different models, in this case the multi-source 2023 arthropod model customised for Norway vs. the 2022 arthropod model trained on only the Norwegian data. The 2023 model showed an improved accuracy of 5% for all taxa included, and an even larger improvement on the identification of rarer taxa of about 11%. The large-scale species identification model, with its 39 specialised models, has been deployed as an auto-scaling web service used by seven (in 2024) biodiversity portals in Europe, and has performed about 65 million identifications in the past 12 months (Aug 2023–Aug 2024), allowing citizen scientists and interested public to identify European flora and fauna using web interface and/or interactive mobile apps, increasing the speed of collecting citizen science data. Continuous developments of advanced features for this large-scale species identification model are taking place. In the 2023 model, we have implemented explicit probability calibration of AI identifications, allowing automatic validation. Auto-validation is a feature that suggests those AI identifications of the data with low risk, without the need for expert review. Advanced features to be implemented in the 2024 model include providing prediction probabilities at all taxonomic levels (only species level in the 2023 model) and developing life-stage models for other species groups. Planned advanced features for 2025 include context-aware identification (using location, time and neighbouring species to improve identification), rejecting invalid and unusable input such as selfies, poor quality and unknown taxa (Hogeweg 2024), and image search (returning images similar to the input image). We have developed this large-scale multi-source model using citizen science observation data from several European biodiversity portals. This AI training pipeline can be applied to develop other large-scale, multi-source algorithms for biodiversity monitoring with sensor input (e.g., insect cameras), digitised museum collection identification as part of the digitisation and collection management workflow, and sound recognition models for citizen science and biomonitoring.