WildlifeReID-10k: Wildlife re-identification dataset with 10k individual animals

Lukáš Adam, Vojtěch Čermák, Kostas Papafitsoros, Lukas Picek
2024-06-17
Abstract: We introduce a new wildlife re-identification dataset WildlifeReID-10k with more than 214k images of 10k individual animals. It is a collection of 30 existing wildlife re-identification datasets with additional processing steps. WildlifeReID-10k contains animals as diverse as marine turtles, primates, birds, African herbivores, marine mammals and domestic animals. Due to the ubiquity of similar images in datasets, we argue that the standard (random) splits into training and testing sets are inadequate for wildlife re-identification and propose a new similarity-aware split based on the similarity of extracted features. To promote fair method comparison, we include similarity-aware splits both for closed-set and open-set settings, use MegaDescriptor - a foundational model for wildlife re-identification - for baseline performance and host a leaderboard with the best results. We publicly publish the dataset and the codes used to create it in the wildlife-datasets library, making WildlifeReID-10k both highly curated and easy to use.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems Addressed by the Paper

The paper aims to address several key issues in wildlife re-identification datasets:

1. **Diversity and Scale of Datasets**:
   - Current wildlife re-identification datasets typically contain a limited number of photos and individual animals, which restricts their standalone value.
   - The datasets lack a unified format and documentation, leading to time-consuming initial analysis.
2. **Inconsistency in Training-Testing Split**:
   - Existing datasets often do not have standardized training-testing split methods or use random splits, which can result in similar images appearing in both the training and testing sets, thus overestimating the performance of re-identification methods.
   - Random splits overlook similar images generated during data collection, such as multiple photos taken during a single human-animal encounter or consecutive frames extracted from a video.
3. **Fair Comparison of Methods**:
   - Newly proposed algorithms are usually evaluated on only a subset of available datasets and are not compared with previous work, making it difficult to compare the performance of different methods.

### Solutions

To address the above issues, the paper proposes the following solutions:

1. **Creating a Large Comprehensive Dataset**:
   - Introduced a new wildlife re-identification dataset **WildlifeReID-10k**, containing over 214,000 images from 10,000 individual animals.
   - This dataset integrates 30 existing wildlife re-identification datasets and includes additional processing steps to improve data quality and diversity.
2. **Proposing a New Similarity-Aware Split Method**:
   - Proposed a new split method based on image feature similarity (similarity-aware split) to prevent similar images from appearing in both the training and testing sets.
   - Used clustering algorithms (such as DBSCAN) to group similar images and ensure these images only appear in the training set, thereby reducing information leakage.
3. **Providing Baseline Performance and Leaderboard**:
   - Used MegaDescriptor as the base model to provide baseline performance results.
   - Recorded the best results on a public leaderboard to promote fair comparison of methods.

### Summary

The paper addresses issues in the diversity and training-testing split of existing datasets by creating a large-scale, high-quality wildlife re-identification dataset **WildlifeReID-10k** and proposing a new similarity-aware split method. These improvements help advance research in wildlife re-identification for computer vision scientists and ecologists.
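
The similarity-aware split described under Solutions can be illustrated with a small sketch. It is not the authors' implementation and does not use the wildlife-datasets library. The function names (`cluster_similar`, `similarity_aware_split`), the `eps` threshold, the toy feature vectors, and the use of cosine distance are all assumptions made for illustration. To stay dependency-free, it replaces DBSCAN with a simpler stand-in: images are grouped when they are chained together by pairwise distances below a threshold. Following the description above, grouped (near-duplicate) images are kept in the training set, and test images are drawn only from images that have no close neighbour. A real split would presumably be computed per individual and on features from an embedding model; both are omitted here.

```python
import math
import random
from collections import defaultdict


def cosine_distance(a, b):
    """Cosine distance between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def cluster_similar(features, eps):
    """Group images linked by a chain of pairwise distances <= eps.

    Simplified stand-in for DBSCAN (assumption, not the paper's code).
    Returns one label per image; -1 marks images with no close neighbour.
    """
    n = len(features)
    parent = list(range(n))  # union-find forest over image indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Merge every pair of images that lie within eps of each other.
    for i in range(n):
        for j in range(i + 1, n):
            if cosine_distance(features[i], features[j]) <= eps:
                parent[find(i)] = find(j)

    members = defaultdict(list)
    for i in range(n):
        members[find(i)].append(i)

    # Number the groups with at least two images; singletons stay at -1.
    labels = [-1] * n
    root_to_label = {}
    for i in range(n):
        root = find(i)
        if len(members[root]) > 1:
            labels[i] = root_to_label.setdefault(root, len(root_to_label))
    return labels


def similarity_aware_split(features, eps=0.01, test_fraction=0.33, seed=0):
    """Return (train_idx, test_idx, labels) with grouped images kept in training."""
    labels = cluster_similar(features, eps)
    grouped = [i for i, lab in enumerate(labels) if lab != -1]
    singles = [i for i, lab in enumerate(labels) if lab == -1]

    rng = random.Random(seed)
    rng.shuffle(singles)

    # Test images come only from images without a near-duplicate.
    n_test = min(len(singles), round(test_fraction * len(features)))
    test_idx = sorted(singles[:n_test])
    train_idx = sorted(grouped + singles[n_test:])
    return train_idx, test_idx, labels


# Toy features: images 0-2 are near-duplicates, images 3-5 are distinct.
features = [
    [1.0, 0.0], [0.99, 0.05], [0.98, 0.1],
    [0.0, 1.0], [-1.0, 0.2], [0.5, -0.9],
]
train_idx, test_idx, labels = similarity_aware_split(features)
print("labels:", labels)
print("train:", train_idx)
print("test:", test_idx)
```

With the toy features above, images 0 to 2 form one group and stay in training, while the test images are drawn from the remaining three. A random split gives no such guarantee: it could place image 1 in the test set while images 0 and 2 sit in training, which is the kind of leakage the summary says inflates reported performance.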