Abstract: We introduce WildlifeReID-10k, a new wildlife re-identification dataset with more than 214k images of 10k individual animals. It is a collection of 30 existing wildlife re-identification datasets with additional processing steps. WildlifeReID-10k contains animals as diverse as marine turtles, primates, birds, African herbivores, marine mammals and domestic animals. Due to the ubiquity of similar images in datasets, we argue that the standard (random) splits into training and testing sets are inadequate for wildlife re-identification and propose a new similarity-aware split based on the similarity of extracted features. To promote fair method comparison, we include similarity-aware splits for both closed-set and open-set settings, use MegaDescriptor, a foundational model for wildlife re-identification, for baseline performance, and host a leaderboard with the best results. We publicly release the dataset and the code used to create it in the wildlife-datasets library, making WildlifeReID-10k both highly curated and easy to use.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper aims to address several key issues in wildlife re-identification datasets:

1. **Diversity and Scale of Datasets**:
   - Current wildlife re-identification datasets typically contain a limited number of photos and individual animals, which restricts their standalone value.
   - The datasets lack a unified format and documentation, which makes initial analysis time-consuming.
2. **Inconsistency in Training-Testing Split**:
   - Existing datasets often either have no standardized training-testing split or use random splits. Random splits can place similar images in both the training and testing sets, which overestimates the performance of re-identification methods.
   - Random splits overlook similar images generated during data collection, such as multiple photos taken during a single human-animal encounter or consecutive frames extracted from a video.
3. **Fair Comparison of Methods**:
   - Newly proposed algorithms are usually evaluated on only a subset of the available datasets and are not compared with previous work, making it difficult to compare the performance of different methods.

### Solutions

To address the above issues, the paper proposes the following solutions:

1. **Creating a Large Comprehensive Dataset**:
   - Introduces a new wildlife re-identification dataset, **WildlifeReID-10k**, containing over 214,000 images of 10,000 individual animals.
   - The dataset integrates 30 existing wildlife re-identification datasets and includes additional processing steps to improve data quality and diversity.
2. **Proposing a New Similarity-Aware Split Method**:
   - Proposes a new split method based on image feature similarity (similarity-aware split) to prevent similar images from appearing in both the training and testing sets.
   - Uses a clustering algorithm (such as DBSCAN) to group similar images and ensures that these images appear only in the training set, thereby reducing information leakage.
3. **Providing Baseline Performance and Leaderboard**:
   - Uses MegaDescriptor as the base model to provide baseline performance results.
   - Records the best results on a public leaderboard to promote fair comparison of methods.

### Summary

The paper addresses issues in the diversity and training-testing split of existing datasets by creating a large-scale, high-quality wildlife re-identification dataset, **WildlifeReID-10k**, and by proposing a new similarity-aware split method. These improvements help computer vision scientists and ecologists advance research in wildlife re-identification.
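
The similarity-aware split described above can be illustrated with a minimal, dependency-free sketch. It is not the code from the wildlife-datasets library. The function names `cluster_similar` and `similarity_aware_split`, the parameters `eps` and `test_ratio`, and the toy feature vectors are all hypothetical. Instead of DBSCAN, the sketch groups images by linking any two feature vectors closer than `eps`, which is a simplified stand-in. It also assumes that clustering is done per individual and that at least one image of each individual stays in the training set.

```python
import math
import random
from collections import defaultdict


def cluster_similar(features, eps):
    """Label images linked by a chain of distances <= eps; isolated images get -1.

    Simplified stand-in for DBSCAN (assumption of this sketch).
    """
    n = len(features)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(features[i], features[j]) <= eps:
                parent[find(i)] = find(j)

    members = defaultdict(list)
    for i in range(n):
        members[find(i)].append(i)

    labels = [-1] * n
    next_label = 0
    for group in members.values():
        if len(group) > 1:  # singletons stay as noise (-1)
            for i in group:
                labels[i] = next_label
            next_label += 1
    return labels


def similarity_aware_split(features, identities, eps=0.5, test_ratio=0.5, seed=0):
    """Return (train_idx, test_idx) with every clustered image kept in training."""
    rng = random.Random(seed)
    by_identity = defaultdict(list)
    for idx, identity in enumerate(identities):
        by_identity[identity].append(idx)

    train, test = [], []
    for idxs in by_identity.values():
        labels = cluster_similar([features[i] for i in idxs], eps)
        clustered = [i for i, lab in zip(idxs, labels) if lab != -1]
        isolated = [i for i, lab in zip(idxs, labels) if lab == -1]
        rng.shuffle(isolated)
        # Only isolated images may go to the test set; keep one image for training.
        n_test = min(len(isolated), round(test_ratio * len(idxs)), len(idxs) - 1)
        test.extend(isolated[:n_test])
        train.extend(clustered + isolated[n_test:])
    return sorted(train), sorted(test)


# Toy example: images 0 and 1 (individual "a") and 4 and 5 (individual "b")
# are near-duplicates, e.g. consecutive video frames.
features = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (9.0, 0.0), (0.0, 9.0), (0.05, 9.0)]
identities = ["a", "a", "a", "a", "b", "b"]
train_idx, test_idx = similarity_aware_split(features, identities)
print(train_idx, test_idx)
```

In this toy run the near-duplicate pairs stay together in the training set, so no test image has a close copy in training. A random split could instead place image 0 in training and image 1 in testing, which is the leakage the summary describes.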

WildlifeReID-10k: Wildlife re-identification dataset with 10k individual animals
