POLO -- Point-based, multi-class animal detection

Giacomo May, Emanuele Dalsasso, Benjamin Kellenberger, Devis Tuia
2024-10-16
Abstract: Automated wildlife surveys based on drone imagery and object detection technology are a powerful and increasingly popular tool in conservation biology. Most detectors require training images with annotated bounding boxes, which are tedious, expensive, and not always unambiguous to create. To reduce the annotation load associated with this practice, we develop POLO, a multi-class object detection model that can be trained entirely on point labels. POLO is based on simple, yet effective modifications to the YOLOv8 architecture, including alterations to the prediction process, training losses, and post-processing. We test POLO on drone recordings of waterfowl containing up to multiple thousands of individual birds in one image and compare it to a regular YOLOv8. Our experiments show that at the same annotation cost, POLO achieves improved accuracy in counting animals in aerial imagery.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper aims to reduce the annotation cost of wildlife surveys based on drone imagery while improving the accuracy of animal counts. It targets two problems:

1. **High annotation cost**: Conventional object detection models such as YOLOv8 require a large number of training images annotated with bounding boxes. Creating these annotations is time-consuming, expensive, and in some cases ambiguous.
2. **Small-object detection challenges**: In drone imagery, animals are usually very small (only a few pixels) and may be partially occluded or distorted by perspective and motion blur. As a result, automatically created bounding boxes are of poor quality, which hurts detection accuracy.

To address these problems, the authors developed POLO (Point-based, multi-class animal detection), a multi-class object detection model that can be trained entirely on point labels rather than bounding box annotations. By modifying the prediction process, loss functions, and post-processing steps of the YOLOv8 architecture, POLO achieves higher animal-counting accuracy at the same annotation cost.

### Main Contributions

- **Point-label training**: POLO can be trained directly on point labels, which reduces the annotation workload.
- **Modified YOLOv8 architecture**: Simple but effective changes were made to YOLOv8, covering the output dimensions, loss functions, and post-processing.
- **Experimental verification**: Tests on a drone-image dataset of Izembek Lagoon in Alaska show that POLO outperforms the standard YOLOv8 model at counting multiple species.
### Formula Summary

- **Center-point prediction**:
  \[ \hat{p}_x=\sigma(a_1)\cdot2^{-0.5}+c_x \]
  \[ \hat{p}_y=\sigma(a_2)\cdot2^{-0.5}+c_y \]
  where \(\hat{p}_x\) and \(\hat{p}_y\) are the predicted coordinates, \(a_1\) and \(a_2\) are the activation values of the grid cell in the first and second output channels, \(\sigma(\cdot)\) is the sigmoid function, and \(c_x\) and \(c_y\) are the coordinates of the upper-left corner of the grid cell.
- **Average Hausdorff distance loss**:
  \[ L_{AH}(\hat{P}, P)=\frac{1}{|P|}\sum_{i=1}^{|P|}\min_{\hat{p}\in\hat{P}}d(\hat{p}, p_i)+\frac{1}{|\hat{P}|}\sum_{j=1}^{|\hat{P}|}\min_{p\in P}d(\hat{p}_j, p) \]
  where \(\hat{P}\) is the set of predicted points and \(P\) is the set of ground-truth points.
- **Mean squared error loss**:
  \[ L_{MSE}=\frac{1}{|P|}\sum_{i=1}^{|P|}\|p_i-\hat{p}_i\|_2^2 \]
- **Distance-over-Radius (DoR) metric**:
  \[ DoR=\frac{d(\hat{p}, p)}{r_c} \]
  where \(d(\hat{p}, p)\) is the Euclidean distance between the predicted point and the true position, and \(r_c\) is a user-specified radius for each object/animal category.

With these changes, POLO reduces annotation cost while still counting multiple species accurately, with a particular advantage for small targets and dense scenes.
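The average Hausdorff distance, mean squared error, and DoR formulas summarized above can be sketched in NumPy as follows. This is only an illustrative sketch, not the authors' implementation: the function names are hypothetical, both point sets are assumed to be non-empty arrays of shape `(n, 2)`, and the MSE version assumes each prediction has already been paired with its ground-truth point by row index.

```python
import numpy as np


def pairwise_distances(a, b):
    """Euclidean distance between every point in a (n, 2) and every point in b (m, 2)."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def average_hausdorff(pred, gt):
    """L_AH: mean nearest-prediction distance per label plus mean nearest-label distance per prediction."""
    d = pairwise_distances(pred, gt)   # shape (|pred|, |gt|)
    gt_term = d.min(axis=0).mean()     # for each ground-truth point, its closest prediction
    pred_term = d.min(axis=1).mean()   # for each prediction, its closest ground-truth point
    return float(gt_term + pred_term)


def mse_loss(pred, gt):
    """L_MSE: mean squared L2 distance between row-wise paired points (assumed pairing)."""
    return float(((gt - pred) ** 2).sum(axis=-1).mean())


def distance_over_radius(pred_point, gt_point, radius):
    """DoR: Euclidean distance divided by the class-specific radius r_c."""
    diff = np.asarray(pred_point, dtype=float) - np.asarray(gt_point, dtype=float)
    return float(np.linalg.norm(diff) / radius)


# Toy example: two predictions, one ground-truth point.
pred = np.array([[0.0, 0.0], [3.0, 4.0]])
gt = np.array([[0.0, 0.0]])
ah = average_hausdorff(pred, gt)
dor = distance_over_radius(pred[1], gt[0], radius=10.0)
```

In this toy example the stray prediction at `(3, 4)` lies 5 pixels from the only label, so it dominates the prediction-side term of the Hausdorff distance, and with an assumed radius of 10 pixels its DoR is 0.5.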