Abstract: We introduce SOAR, a novel self-supervised pretraining algorithm for aerial footage captured by Unmanned Aerial Vehicles (UAVs). We incorporate human object knowledge throughout the pretraining process to enhance UAV video pretraining efficiency and downstream action recognition performance. This is in contrast to prior works that primarily incorporate object information during the fine-tuning stage. Specifically, we first propose a novel object-aware masking strategy designed to retain the visibility of certain patches related to objects throughout the pretraining phase. Second, we introduce an object-aware loss function that utilizes object information to adjust the reconstruction loss, preventing bias towards less informative background patches. In practice, SOAR with a vanilla ViT backbone outperforms the best UAV action recognition models, recording 9.7% and 21.4% boosts in top-1 accuracy on the NEC-Drone and UAV-Human datasets, respectively, while delivering an inference speed of 18.7ms per video, making it 2x to 5x faster. Additionally, SOAR obtains comparable accuracy to prior self-supervised learning (SSL) methods while requiring 87.5% less pretraining time and 25% less memory usage.
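The object-aware masking idea in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `object_aware_visible_patches`, the patch-grid bounding box input, the 0.9 mask ratio, and the `object_keep_frac` parameter are all assumptions made for illustration. The sketch reserves part of the visible-patch budget for patches inside an object box and fills the rest uniformly at random.

```python
import random


def object_aware_visible_patches(num_rows, num_cols, box,
                                 mask_ratio=0.9, object_keep_frac=0.5, seed=0):
    """Pick the visible (unmasked) patch indices for one frame.

    box is (row0, col0, row1, col1) in patch units, half-open. It is assumed
    to come from an external person detector (hypothetical input).
    All default values are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    total = num_rows * num_cols
    # Number of patches the encoder is allowed to see.
    num_visible = max(1, round(total * (1.0 - mask_ratio)))

    row0, col0, row1, col1 = box
    object_patches = [r * num_cols + c
                      for r in range(row0, row1)
                      for c in range(col0, col1)]

    # Reserve a share of the visible budget for object patches.
    num_object_visible = min(len(object_patches),
                             round(num_visible * object_keep_frac))
    visible = set(rng.sample(object_patches, num_object_visible))

    # Fill the remaining budget uniformly from every patch not yet chosen.
    remaining = [i for i in range(total) if i not in visible]
    visible.update(rng.sample(remaining, num_visible - len(visible)))
    return visible
```

With a 14x14 patch grid and a 4x4-patch box, this keeps 20 patches visible, at least 10 of which lie on the object. Plain random masking would, on average, leave far fewer object patches visible when the person occupies a small part of the frame.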

What problem does this paper attempt to address?

This paper addresses human action recognition in UAV (Unmanned Aerial Vehicle) videos. Such videos pose distinct challenges, including small human targets and limited labeled data, which cause existing action recognition methods to perform poorly. The paper proposes SOAR (Self-supervision Optimized UAV Action Recognition with Efficient Object-Aware Pretraining), a new self-supervised pretraining algorithm. By introducing an object-aware masking strategy and an object-aware loss function, it optimizes the pretraining process and thereby improves action recognition performance on downstream tasks.

### Main Problems

1. **Small Human Targets**: Because of the UAV's altitude, the human body occupies a very small portion of each video frame. This makes it difficult for the model to capture fine details of human movement and increases the risk of relying on background features.
2. **Limited Labeled Data**: High-quality labeled data is particularly hard to obtain for UAV-based perception tasks. The unusual viewpoint, moving camera, and small human targets complicate labeling and make it difficult to build a robust dataset.
3. **Limitations of Existing Methods**: Existing methods usually introduce object knowledge only in the fine-tuning stage. This requires additional steps, such as generating bounding boxes or aligning features, which increase computational requirements and slow down inference.

### Solutions

1. **Object-Aware Masking Strategy**: SOAR proposes a new object-aware masking strategy. It guides the masking process by keeping patches related to the object visible, so that the model learns object-related spatio-temporal patterns more effectively.
2. **Object-Aware Loss Function**: SOAR introduces an object-aware loss function. By using object information to adjust the reconstruction loss, it prevents the model from being biased towards background patches.
3. **Efficient Pretraining Algorithm**: SOAR reduces memory usage and accelerates pretraining while improving action recognition accuracy on downstream tasks. It adds no inference overhead, which gives a faster end-to-end inference process.

### Experimental Results

- **NEC-Drone Dataset**: SOAR achieves a top-1 accuracy of 84.6% with a ViT-B backbone and 90.4% with a ViT-L backbone.
- **UAV-Human Dataset**: SOAR achieves a top-1 accuracy of 66.4% with a ViT-B backbone and 76.4% with a ViT-L backbone.
- **Inference Time**: SOAR processes a video in 18.7 milliseconds on an RTX A5000 GPU, which is 2 times faster than AZTR and 5 times faster than MITFAS.

### Conclusion

By introducing an object-aware masking strategy and an object-aware loss function, SOAR significantly improves human action recognition performance in UAV videos, reduces the dependence on large-scale labeled data, and achieves fast inference. These improvements make SOAR more efficient and better suited to real-world applications.
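The object-aware loss described above can be sketched as a weighted reconstruction error. The summary does not give the paper's exact formula, so this is only one plausible form: the name `object_aware_loss`, the `object_weight` value of 2.0, and the choice of a weighted mean squared error over masked patches are all assumptions for illustration.

```python
def object_aware_loss(pred, target, is_object, masked, object_weight=2.0):
    """Weighted mean squared error over masked patches.

    pred, target: lists of per-patch pixel lists of equal length.
    is_object:    per-patch flags, True when the patch overlaps an object.
    masked:       per-patch flags, True when the patch was hidden from the encoder.
    object_weight is an illustrative assumption, not a value from the paper.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for p, t, obj, m in zip(pred, target, is_object, masked):
        if not m:
            continue  # only masked patches contribute, as in masked autoencoding
        mse = sum((a - b) ** 2 for a, b in zip(p, t)) / len(p)
        w = object_weight if obj else 1.0  # up-weight object patches
        weighted_sum += w * mse
        weight_total += w
    # Normalize by the total weight so the loss scale stays comparable.
    return weighted_sum / weight_total if weight_total else 0.0
```

Setting `object_weight=1.0` recovers a plain mean squared error over masked patches. Larger values make errors on object patches count for more than errors on background patches, which is the kind of background-bias correction the summary describes.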

SOAR: Self-supervision Optimized UAV Action Recognition with Efficient Object-Aware Pretraining

AZTR: Aerial Video Action Recognition with Auto Zoom and Temporal Reasoning

SOAR: Scene-debiasing Open-set Action Recognition

SOAR: Simultaneous Exploration and Photographing with Heterogeneous UAVs for Fast Autonomous Reconstruction

An improved deep learning method for flying object detection and recognition

A novel small object detection algorithm for UAVs based on YOLOv5

Alleviating Spatial Misalignment and Motion Interference for UAV-based Video Recognition

SOAR: Advancements in Small Body Object Detection for Aerial Imagery Using State Space Models and Programmable Gradients

ARF-YOLOv8: a novel real-time object detection model for UAV-captured images detection

AFE-YOLOv8: A Novel Object Detection Model for Unmanned Aerial Vehicle Scenes with Adaptive Feature Enhancement

Lightweight unmanned aerial vehicle object detection algorithm based on improved YOLOv8

YOLOv7-UAV: An Unmanned Aerial Vehicle Image Object Detection Algorithm Based on Improved YOLOv7

RSOD: Real-time small object detection algorithm in UAV-based traffic monitoring

Object Detection for UAV Aerial Scenarios Based on Vectorized IOU

SOAR: Self-Occluded Avatar Recovery from a Single Video In the Wild

Target Detection Method of UAV Aerial Imagery Based on Improved YOLOv5

MITFAS: Mutual Information based Temporal Feature Alignment and Sampling for Aerial Video Action Recognition

Learnable Cross-Scale Sparse Attention Guided Feature Fusion for UAV Object Detection

An Enhanced Target Detection Algorithm for Maritime Search and Rescue Based on Aerial Images

SOD-YOLO: Small-Object-Detection Algorithm Based on Improved YOLOv8 for UAV Images

UAV Target Detection Algorithm Based on Improved YOLOv8