SOAR: Self-supervision Optimized UAV Action Recognition with Efficient Object-Aware Pretraining

Ruiqi Xian, Xiyang Wu, Tianrui Guan, Xijun Wang, Boqing Gong, Dinesh Manocha
2024-09-27
Abstract: We introduce SOAR, a novel self-supervised pretraining algorithm for aerial footage captured by Unmanned Aerial Vehicles (UAVs). We incorporate human object knowledge throughout the pretraining process to enhance UAV video pretraining efficiency and downstream action recognition performance. This is in contrast to prior works that primarily incorporate object information during the fine-tuning stage. Specifically, we first propose a novel object-aware masking strategy designed to retain the visibility of certain patches related to objects throughout the pretraining phase. Second, we introduce an object-aware loss function that utilizes object information to adjust the reconstruction loss, preventing bias towards less informative background patches. In practice, SOAR with a vanilla ViT backbone outperforms the best UAV action recognition models, recording a 9.7% and 21.4% boost in top-1 accuracy on the NEC-Drone and UAV-Human datasets, respectively, while delivering an inference speed of 18.7 ms per video, making it 2x to 5x faster. Additionally, SOAR obtains comparable accuracy to prior self-supervised learning (SSL) methods while requiring 87.5% less pretraining time and 25% less memory usage.
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning, Robotics
What problem does this paper attempt to address?
This paper addresses human action recognition in UAV (Unmanned Aerial Vehicle) videos. Such videos pose unique challenges, including small human targets and limited labeled data, which cause existing action recognition methods to perform poorly. The paper proposes SOAR (Self-supervision Optimized UAV Action Recognition with Efficient Object-Aware Pretraining), a new self-supervised pre-training algorithm. It introduces an object-aware masking strategy and an object-aware loss function to optimize pre-training, thereby improving action recognition performance in downstream tasks.

### Main Problems

1. **Small Human Targets**: Because the UAV flies at altitude, the human body occupies only a very small portion of each video frame. This makes it difficult for the model to capture the fine details of human movement and increases the risk of relying on background features.
2. **Limited Labeled Data**: High-quality labeled data is particularly difficult to obtain for UAV-based perception tasks. The unusual viewpoint, moving camera, and small human targets complicate labeling and make it hard to build a robust dataset.
3. **Limitations of Existing Methods**: Existing methods usually introduce object knowledge only in the fine-tuning stage. This requires additional steps, such as generating bounding boxes or aligning features, which increase computational requirements and slow down inference.

### Solutions

1. **Object-Aware Masking Strategy**: SOAR proposes a new object-aware masking strategy. It guides the masking process by keeping patches related to the object visible, so that the model learns object-related spatio-temporal patterns more effectively.
2. **Object-Aware Loss Function**: SOAR introduces an object-aware loss function. It uses object information to adjust the reconstruction loss, which prevents the model from being biased towards background patches.
3. **Efficient Pre-training Algorithm**: SOAR reduces memory usage and accelerates pre-training while improving action recognition accuracy in downstream tasks. It adds no inference overhead, which gives a faster end-to-end inference process.

### Experimental Results

- **NEC-Drone Dataset**: SOAR achieves a top-1 accuracy of 84.6% with the ViT-B backbone and 90.4% with the ViT-L backbone.
- **UAV-Human Dataset**: SOAR achieves a top-1 accuracy of 66.4% with the ViT-B backbone and 76.4% with the ViT-L backbone.
- **Inference Time**: SOAR processes each video in 18.7 milliseconds on an RTX A5000 GPU, which is 2 times faster than AZTR and 5 times faster than MITFAS.

### Conclusion

SOAR significantly improves human action recognition in UAV videos by introducing an object-aware masking strategy and an object-aware loss function. It reduces the dependence on large-scale labeled data and delivers fast inference, which makes it more efficient and better suited to real-world applications.
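
The two object-aware components listed under Solutions can be illustrated with a small sketch. This summary does not give the paper's exact formulation, so everything below is an assumption made for illustration: the function names, the use of bounding-box overlap as a per-patch objectness score, the `object_keep_frac` budget, and the `1 + weight * score` loss weighting are all hypothetical. The sketch only shows the general idea of keeping some object patches visible during masking and up-weighting object patches in the reconstruction loss.

```python
import random


def patch_objectness(box, grid, patch):
    """Fraction of each patch covered by a box (x0, y0, x1, y1), in pixels.

    Assumption: overlap with a person bounding box stands in for
    "object information"; the paper may derive it differently.
    """
    x0, y0, x1, y1 = box
    scores = []
    for row in range(grid):
        for col in range(grid):
            px0, py0 = col * patch, row * patch
            # Overlap between the patch and the box along each axis.
            w = max(0, min(x1, px0 + patch) - max(x0, px0))
            h = max(0, min(y1, py0 + patch) - max(y0, py0))
            scores.append(w * h / (patch * patch))
    return scores


def object_aware_mask(scores, mask_ratio, object_keep_frac, rng):
    """Return one boolean per patch; True means masked (hidden)."""
    n = len(scores)
    n_visible = n - int(round(mask_ratio * n))
    # Reserve part of the visible budget for the highest-scoring object patches.
    ranked = sorted(range(n), key=lambda i: -scores[i])
    object_ids = [i for i in ranked if scores[i] > 0]
    n_object = min(len(object_ids), int(round(object_keep_frac * n_visible)))
    keep = set(object_ids[:n_object])
    # Fill the remaining visible budget uniformly at random.
    rest = [i for i in range(n) if i not in keep]
    keep.update(rng.sample(rest, n_visible - len(keep)))
    return [i not in keep for i in range(n)]


def object_aware_loss(pred, target, mask, scores, weight):
    """Weighted mean squared error over masked patches only.

    Assumption: each masked patch gets weight 1 + weight * objectness,
    so object patches count more than background patches.
    """
    num = den = 0.0
    for p, t, m, s in zip(pred, target, mask, scores):
        if not m:
            continue  # visible patches are not reconstructed
        w = 1.0 + weight * s
        err = sum((a - b) ** 2 for a, b in zip(p, t)) / len(p)
        num += w * err
        den += w
    return num / den if den else 0.0


# Toy example: a 4x4 grid of 16-pixel patches with one box in the center.
scores = patch_objectness((16, 16, 48, 48), grid=4, patch=16)
mask = object_aware_mask(scores, mask_ratio=0.75, object_keep_frac=0.5,
                         rng=random.Random(0))
print(sum(mask), "of", len(mask), "patches masked")
```

In this toy setup, half of the visible budget is reserved for object patches and the rest is sampled at random, so the overall mask ratio stays fixed while some object patches are always visible. The split between reserved and random patches is a design knob of this sketch and is not a value reported in the source.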