AZTR: Aerial Video Action Recognition with Auto Zoom and Temporal Reasoning

Xijun Wang, Ruiqi Xian, Tianrui Guan, Celso M. de Melo, Stephen M. Nogar, Aniket Bera, Dinesh Manocha
DOI: https://doi.org/10.1109/ICRA48891.2023.10160564
2023-03-03
Abstract: We propose a novel approach for aerial video action recognition. Our method is designed for videos captured using UAVs and can run on edge or mobile devices. We present a learning-based approach that uses customized auto zoom to automatically identify the human target and scale it appropriately. This makes it easier to extract the key features and reduces the computational overhead. We also present an efficient temporal reasoning algorithm to capture the action information along the spatial and temporal domains within a controllable computational cost. Our approach has been implemented and evaluated both on the desktop with high-end GPUs and on the low power Robotics RB5 Platform for robots and drones. In practice, we achieve 6.1-7.4% improvement over SOTA in Top-1 accuracy on the RoCoG-v2 dataset, 8.3-10.4% improvement on the UAV-Human dataset and 3.2% improvement on the Drone Action dataset.
Computer Vision and Pattern Recognition, Robotics
What problem does this paper attempt to address?
### Problems Addressed by the Paper

This paper aims to address the problem of human action recognition in videos captured by Unmanned Aerial Vehicles (UAVs). Specifically, the paper proposes a novel method that can efficiently perform video action recognition on edge devices (such as Qualcomm Robotics RB5) and desktop GPUs (such as Nvidia RTX A5000).

#### Main Contributions

1. **Automatic Scaling Algorithm**:
   - A new automatic scaling algorithm is proposed, which can effectively identify targets and adjust their size to meet the memory or processor requirements of the device.
   - This algorithm is suitable for low-resolution, multi-scale, and moving camera scenarios.
   - By using automatic focusing, cropping, and scaling strategies to capture key action information, human action analysis becomes more robust.
2. **Temporal Reasoning Algorithm**:
   - A temporal reasoning algorithm is introduced, combining convolution and attention mechanisms to capture action information.
   - 3D convolution is performed on desktop GPUs, while (2D+1) convolution is executed on low-power edge devices to balance accuracy and inference speed.
   - The attention mechanism includes cross-attention and self-attention, providing linear computational complexity and capturing long-range spatio-temporal relationships.

#### Experimental Results

- On the RoCoG-v2 dataset, this method improved Top-1 accuracy by 6.1-7.4%.
- On the UAV-Human dataset, Top-1 accuracy improved by 8.3-10.4%.
- On the Drone Action dataset, Top-1 accuracy improved by 3.2%, reaching 95.9%.

### Conclusion and Future Work

This paper proposes a method for action recognition in UAV videos, including an automatic scaling algorithm and a temporal reasoning algorithm. Although significant improvements have been achieved on multiple datasets, the method still has some limitations, such as reliance on the performance of localization methods and the assumption that there is only one human actor performing actions in the input video.
Future work will focus on handling multiple actors and different lighting and weather conditions.
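
The cropping and scaling step described under the automatic scaling algorithm can be illustrated with a minimal sketch. It is not the paper's implementation: the paper's auto zoom is learning-based, whereas this sketch assumes a person bounding box is already available from some detector. The function name `auto_zoom`, the `margin` parameter, and the nearest-neighbour resize are assumptions chosen for illustration.

```python
import numpy as np


def auto_zoom(frame, box, out_size=(224, 224), margin=0.2):
    """Crop a frame around a person box and rescale the crop to a fixed size.

    frame:    H x W x C image array.
    box:      (x0, y0, x1, y1) in pixels, assumed to come from a detector.
    out_size: (height, width) of the returned crop.
    margin:   fraction of the box size added on each side as context.
    """
    h, w = frame.shape[:2]
    x0, y0, x1, y1 = box
    pad_x = (x1 - x0) * margin
    pad_y = (y1 - y0) * margin

    # Expand the box by the margin, then clamp it to the frame borders.
    x0 = int(max(0, np.floor(x0 - pad_x)))
    y0 = int(max(0, np.floor(y0 - pad_y)))
    x1 = int(min(w, np.ceil(x1 + pad_x)))
    y1 = int(min(h, np.ceil(y1 + pad_y)))

    crop = frame[y0:y1, x0:x1]
    if crop.size == 0:
        raise ValueError("box does not overlap the frame")

    # Nearest-neighbour resize via integer index maps (no extra dependencies).
    out_h, out_w = out_size
    rows = np.arange(out_h) * crop.shape[0] // out_h
    cols = np.arange(out_w) * crop.shape[1] // out_w
    return crop[rows[:, None], cols]
```

Applied per frame, for example `np.stack([auto_zoom(f, b) for f, b in zip(frames, boxes)])`, this yields a fixed-size clip in which the actor fills most of each frame. A fixed output size is what lets the input resolution be matched to a device's memory budget.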