Abstract: We propose a novel approach for aerial video action recognition. Our method is designed for videos captured using UAVs and can run on edge or mobile devices. We present a learning-based approach that uses customized auto zoom to automatically identify the human target and scale it appropriately. This makes it easier to extract the key features and reduces the computational overhead. We also present an efficient temporal reasoning algorithm to capture the action information along the spatial and temporal domains within a controllable computational cost. Our approach has been implemented and evaluated both on a desktop with high-end GPUs and on the low-power Qualcomm Robotics RB5 platform for robots and drones. In practice, we achieve a 6.1-7.4% improvement over SOTA in Top-1 accuracy on the RoCoG-v2 dataset, an 8.3-10.4% improvement on the UAV-Human dataset, and a 3.2% improvement on the Drone Action dataset.
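As a rough illustration of the "identify the target and scale it" idea in the abstract, the sketch below crops a frame around a person bounding box and resizes the crop to a fixed network input size. It is not the authors' learned auto zoom: the box is assumed to come from some external detector, and the names `zoom_crop_box` and `auto_zoom`, the `margin` value, the 224x224 output size, and the nearest-neighbor resize are all hypothetical choices made for this example.

```python
import numpy as np

def zoom_crop_box(box, frame_w, frame_h, margin=0.2):
    """Expand a person box by `margin` on each side and clamp it to the frame.

    `box` is (x1, y1, x2, y2) in pixels; `margin` is an assumed padding ratio.
    """
    x1, y1, x2, y2 = box
    pad_x = (x2 - x1) * margin
    pad_y = (y2 - y1) * margin
    left = max(0, int(round(x1 - pad_x)))
    top = max(0, int(round(y1 - pad_y)))
    right = min(frame_w, int(round(x2 + pad_x)))
    bottom = min(frame_h, int(round(y2 + pad_y)))
    return left, top, right, bottom

def auto_zoom(frame, box, out_size=(224, 224), margin=0.2):
    """Crop `frame` (H x W x C array) around `box` and resize the crop.

    Uses nearest-neighbor sampling so the sketch needs only NumPy.
    The aspect ratio of the crop is not preserved.
    """
    h, w = frame.shape[:2]
    left, top, right, bottom = zoom_crop_box(box, w, h, margin)
    if right <= left or bottom <= top:
        raise ValueError("empty crop; check the bounding box")
    crop = frame[top:bottom, left:right]
    out_h, out_w = out_size
    # Map each output row/column back to a source row/column in the crop.
    rows = np.arange(out_h) * crop.shape[0] // out_h
    cols = np.arange(out_w) * crop.shape[1] // out_w
    return crop[rows][:, cols]

# A small person (60 x 120 px) in a 1920 x 1080 aerial frame.
frame = np.zeros((1080, 1920, 3), dtype=np.uint8)
zoomed = auto_zoom(frame, box=(900, 500, 960, 620))
print(zoomed.shape)
```

After cropping, the person fills most of the network input instead of a few dozen pixels, which is the intuition behind both the easier feature extraction and the lower compute cost described above.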

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper addresses human action recognition in videos captured by Unmanned Aerial Vehicles (UAVs). Specifically, it proposes a method that performs video action recognition efficiently on both edge devices (such as the Qualcomm Robotics RB5) and desktop GPUs (such as the Nvidia RTX A5000).

#### Main Contributions

1. **Automatic Scaling Algorithm**:
   - A new automatic scaling (auto zoom) algorithm identifies the target and adjusts its size to fit the memory or processor constraints of the device.
   - The algorithm is suited to low-resolution, multi-scale, and moving-camera scenarios.
   - Automatic focusing, cropping, and scaling capture the key action information, which makes human action analysis more robust.
2. **Temporal Reasoning Algorithm**:
   - A temporal reasoning algorithm combines convolution and attention mechanisms to capture action information.
   - 3D convolution is used on desktop GPUs, while (2+1)D convolution is used on low-power edge devices to balance accuracy and inference speed.
   - The attention mechanism includes cross-attention and self-attention, has linear computational complexity, and captures long-range spatio-temporal relationships.

#### Experimental Results

- On the RoCoG-v2 dataset, the method improves Top-1 accuracy by 6.1-7.4%.
- On the UAV-Human dataset, it improves Top-1 accuracy by 8.3-10.4%.
- On the Drone Action dataset, it improves Top-1 accuracy by 3.2%, reaching 95.9%.

### Conclusion and Future Work

This paper proposes a method for action recognition in UAV videos that consists of an automatic scaling algorithm and a temporal reasoning algorithm. Although it achieves significant improvements on multiple datasets, the method has limitations: it relies on the performance of the localization method, and it assumes that only one human actor performs actions in the input video.
Future work will focus on handling multiple actors and different lighting and weather conditions.
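The trade-off between full 3D convolution and a factorized spatial-then-temporal convolution on edge devices can be made concrete by counting weights. The sketch below compares a single 3D convolution layer with a factorized layer made of a 1 x k x k spatial convolution followed by a k x 1 x 1 temporal convolution. The channel sizes, kernel sizes, and the choice of intermediate width (`c_mid` equal to the output width) are assumptions for illustration, not values taken from the paper; biases are ignored.

```python
def conv3d_params(c_in, c_out, k_t=3, k_s=3):
    """Weights in one k_t x k_s x k_s 3D convolution (bias ignored)."""
    return c_in * c_out * k_t * k_s * k_s

def factorized_conv_params(c_in, c_out, c_mid=None, k_t=3, k_s=3):
    """Weights in a 1 x k_s x k_s spatial conv followed by a k_t x 1 x 1 temporal conv.

    `c_mid` is the assumed intermediate channel width; it defaults to `c_out`.
    """
    if c_mid is None:
        c_mid = c_out
    spatial = c_in * c_mid * k_s * k_s
    temporal = c_mid * c_out * k_t
    return spatial + temporal

full = conv3d_params(64, 64)
factored = factorized_conv_params(64, 64)
print(full, factored, full / factored)  # 110592 49152 2.25
```

Under these assumptions the factorized layer uses 2.25 times fewer weights than the 3D layer, which is consistent with reserving full 3D convolution for desktop GPUs and using the cheaper factorized form on low-power hardware.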

AZTR: Aerial Video Action Recognition with Auto Zoom and Temporal Reasoning

Adaptive Switching Spatial-Temporal Fusion Detection for Remote Flying Drones

MITFAS: Mutual Information based Temporal Feature Alignment and Sampling for Aerial Video Action Recognition

Alleviating Spatial Misalignment and Motion Interference for UAV-based Video Recognition

SOAR: Self-supervision Optimized UAV Action Recognition with Efficient Object-Aware Pretraining

Intuitive Human-Robot Interface: A 3-Dimensional Action Recognition and UAV Collaboration Framework

Detecting Human Actions in Drone Images Using YoloV5 and Stochastic Gradient Boosting

A Hybrid Approach Based on GAN and CNN-LSTM for Aerial Activity Recognition

Drone-Action: An Outdoor Recorded Drone Video Dataset for Action Recognition

Unmanned aerial vehicles for human detection and recognition using neural-network model

Active Human Pose Estimation via an Autonomous UAV Agent

UAV-Human: A Large Benchmark for Human Behavior Understanding with Unmanned Aerial Vehicles

AERO: AI-Enabled Remote Sensing Observation with Onboard Edge Computing in UAVs

Action Machine: Rethinking Action Recognition in Trimmed Videos

A Multi-viewpoint Outdoor Dataset for Human Action Recognition

Learning Discriminative and Robust Representations for UAV-View Skeleton-Based Action Recognition

Zoom-and-Reasoning: Joint Foreground Zoom and Visual-Semantic Reasoning Detection Network for Aerial Images

ARF-YOLOv8: a novel real-time object detection model for UAV-captured images detection

Action recognition using attention-based spatio-temporal VLAD networks and adaptive video sequences optimization

FAR: Fourier Aerial Video Recognition

Joint inference of groups, events and human roles in aerial videos