Alleviating Spatial Misalignment and Motion Interference for UAV-based Video Recognition

Gege Shi,Xueyang Fu,Chengzhi Cao,Zheng-Jun Zha
DOI: https://doi.org/10.1145/3581783.3611799
2023-01-01
Abstract:Recognizing activities with Unmanned Aerial Vehicles (UAVs) is essential for many applications, while existing video recognition methods are mainly designed for ground cameras and do not account for UAV changing attitudes and fast motion. This creates spatial misalignment of small objects between frames, leading to inaccurate visual movement in drone videos. Additionally, camera motion relative to objects in the video causes relative movements that visually affect object motion and can result in misunderstandings of video content. To address these issues, we present a novel framework named Attentional Spatial and Adaptive Temporal Relations Modeling. First, to mitigate the spatial misalignment of small objects between frames, we design an Attentional Patch-level Spatial Enrichment (APSE) module that models dependencies among patches and enhances patch-level features. Then, we propose a Multi-scale Temporal and Spatial Mixer (MTSM) module that is capable of adapting to disturbances caused by the UAV flight and modeling various temporal clues. By integrating APSE and MTSM into a single model, our network can effectively and accurately capture spatiotemporal relations for UAV videos. Extensive experiments on several benchmarks demonstrate the superiority of our method over state-of-the-art approaches. For instance, our network achieves a classification accuracy of 68.1% with an absolute gain of 1.3% compared to FuTH-Net on the ERA dataset.
What problem does this paper attempt to address?