TQRFormer: Tubelet query recollection transformer for action detection

Xiangyang Wang, Kun Yang, Qiang Ding, Rui Wang, Jinhua Sun
DOI: https://doi.org/10.1016/j.imavis.2024.105059
IF: 3.86
2024-04-30
Image and Vision Computing
Abstract: Spatial and temporal action detection aims to precisely locate actions while predicting their respective categories. The existing solution, TubeR (Zhao et al., 2022), is designed to directly detect action tubes in videos by recognizing and localizing actions using a unified representation. However, a potential challenge arises during the decoding stage, leading to a gradual decrease in the model's performance in action detection, specifically in terms of the confidence associated with detected actions. In this paper, we propose TQRFormer: Tubelet Query Recollection Transformer, which enables each subsequent decoder stage to obtain information from the previous stage. Specifically, we design Query Recollection Attention to correct errors and output the synthesized results, effectively breaking the limitations of sequential decoding. During the training stage, TubeR (Zhao et al., 2022) generates a limited number of positive sample queries through a one-to-one matching strategy, potentially limiting the effectiveness of training with positive samples. To increase the number of positive samples, we propose a stage matching approach that combines one-to-many matching and one-to-one matching without additional queries. We also propose a more elegant classification head that encodes the start and end frames of each tubelet, eliminating the need for a separate action switch. TQRFormer outperforms previous state-of-the-art methods on public action detection datasets, including AVA, UCF101-24, JHMDB-21 and MultiSports. The code will be available at https://github.com/ykyk000/TQRFormer .
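To illustrate why mixing one-to-many and one-to-one matching yields more positive samples from the same query set, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not say which decoder stages use which strategy, so the schedule in `stage_match`, the top-k rule in `one_to_many_match`, the toy cost matrix, and all function and parameter names are assumptions made purely for illustration.

```python
from itertools import permutations


def one_to_one_match(cost):
    """Assign each ground-truth tube (column) a distinct query (row) with
    minimum total cost. Brute force, so only suitable for toy sizes."""
    num_queries, num_gt = len(cost), len(cost[0])
    best_total, best_rows = float("inf"), None
    for rows in permutations(range(num_queries), num_gt):
        total = sum(cost[q][g] for g, q in enumerate(rows))
        if total < best_total:
            best_total, best_rows = total, rows
    return {q: g for g, q in enumerate(best_rows)}


def one_to_many_match(cost, k):
    """Assumed rule: each ground truth takes its k lowest-cost queries.
    A query claimed by several ground truths keeps the cheapest one."""
    num_queries, num_gt = len(cost), len(cost[0])
    assignment = {}
    for g in range(num_gt):
        ranked = sorted(range(num_queries), key=lambda q: cost[q][g])[:k]
        for q in ranked:
            if q not in assignment or cost[q][g] < cost[q][assignment[q]]:
                assignment[q] = g
    return assignment


def stage_match(cost, stage, num_one_to_many_stages, k):
    """Hypothetical schedule: early decoder stages use one-to-many matching,
    later stages fall back to one-to-one matching."""
    if stage < num_one_to_many_stages:
        return one_to_many_match(cost, k)
    return one_to_one_match(cost)


# Toy matching costs: 4 queries (rows) x 2 ground-truth tubes (columns).
cost = [
    [0.1, 0.9],
    [0.2, 0.8],
    [0.7, 0.3],
    [0.9, 0.4],
]

early = stage_match(cost, stage=0, num_one_to_many_stages=3, k=2)
late = stage_match(cost, stage=5, num_one_to_many_stages=3, k=2)
print(len(early), "positive queries in an early stage")
print(len(late), "positive queries in a late stage")
```

In this toy setting the one-to-many stage marks four queries as positive while the one-to-one stage marks only two, and no extra queries are introduced in either case.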
computer science, artificial intelligence, theory & methods, engineering, electrical & electronic, software engineering, optics