TxVAD: Improved Video Action Detection by Transformers

Zhenyu Wu, Zhou Ren, Yi Wu, Zhangyang Wang, Gang Hua
DOI: https://doi.org/10.1145/3503161.3547992
2022-01-01
Abstract: Video action detection aims to localize persons in both space and time in video sequences and to recognize their actions. Most existing methods are composed of many specialized components, e.g., pretrained person/object detectors, region proposal networks (RPN), memory banks, and so on. This paper proposes a conceptually simple paradigm for video action detection using Transformers, which removes the need for these specialized components and achieves superior performance. Our proposed Transformer-based Video Action Detector (TxVAD) uses two Transformers: one captures scene context information for person localization, and the other captures long-range spatio-temporal context information for action classification. Through extensive experiments on four public datasets, AVA, AVA-Kinetics, JHMDB-21, and UCF101-24, we show that this paradigm achieves state-of-the-art performance on the video action detection task without using pretrained person/object detectors, RPNs, or memory banks.