Multiscale Vision Transformers meet Bipartite Matching for efficient single-stage Action Localization

Ioanna Ntinou, Enrique Sanchez, Georgios Tzimiropoulos
2024-05-23
Abstract: Action Localization is a challenging problem that combines detection and recognition tasks, which are often addressed separately. State-of-the-art methods rely on off-the-shelf bounding box detections pre-computed at high resolution and propose transformer models that focus on the classification task alone. Such two-stage solutions are prohibitive for real-time deployment. Single-stage methods, on the other hand, target both tasks by sharing the majority of the workload in one part of the network (generally the backbone), trading performance for speed. These methods add a DETR head with learnable queries that, after cross- and self-attention, are sent to corresponding MLPs to predict a person's bounding box and action. However, DETR-like architectures are challenging to train and can incur high complexity.
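To make the DETR-style head mentioned in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes single-head attention without learned projections, no layer normalization, and random weights; the names `detr_style_head`, `params`, and all dimensions (`dim`, `num_tokens`, `num_queries`, `num_actions`) are hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Single-head scaled dot-product attention (assumption: no learned projections).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def mlp(x, w1, w2):
    # Two-layer MLP with ReLU.
    return np.maximum(x @ w1, 0.0) @ w2

def detr_style_head(features, queries, params):
    # Self-attention: the learnable queries exchange information with each other.
    q = queries + attention(queries, queries, queries)
    # Cross-attention: each query gathers evidence from the backbone tokens.
    q = q + attention(q, features, features)
    # One MLP per task: normalized box coordinates and per-action logits.
    boxes = 1.0 / (1.0 + np.exp(-mlp(q, params["box_w1"], params["box_w2"])))
    logits = mlp(q, params["cls_w1"], params["cls_w2"])
    return boxes, logits

# Hypothetical sizes, for illustration only.
rng = np.random.default_rng(0)
dim, num_tokens, num_queries, num_actions = 32, 49, 5, 10
features = rng.normal(size=(num_tokens, dim))   # stand-in for backbone output tokens
queries = rng.normal(size=(num_queries, dim))   # stand-in for learnable queries
params = {
    "box_w1": 0.1 * rng.normal(size=(dim, dim)),
    "box_w2": 0.1 * rng.normal(size=(dim, 4)),
    "cls_w1": 0.1 * rng.normal(size=(dim, dim)),
    "cls_w2": 0.1 * rng.normal(size=(dim, num_actions)),
}
boxes, logits = detr_style_head(features, queries, params)
print(boxes.shape, logits.shape)  # (5, 4) (5, 10)
```

Each query yields one candidate box and one vector of action scores. A real DETR-like head stacks several such layers with learned projections, which is part of the training difficulty and complexity the abstract refers to.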
Computer Vision and Pattern Recognition