Actor-agnostic Multi-label Action Recognition with Multi-modal Query

Anindya Mondal, Sauradip Nag, Joaquin M Prada, Xiatian Zhu, Anjan Dutta

DOI: https://doi.org/10.1109/ICCVW60793.2023.00086

2024-01-10

Abstract: Existing action recognition methods are typically actor-specific due to the intrinsic topological and apparent differences among the actors. This requires actor-specific pose estimation (e.g., humans vs. animals), leading to cumbersome model design complexity and high maintenance costs. Moreover, they often focus on learning the visual modality alone and single-label classification whilst neglecting other available information sources (e.g., class name text) and the concurrent occurrence of multiple actions. To overcome these limitations, we propose a new approach called 'actor-agnostic multi-modal multi-label action recognition,' which offers a unified solution for various types of actors, including humans and animals. We further formulate a novel Multi-modal Semantic Query Network (MSQNet) model in a transformer-based object detection framework (e.g., DETR), characterized by leveraging visual and textual modalities to represent the action classes better. The elimination of actor-specific model designs is a key advantage, as it removes the need for actor pose estimation altogether. Extensive experiments on five publicly available benchmarks show that our MSQNet consistently outperforms the prior arts of actor-specific alternatives on human and animal single- and multi-label action recognition tasks by up to 50%. Code is made available at https://github.com/mondalanindya/MSQNet.

Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning, Image and Video Processing

What problem does this paper attempt to address?

This paper addresses the fact that existing action recognition methods are usually tied to a specific type of actor (for example, humans or animals). Because actors differ in topology and appearance, these methods need actor-specific pose estimation, which complicates model design and raises maintenance costs. Existing methods also tend to learn from the visual modality alone and to perform single-label classification, ignoring other available information sources (such as class-name text) and the fact that several actions can occur at the same time. To address these problems, the authors propose "actor-agnostic multi-modal multi-label action recognition", a unified approach that applies to various types of actors, including humans and animals. They further design the Multi-modal Semantic Query Network (MSQNet), a model built in a Transformer-based object detection framework (such as DETR) that uses both visual and textual modalities to better represent action classes. The key advantage is that no actor-specific model design is needed, so actor pose estimation is eliminated altogether. Experiments on five public benchmarks show that MSQNet consistently outperforms previous actor-specific methods on single-label and multi-label action recognition for both humans and animals, by up to 50%.
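To make the idea of a multi-modal class query with multi-label output more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the fusion by element-wise sum, the single-head attention, the dot-product scoring, and all names (`build_query`, `attend`, `predict_actions`) are assumptions made for illustration, standing in for the learned layers a real DETR-style decoder would use. What it shows is the general pattern: one query per action class built from a class-name text embedding plus a video embedding, cross-attention over frame features, and an independent sigmoid per class so that several actions can be predicted at once.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def build_query(text_emb, video_emb):
    # Assumption: fuse the two modalities by element-wise sum.
    # A trained model would use a learned projection instead.
    return [t + v for t, v in zip(text_emb, video_emb)]


def attend(query, frame_feats):
    # Single-head scaled dot-product cross-attention:
    # the class query attends over the per-frame features.
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, f) / scale for f in frame_feats])
    dim = len(frame_feats[0])
    return [sum(w * f[i] for w, f in zip(weights, frame_feats))
            for i in range(dim)]


def predict_actions(class_text_embs, frame_feats, threshold=0.5):
    """Return (probabilities per class, list of predicted class names)."""
    n, dim = len(frame_feats), len(frame_feats[0])
    # Video-level embedding: mean of the frame features (an assumption).
    video_emb = [sum(f[i] for f in frame_feats) / n for i in range(dim)]
    probs = {}
    for name, text_emb in class_text_embs.items():
        query = build_query(text_emb, video_emb)
        attended = attend(query, frame_feats)
        # Stand-in for a learned classification head: one logit per class.
        probs[name] = sigmoid(dot(attended, query))
    # Multi-label decision: each class is thresholded independently,
    # so zero, one, or many actions may be returned.
    predicted = [name for name, p in probs.items() if p >= threshold]
    return probs, predicted


# Toy data: two frames and three hypothetical action classes.
frames = [[1.0, 0.0], [0.0, 1.0]]
classes = {"run": [2.0, 0.0], "eat": [0.0, 2.0], "sleep": [-3.0, -3.0]}
probs, predicted = predict_actions(classes, frames)
print(predicted)  # "run" and "eat" pass the threshold, "sleep" does not
```

Note that nothing in the sketch depends on a pose or skeleton model for the actor, which is the sense in which such a design is actor-agnostic. Replacing the per-class sigmoid with a softmax over all classes would turn it back into a single-label classifier.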
