Actor-agnostic Multi-label Action Recognition with Multi-modal Query

Anindya Mondal, Sauradip Nag, Joaquin M Prada, Xiatian Zhu, Anjan Dutta
DOI: https://doi.org/10.1109/ICCVW60793.2023.00086
2024-01-10
Abstract: Existing action recognition methods are typically actor-specific due to the intrinsic topological and apparent differences among the actors. This requires actor-specific pose estimation (e.g., humans vs. animals), leading to cumbersome model design complexity and high maintenance costs. Moreover, they often focus on learning the visual modality alone and single-label classification whilst neglecting other available information sources (e.g., class name text) and the concurrent occurrence of multiple actions. To overcome these limitations, we propose a new approach called 'actor-agnostic multi-modal multi-label action recognition,' which offers a unified solution for various types of actors, including humans and animals. We further formulate a novel Multi-modal Semantic Query Network (MSQNet) model in a transformer-based object detection framework (e.g., DETR), characterized by leveraging visual and textual modalities to represent the action classes better. The elimination of actor-specific model designs is a key advantage, as it removes the need for actor pose estimation altogether. Extensive experiments on five publicly available benchmarks show that our MSQNet consistently outperforms the prior arts of actor-specific alternatives on human and animal single- and multi-label action recognition tasks by up to 50%. Code is made available at https://github.com/mondalanindya/MSQNet.
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning, Image and Video Processing
What problem does this paper attempt to address?
This paper addresses the fact that existing action recognition methods are usually designed for one specific type of actor (for example, humans or animals). Because actors differ in topology and appearance, these methods depend on actor-specific pose estimation, which complicates model design and raises maintenance costs. Existing methods also tend to learn from the visual modality alone and to perform single-label classification, ignoring other available information sources (such as class-name text) and the fact that multiple actions can occur at the same time. To address these limitations, the authors propose "actor-agnostic multi-modal multi-label action recognition", a unified approach that applies to various types of actors, including humans and animals. They further design the Multi-modal Semantic Query Network (MSQNet), built within a transformer-based object detection framework (such as DETR), which uses both visual and textual modalities to represent action classes better. The key advantage is that no actor-specific model design is needed, so actor pose estimation is eliminated altogether. Experiments on five publicly available benchmarks show that MSQNet consistently outperforms previous actor-specific methods on single-label and multi-label action recognition for humans and animals, by up to 50%.
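To make the multi-modal query and multi-label ideas concrete, here is a toy sketch in plain Python. It is not the authors' implementation: the embeddings, weights, threshold, and function names (build_queries, score_queries, predict_labels) are all hypothetical, and the transformer decoder that a DETR-style model would use to refine the queries is replaced by a single shared linear scoring head. It only illustrates two points from the summary: each class query combines a class-name text embedding with a video embedding, and each class is scored independently so that several actions can be predicted at once.

```python
import math


def sigmoid(x):
    """Logistic function, used to score each class independently."""
    return 1.0 / (1.0 + math.exp(-x))


def build_queries(text_embeddings, video_embedding):
    """Build one query per class by concatenating the class-name text
    embedding with a video-level embedding (hypothetical fusion step)."""
    return [list(t) + list(video_embedding) for t in text_embeddings]


def score_queries(queries, weights, bias):
    """Score every query with one shared linear head followed by a sigmoid.
    A real DETR-style model would first pass the queries through a
    transformer decoder attending to video features; that is omitted here."""
    return [
        sigmoid(sum(q_i * w_i for q_i, w_i in zip(q, weights)) + bias)
        for q in queries
    ]


def predict_labels(class_names, probabilities, threshold=0.5):
    """Multi-label decision: keep every class whose score passes the threshold."""
    return [n for n, p in zip(class_names, probabilities) if p >= threshold]


# Made-up toy values, chosen only so the arithmetic is easy to follow.
class_names = ["running", "eating", "sleeping"]
text_embeddings = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
video_embedding = [0.5, 0.5]
weights = [2.0, 2.0, 1.0, 1.0]
bias = -2.0

queries = build_queries(text_embeddings, video_embedding)
probabilities = score_queries(queries, weights, bias)
predicted = predict_labels(class_names, probabilities)
print(predicted)  # ['running', 'eating']
```

Because each class is scored with its own sigmoid rather than a softmax over all classes, two actions can pass the threshold for the same clip. Nothing in the sketch depends on the actor's pose or body topology, which is the sense in which such a query-based design is actor-agnostic.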