Skeleton-Based Mutually Assisted Interacted Object Localization and Human Action Recognition

Liang Xu, Cuiling Lan, Wenjun Zeng, Cewu Lu
DOI: https://doi.org/10.48550/arXiv.2110.14994
2022-05-10
Abstract: Skeleton data carries valuable motion information and is widely explored in human action recognition. However, not only the motion information but also the interaction with the environment provides discriminative cues to recognize the action of persons. In this paper, we propose a joint learning framework for mutually assisted "interacted object localization" and "human action recognition" based on skeleton data. The two tasks are serialized together and collaborate to promote each other, where preliminary action type derived from skeleton alone helps improve interacted object localization, which in turn provides valuable cues for the final human action recognition. Besides, we explore the temporal consistency of interacted object as constraint to better localize the interacted object with the absence of ground-truth labels. Extensive experiments on the datasets of SYSU-3D, NTU60 RGB+D, Northwestern-UCLA and UAV-Human show that our method achieves the best or competitive performance with the state-of-the-art methods for human action recognition. Visualization results show that our method can also provide reasonable interacted object localization results.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses two problems:

1. **Challenges in human action recognition**:
   - Skeleton data alone can be insufficient or ambiguous. For example, a person standing and talking and a person standing and watching TV belong to very different action categories, yet their skeleton sequences are almost identical.
   - Existing methods ignore the interaction between humans and their environment, even though these interactions are crucial for modeling human actions.
   - Fine-grained action recognition tasks, such as analyzing how customers pick up goods in a store, are difficult to handle with skeleton data alone.
2. **Challenges in interacted object localization**:
   - Localizing interacted objects in videos is an open and little-explored problem, because labeling bounding boxes for interacting human-object pairs is expensive and cumbersome.
   - The skeleton sequence provides some cues for localizing interacted objects, such as the distance between the human skeleton and the object and which body parts are active, but these cues may not be sufficient for accurate localization without action type information.

To address these challenges, the paper proposes a joint learning framework that combines skeleton data with interacted object localization to enhance human action recognition. The framework works as follows:

- **Preliminary action classification assists interacted object localization**: A preliminary action classification is derived from the skeleton data alone and used to localize the interacted object more accurately.
- **Interacted objects assist human action recognition**: Information about the localized interacted object is fed back to improve the final action recognition.
- **Temporal consistency constraint**: The temporal consistency of the interacted object is used as a constraint, so that the object can be localized better even in the absence of ground-truth labels.

In this way, the paper aims to improve the robustness and accuracy of human action recognition while localizing interacted objects without ground-truth object labels.
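
The temporal consistency idea can be illustrated with a small sketch. The text above does not give the paper's actual loss, so everything here is an assumption made for illustration: each frame is assumed to carry a soft attention distribution over candidate objects, and the hypothetical `temporal_consistency_penalty` measures how much that distribution changes between adjacent frames. A low value means the same object stays selected over time.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_consistency_penalty(attention):
    """Mean squared change in object attention between adjacent frames.

    attention: list of T frames, each a list of K weights over candidate
    objects (hypothetical representation, not taken from the paper).
    Returns 0.0 when every frame uses the same distribution.
    """
    if len(attention) < 2:
        return 0.0
    total = 0.0
    for prev, curr in zip(attention, attention[1:]):
        total += sum((c - p) ** 2 for p, c in zip(prev, curr))
    return total / (len(attention) - 1)


# Attention that stays on object 0 in all three frames.
steady = [softmax([2.0, 0.0, 0.0])] * 3

# Attention that jumps to a different object in every frame.
jumpy = [
    softmax([2.0, 0.0, 0.0]),
    softmax([0.0, 2.0, 0.0]),
    softmax([0.0, 0.0, 2.0]),
]

print(temporal_consistency_penalty(steady))
print(temporal_consistency_penalty(jumpy) > temporal_consistency_penalty(steady))
```

In a training setup, a term of this kind would presumably be added to the action recognition loss, so that object localization is shaped by consistency over time rather than by bounding-box annotations.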