Abstract: Skeleton data carries valuable motion information and is widely explored in human action recognition. However, not only the motion information but also the interaction with the environment provides discriminative cues to recognize the action of persons. In this paper, we propose a joint learning framework for mutually assisted "interacted object localization" and "human action recognition" based on skeleton data. The two tasks are serialized together and collaborate to promote each other, where preliminary action type derived from skeleton alone helps improve interacted object localization, which in turn provides valuable cues for the final human action recognition. Besides, we explore the temporal consistency of interacted object as constraint to better localize the interacted object with the absence of ground-truth labels. Extensive experiments on the datasets of SYSU-3D, NTU60 RGB+D, Northwestern-UCLA and UAV-Human show that our method achieves the best or competitive performance with the state-of-the-art methods for human action recognition. Visualization results show that our method can also provide reasonable interacted object localization results.

What problem does this paper attempt to address?

The paper addresses two groups of problems:

1. **Challenges in human action recognition**:
   - Skeleton data alone can be insufficient or ambiguous. For example, a person standing and talking and a person standing and watching TV belong to different action categories, yet their skeleton sequences are almost identical.
   - Existing methods ignore the interaction between humans and the environment, although these interactions are crucial for modeling human actions.
   - Fine-grained action recognition tasks, such as analyzing how customers pick goods in a store, are difficult to handle with skeleton data alone.
2. **Challenges in interacted object localization**:
   - Localizing interacted objects in videos is an open and little-explored problem, because labeling the bounding boxes of interacting human-object pairs is expensive and cumbersome.
   - The skeleton sequence provides some clues for localizing interacted objects, such as the distance between the human skeleton and the object or which body parts are active, but without action type information these clues may not be sufficient for accurate localization.

To address these challenges, the paper proposes a joint learning framework that combines skeleton data with interacted object localization to enhance human action recognition. The framework works in the following ways:

- **Preliminary action classification assists interacted object localization**: Skeleton data is used to produce a preliminary action classification, which helps localize the interacted object more accurately.
- **Interacted objects assist human action recognition**: Information from the localized interacted object is used to further improve action recognition.
- **Temporal consistency constraint**: The temporal consistency of the interacted object is used as a constraint, so that the object can be localized better even without ground-truth object labels.

In this way, the paper aims to improve the robustness and accuracy of human action recognition while localizing interacted objects without ground-truth localization labels.
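The serialized flow described above (preliminary action from the skeleton, then object localization, then final action) can be sketched in a few lines. This is only an illustrative sketch, not the paper's architecture: the linear layers, the attention-style object scoring, the squared-difference consistency term, and all names (`localize_objects`, `temporal_consistency_loss`, `recognize`, the `params` keys) are assumptions introduced here. It also assumes each frame offers the same `K` tracked candidate objects, so that index `k` refers to the same object over time.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # stabilize the exponent
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def localize_objects(skel_feat, action_probs, obj_feats, W_q):
    """Weight K candidate objects per frame (hypothetical scoring scheme).

    skel_feat: (T, D), action_probs: (C,), obj_feats: (T, K, D), W_q: (D + C, D).
    Returns per-frame weights of shape (T, K) that sum to 1 over K.
    """
    T = skel_feat.shape[0]
    # Condition the query on the preliminary action, as the summary describes.
    cond = np.concatenate([skel_feat, np.tile(action_probs, (T, 1))], axis=1)
    query = cond @ W_q                                   # (T, D)
    scores = np.einsum("td,tkd->tk", query, obj_feats)   # (T, K)
    return softmax(scores, axis=1)

def temporal_consistency_loss(attn):
    """Assumed constraint: penalize changes in object weights between adjacent frames."""
    return float(np.mean((attn[1:] - attn[:-1]) ** 2))

def recognize(skeleton_seq, obj_feats, params):
    """Serialized pipeline: preliminary action -> object localization -> final action."""
    skel_feat = skeleton_seq @ params["W_s"]                       # (T, D)
    prelim = softmax(skel_feat.mean(axis=0) @ params["W_pre"])     # (C,)
    attn = localize_objects(skel_feat, prelim, obj_feats, params["W_q"])
    obj_ctx = np.einsum("tk,tkd->td", attn, obj_feats)             # (T, D)
    fused = np.concatenate([skel_feat, obj_ctx], axis=1).mean(axis=0)
    final = softmax(fused @ params["W_fin"])                       # (C,)
    return prelim, attn, final

# Toy usage with random weights; the sizes are arbitrary placeholders.
rng = np.random.default_rng(0)
T, J, D, K, C = 8, 75, 16, 5, 10
params = {
    "W_s": rng.normal(size=(J, D)),
    "W_pre": rng.normal(size=(D, C)),
    "W_q": rng.normal(size=(D + C, D)),
    "W_fin": rng.normal(size=(2 * D, C)),
}
prelim, attn, final = recognize(
    rng.normal(size=(T, J)), rng.normal(size=(T, K, D)), params
)
loss_tc = temporal_consistency_loss(attn)
```

In a trainable version, a term like `loss_tc` would presumably be added to the classification losses so that the object weights are shaped by both action supervision and temporal smoothness, since no object labels are available.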

Skeleton-Based Mutually Assisted Interacted Object Localization and Human Action Recognition
