Elevating Skeleton-Based Action Recognition with Efficient Multi-Modality Self-Supervision

Yiping Wei, Kunyu Peng, Alina Roitberg, Jiaming Zhang, Junwei Zheng, Ruiping Liu, Yufan Chen, Kailun Yang, Rainer Stiefelhagen

2024-01-11

Abstract: Self-supervised representation learning for human action recognition has developed rapidly in recent years. Most existing works are based on skeleton data and use a multi-modality setup. These works overlook the differences in performance among modalities, which leads to the propagation of erroneous knowledge between modalities. Moreover, they use only three fundamental modalities, i.e., joints, bones, and motions, and explore no additional ones. In this work, we first propose an Implicit Knowledge Exchange Module (IKEM), which alleviates the propagation of erroneous knowledge between low-performance modalities. Then, we propose three new modalities to enrich the complementary information between modalities. Finally, to maintain efficiency when introducing new modalities, we propose a novel teacher-student framework, named relational cross-modality knowledge distillation, which distills the knowledge from the secondary modalities into the mandatory modalities while considering the relationship constrained by anchors, positives, and negatives. The experimental results demonstrate the effectiveness of our approach, unlocking the efficient use of skeleton-based multi-modality data. Source code will be made publicly available at https://github.com/desehuileng0o0/IKEM.
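
The relational distillation idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it shows one generic form of relational distillation, in which the student's similarity distribution over a set of reference embeddings is matched to the teacher's with a KL divergence. The function names, the shared temperature, and the use of a reference bank standing in for positives and negatives are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def l2_normalize(x, axis=-1, eps=1e-12):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def relational_kd_loss(student_q, teacher_q, student_bank, teacher_bank, tau=0.1):
    """Hypothetical relational distillation loss (illustrative only).

    student_q, teacher_q: (B, D_s) and (B, D_t) embeddings of the same anchors.
    student_bank, teacher_bank: (K, D_s) and (K, D_t) embeddings of the same
        K reference samples (stand-ins for positives and negatives).
    tau: softmax temperature (assumed value, not taken from the paper).
    """
    s_q, t_q = l2_normalize(student_q), l2_normalize(teacher_q)
    s_b, t_b = l2_normalize(student_bank), l2_normalize(teacher_bank)
    # Each row is a distribution describing how an anchor relates to the bank.
    p_teacher = softmax(t_q @ t_b.T / tau)
    p_student = softmax(s_q @ s_b.T / tau)
    # KL(teacher || student), averaged over anchors.
    eps = 1e-12
    kl = (p_teacher * (np.log(p_teacher + eps) - np.log(p_student + eps))).sum(axis=1)
    return float(kl.mean())
```

Under these assumptions, the loss is zero when the student reproduces the teacher's anchor-to-bank relationships exactly and grows as the two similarity structures diverge, so the student is pushed to match relationships rather than raw feature values.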

Computer Vision and Pattern Recognition, Multimedia, Robotics, Image and Video Processing

What problem does this paper attempt to address?

The paper addresses how to use multi-modal data efficiently in self-supervised human action recognition based on skeleton data. Existing works mostly rely on three basic modalities (joints, bones, and motion) but overlook the performance differences between them, which leads to erroneous knowledge propagating from one modality to another. These methods also do not explore additional modalities that could enrich the available information. To address these issues, the paper proposes the following:

1. **Implicit Knowledge Exchange Module (IKEM)**: This module transfers knowledge implicitly by evaluating the similarity between all modalities instead of explicitly mining additional positive samples, thereby reducing the propagation of erroneous knowledge.
2. **Introduction of Three New Modalities**: In addition to the original joint, bone, and motion modalities, the paper introduces acceleration, rotation axis direction, and joint angular velocity to enhance the complementary information between modalities.
3. **Cross-Modal Knowledge Distillation Framework**: To keep the model efficient while introducing new modalities, the paper proposes a teacher-student framework that distills the knowledge of the secondary modalities into the primary (mandatory) modalities by considering the relationships among anchors, positive samples, and negative samples, improving performance without significantly increasing model complexity.

Experimental results on the NTU-RGB+D 60 dataset show that the proposed method achieves significant performance improvements under both the cross-subject and cross-view evaluation protocols.
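
To make the notion of skeleton modalities concrete, the following minimal sketch derives bone, motion, and acceleration streams from raw joint coordinates using the commonly used spatial and temporal differences. The exact definitions used in the paper are not given here, so these formulas, the toy parent table `PARENTS`, and the function name `derive_modalities` are assumptions. The rotation axis direction and joint angular velocity modalities are omitted because their definitions are not stated in this summary.

```python
import numpy as np

# Hypothetical 5-joint toy skeleton: PARENTS[j] is the parent of joint j.
# The root (joint 0) is its own parent, so its bone vector is zero.
PARENTS = np.array([0, 0, 1, 2, 2])

def derive_modalities(joints, parents=PARENTS):
    """Derive extra streams from joints of shape (T, V, 3): frames, joints, xyz."""
    joints = np.asarray(joints, dtype=float)
    # Bone: vector pointing from each joint's parent to the joint itself.
    bones = joints - joints[:, parents, :]
    # Motion: first-order temporal difference, zero-padded at the last frame.
    motion = np.zeros_like(joints)
    motion[:-1] = joints[1:] - joints[:-1]
    # Acceleration: second-order temporal difference, zero-padded at the
    # last two frames so every stream keeps the same length T.
    accel = np.zeros_like(joints)
    accel[:-2] = joints[2:] - 2.0 * joints[1:-1] + joints[:-2]
    return {"joint": joints, "bone": bones, "motion": motion, "acceleration": accel}
```

Each stream keeps the input shape, so in a multi-modality setup the same encoder architecture can be applied to every stream without changes.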

Shifting Perspective to See Difference: A Novel Multi-View Method for Skeleton Based Action Recognition

MS²L: Multi-Task Self-Supervised Learning for Skeleton Based Action Recognition

Skeleton-Indexed Deep Multi-Modal Feature Learning for High Performance Human Action Recognition

Multi-Modality Co-Learning for Efficient Skeleton-based Action Recognition

Unveiling the Hidden Realm: Self-supervised Skeleton-based Action Recognition in Occluded Environments

Exploring Self-Supervised Skeleton-Based Human Action Recognition under Occlusions

Semi-supervised learning for skeleton behavior recognition: A multi-dimensional graph comparison approach

Enhancing Skeleton-Based Action Recognition with Language Descriptions from Pre-trained Large Multimodal Models

EMS2L: Enhanced Multi-Task Self-Supervised Learning for 3D Skeleton Representation Learning

Vision-Language Meets the Skeleton: Progressively Distillation with Cross-Modal Knowledge for 3D Action Representation Learning

Skeleton MixFormer: Multivariate Topology Representation for Skeleton-based Action Recognition

Contrast-reconstruction Representation Learning for Self-supervised Skeleton-based Action Recognition

A Bidirectional Separated Distillation-Based Cross-Modal Interactive Fusion Network for Skeleton-Based Action Recognition

A Key Skeleton Points Guided Classroom Action Recognition Method Based on Multimodal Symmetry Fusion

Hierarchical Human Action Recognition With Self-Selection Classifiers Via Skeleton Data

Temporal Cues Enhanced Multimodal Learning for Action Recognition in RGB-D Videos

Skeleton-based Action Recognition with Non-linear Dependency Modeling and Hilbert-Schmidt Independence Criterion

Skeleton-Based Mutually Assisted Interacted Object Localization and Human Action Recognition

Improving bag-of-poses with semi-temporal pose descriptors for skeleton-based action recognition

Skeleton-Based Human Action Recognition Via Multi-Knowledge Flow Embedding Hierarchically Decomposed Graph Convolutional Network