Cross-view Action Recognition via Contrastive View-invariant Representation

Yuexi Zhang, Dan Luo, Balaji Sundareshan, Octavia Camps, Mario Sznaier

2023-05-03

Abstract: Cross-view action recognition (CVAR) seeks to recognize a human action when observed from a previously unseen viewpoint. This is a challenging problem since the appearance of an action changes significantly with the viewpoint. Applications of CVAR include surveillance and monitoring of assisted living facilities, where it is not practical or feasible to collect large amounts of training data when adding a new camera. We present a simple yet efficient CVAR framework to learn invariant features from either RGB videos, 3D skeleton data, or both. The proposed approach outperforms the current state of the art, achieving similar levels of performance across input modalities: 99.4% (RGB) and 99.9% (3D skeletons); 99.4% (RGB) and 99.9% (3D skeletons); 97.3% (RGB) and 99.2% (3D skeletons); and 84.4% (RGB) for the N-UCLA, NTU-RGB+D 60, NTU-RGB+D 120, and UWA3DII datasets, respectively.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper primarily addresses the issue of Cross-view Action Recognition (CVAR). Specifically:

1. **Problem Description**:
   - **Objective**: To recognize a person's actions from previously unseen viewpoints.
   - **Challenge**: Due to different viewpoints, the same action can appear significantly different, making this task highly challenging.
2. **Application Scenarios**:
   - Surveillance and Assisted Living Facility Monitoring: When adding new cameras, collecting a large amount of training data is neither practical nor feasible, thus requiring methods that can handle unseen viewpoints.
3. **Method Contributions**:
   - Proposes a simple yet efficient CVAR framework that can learn invariant features from RGB videos, 3D skeleton data, or a combination of both.
   - The method achieves excellent performance on multiple benchmark datasets (such as N-UCLA, NTU-RGB+D 60, NTU-RGB+D 120, and UWA3DII), reaching or even surpassing the level of existing state-of-the-art methods.
4. **Technical Highlights**:
   - Utilizes Dynamics-based Invariant Representation (DIR) to capture dynamic information in joint movements.
   - Achieves high performance even when using only RGB data, making it possible to train with smaller datasets and avoiding the need for expensive 3D data.

In summary, the paper aims to develop a new method capable of recognizing actions from different viewpoints, thereby overcoming the limitations of traditional methods that rely on specific viewpoints or large-scale datasets.
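The summary does not spell out the training objective, although the paper's title refers to a contrastive view-invariant representation. As a rough illustration of what a contrastive objective across viewpoints can look like, the sketch below implements a generic InfoNCE-style loss in NumPy. It is not the authors' loss: the function name `info_nce_loss`, the `temperature` value, and the random feature arrays are all assumptions made for the example. Row `i` of each input is assumed to hold features of the same action clip seen from two different cameras.

```python
import numpy as np

def info_nce_loss(feats_view_a, feats_view_b, temperature=0.1):
    """Generic InfoNCE loss (illustrative, not the paper's objective).

    Row i of feats_view_a and row i of feats_view_b are assumed to be
    features of the same clip captured from two different viewpoints.
    """
    # L2-normalize so that the dot product is a cosine similarity.
    a = feats_view_a / np.linalg.norm(feats_view_a, axis=1, keepdims=True)
    b = feats_view_b / np.linalg.norm(feats_view_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature
    # Subtract the row maximum for a numerically stable log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The matching pair for row i sits on the diagonal.
    return float(-np.mean(np.diag(log_probs)))

# Hypothetical features: 8 clips, 16-dimensional embeddings.
rng = np.random.default_rng(0)
clips = rng.normal(size=(8, 16))
aligned = info_nce_loss(clips, clips + 0.01 * rng.normal(size=(8, 16)))
shuffled = info_nce_loss(clips, clips[::-1].copy())
print(f"aligned pairs: {aligned:.4f}, mismatched pairs: {shuffled:.4f}")
```

The loss is low when features of the same clip agree across views and high when the pairing is wrong, which is the property a view-invariant encoder would be trained to exploit.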

Cross-view Action Recognition Understanding From Exocentric to Egocentric Perspective

View-invariant Human Action Recognition Via Robust Locally Adaptive Multi-View Learning

View-invariant action recognition: a survey

3D Human Action Representation Learning via Cross-View Consistency Pursuit

Multi-layer Representation for Cross-view Action Recognition

Hierarchically Learned View-Invariant Representations for Cross-View Action Recognition

Unsupervised View-Invariant Human Posture Representation

Shifting Perspective to See Difference: A Novel Multi-View Method for Skeleton Based Action Recognition

Discriminative virtual views for cross-view action recognition

Arbitrary-view human action recognition via novel-view action generation

Cross-modality Online Distillation for Multi-View Action Recognition

View-Robust Neural Networks for Unseen Human Action Recognition in Videos

Cross-view action recognition via view knowledge transfer

Cross-view Action Modeling, Learning and Recognition

A Large-scale Varying-view RGB-D Action Dataset for Arbitrary-view Human Action Recognition

Annealing Temporal-Spatial Contrastive Learning for Multi-View Online Action Detection

Multi-View Region Adaptive Multi-temporal DMM and RGB Action Recognition

Action Recognition with Domain Invariant Features of Skeleton Image

Collaborative Attention Mechanism for Multi-View Action Recognition