Cross-view Action Recognition via Contrastive View-invariant Representation

Yuexi Zhang, Dan Luo, Balaji Sundareshan, Octavia Camps, Mario Sznaier
2023-05-03
Abstract: Cross-view action recognition (CVAR) seeks to recognize a human action when observed from a previously unseen viewpoint. This is a challenging problem since the appearance of an action changes significantly with the viewpoint. Applications of CVAR include surveillance and monitoring of assisted living facilities, where it is not practical or feasible to collect large amounts of training data when adding a new camera. We present a simple yet efficient CVAR framework to learn invariant features from either RGB videos, 3D skeleton data, or both. The proposed approach outperforms the current state of the art, achieving similar levels of performance across input modalities: 99.4% (RGB) and 99.9% (3D skeletons) on N-UCLA, 99.4% (RGB) and 99.9% (3D skeletons) on NTU-RGB+D 60, 97.3% (RGB) and 99.2% (3D skeletons) on NTU-RGB+D 120, and 84.4% (RGB) on UWA3DII.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems Addressed by the Paper

The paper primarily addresses the issue of Cross-view Action Recognition (CVAR). Specifically:

1. **Problem Description**:
   - **Objective**: To recognize a person's actions from previously unseen viewpoints.
   - **Challenge**: Due to different viewpoints, the same action can appear significantly different, making this task highly challenging.
2. **Application Scenarios**:
   - Surveillance and Assisted Living Facility Monitoring: When adding new cameras, collecting a large amount of training data is neither practical nor feasible, thus requiring methods that can handle unseen viewpoints.
3. **Method Contributions**:
   - Proposes a simple yet efficient CVAR framework that can learn invariant features from RGB videos, 3D skeleton data, or a combination of both.
   - The method achieves excellent performance on multiple benchmark datasets (such as N-UCLA, NTU-RGB+D 60, NTU-RGB+D 120, and UWA3DII), reaching or even surpassing the level of existing state-of-the-art methods.
4. **Technical Highlights**:
   - Utilizes Dynamics-based Invariant Representation (DIR) to capture dynamic information in joint movements.
   - Achieves high performance even when using only RGB data, making it possible to train with smaller datasets and avoiding the need for expensive 3D data.

In summary, the paper aims to develop a new method capable of recognizing actions from different viewpoints, thereby overcoming the limitations of traditional methods that rely on specific viewpoints or large-scale datasets.
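
The cross-view setting summarized here, where the test viewpoint is never seen during training, can be illustrated with a leave-one-view-out split. The sketch below is only an illustration of that evaluation idea and is not taken from the paper: the function name `leave_one_view_out`, the sample fields (`clip`, `view`, `action`), and the camera names are all hypothetical.

```python
def leave_one_view_out(samples, held_out_view):
    """Split samples so that the held-out view never appears in training.

    `samples` is a list of dicts with a "view" key (hypothetical schema).
    """
    train = [s for s in samples if s["view"] != held_out_view]
    test = [s for s in samples if s["view"] == held_out_view]
    return train, test


# Hypothetical toy data: two actions recorded from three cameras.
samples = [
    {"clip": "c1", "view": "cam1", "action": "sit"},
    {"clip": "c2", "view": "cam1", "action": "wave"},
    {"clip": "c3", "view": "cam2", "action": "sit"},
    {"clip": "c4", "view": "cam2", "action": "wave"},
    {"clip": "c5", "view": "cam3", "action": "sit"},
    {"clip": "c6", "view": "cam3", "action": "wave"},
]

# Train on cam1 and cam2, evaluate on the unseen cam3.
train, test = leave_one_view_out(samples, "cam3")
train_views = sorted({s["view"] for s in train})
test_views = sorted({s["view"] for s in test})
```

In this toy split the same action labels occur on both sides, but the test camera contributes no training clips, so a model is scored only on how well its features transfer to the new viewpoint.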