Abstract: The lack of occlusion data in commonly used action recognition video datasets limits model robustness and impedes sustained performance improvements. We construct OccludeNet, a large-scale occluded video dataset that includes both real-world and synthetic occlusion scene videos under various natural environments. OccludeNet features dynamic tracking occlusion, static scene occlusion, and multi-view interactive occlusion, addressing existing gaps in data. Our analysis reveals that occlusion impacts action classes differently, with actions involving low scene relevance and partial body visibility experiencing greater accuracy degradation. To overcome the limitations of current occlusion-focused approaches, we propose a structural causal model for occluded scenes and introduce the Causal Action Recognition (CAR) framework, which employs backdoor adjustment and counterfactual reasoning. This framework enhances key actor information, improving model robustness to occlusion. We anticipate that the challenges posed by OccludeNet will stimulate further exploration of causal relations in occlusion scenarios and encourage a reevaluation of class correlations, ultimately promoting sustainable performance improvements. The code and full dataset will be released soon.

What problem does this paper attempt to address?

The paper addresses the limited robustness and stalled performance gains of video action recognition models when actors are occluded. Specifically:

1. **Limitations of existing datasets**: Commonly used action recognition video datasets lack occlusion scenarios, which restricts how well models perform in the real world. When actors are partially or completely occluded, model performance declines significantly.
2. **Unequal impact of occlusion across action categories**: Occlusion does not affect all action categories equally. Actions with low scene relevance and partial body visibility suffer a more severe drop in accuracy.
3. **Shortcomings of existing methods**: Current de-occlusion models usually focus on minimizing the impact of the occluded area. They often overlook the relationships between scene elements and fail to capture the causal relationships among occluders, backgrounds, visible parts of actors, and predictions.

To overcome these problems, the paper proposes the following:

- **Constructing the OccludeNet dataset**: A large-scale occluded video dataset that includes both real-world and synthetic occlusion scenarios and covers several occlusion types: dynamic tracking occlusion, static scene occlusion, and multi-view interactive occlusion.
- **Introducing the Causal Action Recognition (CAR) framework**: Using a structural causal model (SCM) and counterfactual reasoning, the framework strengthens the model's causal attention to the features of unoccluded actors, thereby improving robustness in occluded environments.
- **Analyzing the impact of occlusion on different action categories**: The study finds that occlusion has a greater impact on actions involving low scene relevance and partial body visibility, which underlines the need for methods tailored to different occlusion strategies.

Through these contributions, the paper aims to promote research on action recognition in occluded environments and to facilitate the application of models in complex real-world scenarios.
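
The backdoor adjustment that the abstract names as part of CAR can be illustrated on a toy discrete example. The sketch below is not the paper's implementation: the variables (a binary confounder `Z` standing in for something like occluder presence, a binary feature `X`, a binary label `Y`) and every probability value are hypothetical, chosen only to show how the adjusted quantity P(Y | do(X)) = Σ_z P(Y | X, z) P(z) can differ from the plain conditional P(Y | X).

```python
import numpy as np

# Hypothetical toy distribution (all numbers are illustrative assumptions).
p_z = np.array([0.7, 0.3])                # P(Z=z)
p_x_given_z = np.array([[0.8, 0.2],       # P(X=x | Z=0) for x = 0, 1
                        [0.3, 0.7]])      # P(X=x | Z=1) for x = 0, 1
p_y1_given_xz = np.array([[0.2, 0.4],     # P(Y=1 | X=x, Z=0) for x = 0, 1
                          [0.5, 0.9]])    # P(Y=1 | X=x, Z=1) for x = 0, 1


def backdoor_adjust(p_y1_given_xz, p_z, x):
    """P(Y=1 | do(X=x)): average over the confounder's marginal P(z)."""
    return float(np.sum(p_y1_given_xz[:, x] * p_z))


def observational(p_y1_given_xz, p_x_given_z, p_z, x):
    """P(Y=1 | X=x): average over the posterior P(z | x) via Bayes' rule."""
    joint = p_x_given_z[:, x] * p_z       # P(X=x, Z=z)
    p_z_given_x = joint / joint.sum()     # P(Z=z | X=x)
    return float(np.sum(p_y1_given_xz[:, x] * p_z_given_x))


adjusted = backdoor_adjust(p_y1_given_xz, p_z, x=1)
observed = observational(p_y1_given_xz, p_x_given_z, p_z, x=1)
print(f"P(Y=1 | do(X=1)) = {adjusted:.2f}")
print(f"P(Y=1 | X=1)     = {observed:.2f}")
```

In this toy setting the observational estimate is inflated because `Z` influences both `X` and `Y`; weighting by P(z) instead of P(z | x) removes that spurious path. How CAR applies the idea to video features is described in the paper itself and is not reproduced here.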

OccludeNet: A Causal Journey into Mixed-View Actor-Centric Video Action Recognition under Occlusions
