Egocentric zone-aware action recognition across environments
Simone Alberto Peirone,Gabriele Goletto,Mirco Planamente,Andrea Bottino,Barbara Caputo,Giuseppe Averta
2024-09-22
Abstract: Human activities exhibit a strong correlation between actions and the places where these are performed, such as washing something at a sink. More specifically, in daily living environments we may identify particular locations, hereinafter named activity-centric zones, which may afford a set of homogeneous actions. Their knowledge can serve as a prior to favor vision models to recognize human activities. However, the appearance of these zones is scene-specific, limiting the transferability of this prior information to unfamiliar areas and domains. This problem is particularly relevant in egocentric vision, where the environment takes up most of the image, making it even more difficult to separate the action from the context. In this paper, we discuss the importance of decoupling the domain-specific appearance of activity-centric zones from their universal, domain-agnostic representations, and show how the latter can improve the cross-domain transferability of Egocentric Action Recognition (EAR) models. We validate our solution on the EPIC-Kitchens-100 and Argo1M datasets.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses Egocentric Action Recognition (EAR), that is, recognizing human activities from a first-person viewpoint, across different environments. In particular, it asks how a model can use environmental affordances to recognize actions more accurately in environments it has not seen during training.
Specifically, the paper points out that current EAR models learn, during training, the associations between actions and the locations where they occur in specific environments. This co-occurrence bias makes the model depend on the appearance of the training environments. When such a model is then applied to a new, unseen environment, it cannot discount the change in appearance, and its performance drops significantly.
To address this problem, the paper proposes EgoZAR (Egocentric Zone-Aware Action Recognition), a method that improves generalization across environments by extracting and using environment-independent representations of activity-centric zones. Specific contributions include:
1. **Revealing the side effects of co-occurrence bias**: The paper shows that existing EAR models, when processing first-person videos, implicitly learn domain-specific information about the environment (domain-specific activity-centric zones), which limits their performance in new environments.
2. **Proposing the EgoZAR architecture**: EgoZAR adopts a more general, domain-agnostic representation of activity-centric zones, which lets the model exploit environmental affordances in unseen domains and improves action recognition there.
3. **Experimental verification**: Extensive experiments on the EPIC-Kitchens-100 and Argo1M datasets show that domain-agnostic environmental representations significantly improve action recognition, especially in unseen environments, and achieve state-of-the-art domain generalization performance.
### Formula Summary
Some of the key formulas involved in the paper are as follows:
- **Updating regional features and action features**:
\[
o^z_i = x^z_i+\sigma\left(\frac{Q_z(x^z_i)K_z(x^z_i)^T}{\sqrt{D_z}}\right)\cdot V_z(x^z_i),\quad\tilde{x}^z_i = o^z_i+F_z(o^z_i)
\]
\[
o^m_i = x^m_i+\sigma\left(\frac{Q_m(x^z_i)K_m(x^m_i)^T}{\sqrt{D_z}}\right)\cdot V_m(x^m_i),\quad\tilde{x}^m_i = o^m_i+F_m(o^m_i)
\]
- **Pseudo-label assignment**:
\[
y^z_i=\arg\min_k\|x^z_i - c_k\|_2
\]
These formulas show how EgoZAR uses attention to separate regional features from action features, and how it assigns each sample a pseudo-label, namely the index of its nearest cluster centroid, so that activity-centric zones are learned without supervision.
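The two steps above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation. It assumes that \sigma is a row-wise softmax applied to the scaled scores before they multiply the values, that the projections Q, K and V are plain weight matrices without bias, and that the query and context sequences have the same length so the residual sum is well defined. All function names (`residual_attention`, `residual_feed_forward`, `assign_pseudo_labels`) and the toy inputs are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def residual_attention(x_query, x_context, w_q, w_k, w_v):
    """o = x_context + softmax(Q(x_query) K(x_context)^T / sqrt(D)) V(x_context).

    Passing the same features as query and context gives the first (zone)
    update; passing zone features as query and action features as context
    gives the second update. Assumes both inputs have the same number of rows.
    """
    q = matmul(x_query, w_q)
    k = matmul(x_context, w_k)
    v = matmul(x_context, w_v)
    scale = math.sqrt(len(q[0]))              # sqrt of the projected dimension
    k_t = [list(col) for col in zip(*k)]      # transpose of K
    scores = matmul(q, k_t)
    weights = [softmax([s / scale for s in row]) for row in scores]
    attended = matmul(weights, v)
    return [[xi + ai for xi, ai in zip(x_row, a_row)]
            for x_row, a_row in zip(x_context, attended)]

def residual_feed_forward(o, feed_forward):
    """x_tilde = o + F(o), where feed_forward maps one feature row to another."""
    return [[oi + fi for oi, fi in zip(row, feed_forward(row))] for row in o]

def assign_pseudo_labels(features, centroids):
    """Label each feature with the index of its nearest centroid (L2 distance)."""
    labels = []
    for x in features:
        distances = [math.dist(x, c) for c in centroids]
        labels.append(distances.index(min(distances)))
    return labels

# Toy inputs (hypothetical): two 2-dimensional features per stream.
identity = [[1.0, 0.0], [0.0, 1.0]]
x_z = [[1.0, 0.0], [0.0, 1.0]]    # zone features
x_m = [[0.5, 0.5], [0.2, 0.8]]    # action features

o_z = residual_attention(x_z, x_z, identity, identity, identity)   # zone update
o_m = residual_attention(x_z, x_m, identity, identity, identity)   # action update
x_z_tilde = residual_feed_forward(o_z, lambda row: [0.0 for _ in row])  # F = 0 placeholder

labels = assign_pseudo_labels([[0.1, 0.0], [0.9, 1.0]],
                              [[0.0, 0.0], [1.0, 1.0]])
```

With these toy inputs, `labels` is `[0, 1]`: each feature is assigned the index of the centroid it lies closest to. How the centroids themselves are obtained (for example, by running a clustering algorithm over zone features) is outside the scope of this sketch.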
In conclusion, the paper addresses the performance degradation of existing EAR models in new environments by introducing domain-agnostic representations of activity-centric zones, which improves the model's ability to generalize.