Abstract: The sequential execution of actions and their hierarchical structure, which spans different levels of abstraction, provide features that remain unexplored in action recognition. In this study, we present a novel approach that improves action recognition by exploiting the hierarchical organization of actions and by incorporating contextualized textual information, including location and prior actions, to reflect the sequential context. To this end, we introduce a novel transformer architecture tailored for action recognition that uses both visual and textual features. Visual features are obtained from RGB and optical flow data, while text embeddings represent contextual information. Furthermore, we define a joint loss function that trains the model for coarse and fine-grained action recognition simultaneously, thereby exploiting the hierarchical nature of actions. To demonstrate the effectiveness of our method, we extend the Toyota Smarthome Untrimmed (TSU) dataset with action hierarchies, creating the Hierarchical TSU dataset. We also conduct an ablation study to assess how different methods of integrating contextual and hierarchical data affect action recognition performance. Results show that the proposed approach outperforms pre-trained SOTA methods when trained with the same hyperparameters. They also show a 17.12% improvement in top-1 accuracy over the equivalent fine-grained RGB version when using ground-truth contextual information, and a 5.33% improvement when contextual information is obtained from actual predictions.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper aims to improve action recognition tasks by leveraging the hierarchical structure of actions and contextual information from text. Specifically, the authors propose a novel approach to enhance the accuracy of action recognition through the following points:

1. **Utilizing the Hierarchical Structure of Actions**: Actions typically have different levels of abstraction, and these hierarchical structures have not been fully utilized in existing action recognition tasks. The authors introduce a hierarchical action classification system that combines coarse-grained and fine-grained action categories to better capture the complexity and diversity of actions.
2. **Fusing Visual and Textual Features**: In addition to traditional visual features (such as RGB images and optical flow data), the authors introduce text embeddings to represent contextual information, including location and previous actions. These textual features, combined with visual features, provide a richer description of actions, thereby improving recognition accuracy.
3. **Introducing a New Transformer Architecture**: The authors design a Transformer architecture specifically for action recognition, capable of processing both visual and textual features simultaneously. This architecture captures long-term dependencies through a self-attention mechanism, thus better modeling action sequences over time.
4. **Joint Loss Function**: To train both coarse-grained and fine-grained action classifications simultaneously, the authors define a joint loss function. This approach not only improves the accuracy of fine-grained action recognition but also enhances the overall performance of the model through coarse-grained supervision.

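
The joint loss in point 4 can be illustrated with a minimal sketch. The summary does not give the exact formulation, so the weighted sum of two cross-entropy terms below, the weight `alpha`, and all function names are assumptions made for illustration, not the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability assigned to the target class index."""
    return -math.log(softmax(logits)[target])


def joint_loss(coarse_logits, fine_logits, coarse_target, fine_target, alpha=0.5):
    """Hypothetical joint objective: a weighted sum of the coarse and fine losses.

    alpha is an assumed mixing weight; the source does not state how the
    two terms are combined.
    """
    coarse_term = cross_entropy(coarse_logits, coarse_target)
    fine_term = cross_entropy(fine_logits, fine_target)
    return alpha * coarse_term + (1.0 - alpha) * fine_term


# One clip scored against 2 coarse classes and 4 fine-grained classes.
loss = joint_loss([2.0, 0.1], [0.3, 1.5, -0.2, 0.0], coarse_target=0, fine_target=1)
print(f"joint loss: {loss:.4f}")
```

Under this assumed form, both classification heads receive gradient from every sample, which is one way the coarse labels could act as additional supervision for the fine-grained task.
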
### Experiments and Results

To validate the effectiveness of the proposed method, the authors extended the existing Toyota Smarthome Untrimmed (TSU) dataset by introducing a hierarchical structure of actions, forming the Hierarchical TSU dataset. Experimental results show that the proposed method outperforms pre-trained state-of-the-art methods when trained with the same hyperparameters. Compared with the equivalent fine-grained RGB version, it improves top-1 accuracy by 17.12% when using ground-truth contextual information and by 5.33% when the contextual information is obtained from actual predictions.

### Main Contributions

1. **Proposed a New Vision-Language Transformer Architecture**: This architecture leverages the hierarchical structure of actions and contextual information to improve the accuracy of action recognition.
2. **Constructed the Hierarchical TSU Dataset**: This is an extended TSU dataset that includes coarse-grained and fine-grained action annotations as well as rich textual information.
3. **Conducted Extensive Experiments**: Through detailed ablation studies, the authors analyzed the impact of different components and contextual information on model performance.

Overall, this paper provides a more effective method for action recognition by introducing multi-level action structures and contextual information, offering new insights and tools for research in the related field.
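
As a rough illustration of what a Hierarchical TSU annotation with textual context might contain, the sketch below pairs a coarse and a fine label with a location and a list of prior actions, then renders the context as a plain string that a text encoder could embed. The record layout, field names, label values, and the `context_text` helper are all hypothetical; the source does not specify the dataset schema or the prompt format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ClipAnnotation:
    """Hypothetical annotation record; field names are illustrative only."""
    fine_label: str
    coarse_label: str
    location: str
    prior_actions: List[str] = field(default_factory=list)


def context_text(ann: ClipAnnotation, max_prior: int = 2) -> str:
    """Render location and the most recent prior actions as one string.

    The output format is an assumption; a real pipeline would pass this
    string to a text encoder to obtain the contextual embedding.
    """
    recent = ann.prior_actions[-max_prior:] if max_prior > 0 else []
    prior = ", ".join(recent) if recent else "none"
    return f"location: {ann.location}; previous actions: {prior}"


# Made-up example values, not taken from the dataset.
example = ClipAnnotation(
    fine_label="cut vegetables",
    coarse_label="cook",
    location="kitchen",
    prior_actions=["open fridge", "take vegetables", "wash vegetables"],
)
print(context_text(example))
```

At inference time the prior actions would come either from ground-truth labels or from the model's own earlier predictions, which corresponds to the two evaluation settings reported above.
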

Enhancing Action Recognition by Leveraging the Hierarchical Structure of Actions and Textual Context
