Abstract: Inspired by recent advances in neural machine translation that jointly align and translate using encoder-decoder networks equipped with attention, we propose an attention-based LSTM model for human activity recognition. Our model jointly learns to classify actions and highlight frames associated with the action, by attending to salient visual information through a jointly learned soft-attention network. We explore attention informed by various forms of visual semantic features, including those encoding actions, objects and scenes. We qualitatively show that soft-attention can learn to effectively attend to important objects and scene information correlated with specific human actions. Further, we show that, quantitatively, our attention-based LSTM outperforms the vanilla LSTM and CNN models used by state-of-the-art methods. On a large-scale YouTube video dataset, ActivityNet, our model outperforms competing methods in action classification.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper aims to address the problem of Human Activity Recognition in videos. Specifically, the authors propose an LSTM model based on an attention mechanism, which can simultaneously classify actions in videos and highlight frames related to these actions. In this way, the model not only improves the accuracy of action classification but also better focuses on key information in the video during training and inference.

### Main Contributions

1. **Introduction of Temporal Attention Mechanism**: Unlike traditional LSTM models, this model uses a temporal attention mechanism, which allows for more flexible selection of important frames when processing long sequences. This enables the model to better capture the temporal dynamic features of actions.
2. **Multimodal Semantic Feature Fusion**: The model utilizes various visual semantic features (such as objects, scenes, and actions) and combines these features with the LSTM model through the attention mechanism, thereby improving the model's performance.
3. **Significant Performance Improvement**: Experimental results show that this model significantly outperforms existing methods on the large-scale ActivityNet dataset, particularly in the action classification task, with performance improvements of up to 20% or 8 percentage points.
4. **Model Interpretability**: Through the attention mechanism, the model can automatically generate weights for important frames, which not only improves the model's performance but also increases its interpretability, allowing researchers to better understand the model's working principles.

### Method Overview

1. **Input Data Encoding**:
   - **Input-data**: Use the C3D network to extract spatiotemporal features at the video level.
   - **Attended-data**: Use the VGG network to extract semantic features at the frame level (such as objects, scenes, and actions).
2. **Attention Mechanism**:
   - Generate attention weights for each frame through a feedforward neural network, reflecting the relevance of the frame to the current predicted action label.
   - Normalize these weights using the softmax function to obtain the final attention weights.
3. **LSTM Decoder**:
   - Use the LSTM decoder to sequentially update the hidden state and internal memory, and perform action classification based on the attention-weighted context vector.

### Experimental Results

- **Quantitative Results**: On the ActivityNet dataset, the model's average accuracy and mean Average Precision (mAP) are significantly higher than existing methods.
- **Qualitative Results**: By visualizing the attention weights, it can be seen that the model effectively focuses on frames related to specific actions, thereby validating the model's effectiveness and interpretability.

### Conclusion

By introducing a temporal attention mechanism and multimodal semantic features, this paper significantly improves the performance of Human Activity Recognition in videos and increases the model's interpretability. This approach provides new ideas and tools for future video understanding research.
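
The attention step described in the Method Overview (score each frame with a feedforward network, normalize with softmax, then form a weighted context vector) can be illustrated with a minimal sketch. This is not the authors' implementation. The additive scoring form `v . tanh(W_f x_t + W_h h)`, the function names, the dimensions, and the random weights are all assumptions made for illustration; the summary only states that a feedforward network produces per-frame scores that are then passed through softmax. The LSTM update itself is omitted.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def attention_scores(frames, hidden, W_f, W_h, v):
    """Score every frame against the current decoder hidden state.

    Assumed (hypothetical) form: score_t = v . tanh(W_f @ frame_t + W_h @ hidden)
    """
    h_proj = matvec(W_h, hidden)
    scores = []
    for frame in frames:
        f_proj = matvec(W_f, frame)
        activation = [math.tanh(a + b) for a, b in zip(f_proj, h_proj)]
        scores.append(sum(vi * ai for vi, ai in zip(v, activation)))
    return scores


def attend(frames, hidden, W_f, W_h, v):
    """Return (attention weights, attention-weighted context vector)."""
    weights = softmax(attention_scores(frames, hidden, W_f, W_h, v))
    dim = len(frames[0])
    context = [sum(w * f[i] for w, f in zip(weights, frames)) for i in range(dim)]
    return weights, context


# Toy setup: 5 frames with 4-d frame features, 3-d hidden state, 6-d attention layer.
random.seed(0)
NUM_FRAMES, FRAME_DIM, HIDDEN_DIM, ATT_DIM = 5, 4, 3, 6


def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


frames = rand_matrix(NUM_FRAMES, FRAME_DIM)       # stand-in for frame-level features
hidden = [random.uniform(-1, 1) for _ in range(HIDDEN_DIM)]  # stand-in for LSTM state
W_f = rand_matrix(ATT_DIM, FRAME_DIM)
W_h = rand_matrix(ATT_DIM, HIDDEN_DIM)
v = [random.uniform(-1, 1) for _ in range(ATT_DIM)]

weights, context = attend(frames, hidden, W_f, W_h, v)
print("attention weights:", [round(w, 3) for w in weights])
print("context vector:", [round(c, 3) for c in context])
```

In a trained model, the weights would be learned jointly with the classifier, and plotting `weights` over time is what yields the frame highlighting discussed in the qualitative results.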

Action Classification and Highlighting in Videos
