Action Classification and Highlighting in Videos

Atousa Torabi, Leonid Sigal
DOI: https://doi.org/10.48550/arXiv.1708.09522
2017-08-31
Abstract: Inspired by recent advances in neural machine translation that jointly align and translate using encoder-decoder networks equipped with attention, we propose an attention-based LSTM model for human activity recognition. Our model jointly learns to classify actions and highlight frames associated with the action, by attending to salient visual information through a jointly learned soft-attention network. We explore attention informed by various forms of visual semantic features, including those encoding actions, objects and scenes. We qualitatively show that soft-attention can learn to effectively attend to important objects and scene information correlated with specific human actions. Further, we show that, quantitatively, our attention-based LSTM outperforms the vanilla LSTM and CNN models used by state-of-the-art methods. On a large-scale YouTube video dataset, ActivityNet, our model outperforms competing methods in action classification.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper aims to address the problem of Human Activity Recognition in videos. Specifically, the authors propose an LSTM model based on an attention mechanism, which can simultaneously classify actions in videos and highlight frames related to these actions. In this way, the model not only improves the accuracy of action classification but also better focuses on key information in the video during training and inference.

### Main Contributions

1. **Introduction of Temporal Attention Mechanism**: Unlike traditional LSTM models, this model uses a temporal attention mechanism, which allows for more flexible selection of important frames when processing long sequences. This enables the model to better capture the temporal dynamic features of actions.
2. **Multimodal Semantic Feature Fusion**: The model utilizes various visual semantic features (such as objects, scenes, and actions) and combines these features with the LSTM model through the attention mechanism, thereby improving the model's performance.
3. **Significant Performance Improvement**: Experimental results show that this model significantly outperforms existing methods on the large-scale ActivityNet dataset, particularly in the action classification task, with performance improvements of up to 20% or 8 percentage points.
4. **Model Interpretability**: Through the attention mechanism, the model can automatically generate weights for important frames, which not only improves the model's performance but also increases its interpretability, allowing researchers to better understand the model's working principles.

### Method Overview

1. **Input Data Encoding**:
   - **Input-data**: Use the C3D network to extract spatiotemporal features at the video level.
   - **Attended-data**: Use the VGG network to extract semantic features at the frame level (such as objects, scenes, and actions).
2. **Attention Mechanism**:
   - Generate attention weights for each frame through a feedforward neural network, reflecting the relevance of the frame to the current predicted action label.
   - Normalize these weights using the softmax function to obtain the final attention weights.
3. **LSTM Decoder**:
   - Use the LSTM decoder to sequentially update the hidden state and internal memory, and perform action classification based on the attention-weighted context vector.

### Experimental Results

- **Quantitative Results**: On the ActivityNet dataset, the model's average accuracy and mean Average Precision (mAP) are significantly higher than existing methods.
- **Qualitative Results**: By visualizing the attention weights, it can be seen that the model effectively focuses on frames related to specific actions, thereby validating the model's effectiveness and interpretability.

### Conclusion

By introducing a temporal attention mechanism and multimodal semantic features, this paper significantly improves the performance of Human Activity Recognition in videos and increases the model's interpretability. This approach provides new ideas and tools for future video understanding research.
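
### Illustrative Sketch of the Attention Step

The attention step from the Method Overview (score each frame with a feedforward network, normalize with softmax, form a weighted context vector) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The additive `tanh` scoring form, the parameter names `W_a`, `U_a`, and `v`, and all dimensions are assumptions chosen for illustration. Random values stand in for learned weights and for the frame-level features, and the LSTM update itself is omitted.

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def temporal_attention(frame_feats, hidden, W_a, U_a, v):
    """Soft attention over T frames (hypothetical parameterization).

    frame_feats: (T, D) frame-level semantic features (the attended data)
    hidden:      (H,)   current LSTM hidden state
    W_a: (A, D), U_a: (A, H), v: (A,)  attention parameters (assumed shapes)
    """
    # One unnormalized relevance score per frame from a one-hidden-layer network.
    scores = np.tanh(frame_feats @ W_a.T + U_a @ hidden) @ v  # shape (T,)
    # Softmax turns the scores into weights that are positive and sum to 1.
    weights = softmax(scores)
    # Context vector: attention-weighted average of the frame features.
    context = weights @ frame_feats  # shape (D,)
    return weights, context

# Toy dimensions, assumed for illustration only.
rng = np.random.default_rng(0)
T, D, H, A = 6, 8, 5, 4
frame_feats = rng.standard_normal((T, D))
hidden = rng.standard_normal(H)
W_a = rng.standard_normal((A, D))
U_a = rng.standard_normal((A, H))
v = rng.standard_normal(A)

weights, context = temporal_attention(frame_feats, hidden, W_a, U_a, v)
print("attention weights per frame:", np.round(weights, 3))
print("context vector shape:", context.shape)
```

In a full model, `context` would feed into the LSTM decoder at each step, and the per-frame `weights` are what would be visualized to highlight the frames most associated with the predicted action.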