Action Anticipation in First-Person Videos with Self-Attention Based Multi-Modal Network

Jie Shao, Chen Mo
DOI: https://doi.org/10.1134/s1054661822020183
2022-01-01
Pattern Recognition and Image Analysis
Abstract: In this paper, we propose a self-attention based multi-modal LSTM framework for the challenging task of action anticipation in first-person videos. Our framework jointly considers three video features: RGB images for spatial information, optical flow fields for temporal information, and object-based features that identify which object the camera wearer interacts with. Unlike previous works that directly use the features produced by convolutional layers, we encode the multi-modal features with a self-attention mechanism, motivated by the similarity between text sequences and video sequences. A positional vector based on trigonometric functions is added to encode the position of each frame, so that the self-attention module can learn the order of the sequence. We use multi-modal LSTMs to encode the historical information of the video and generate predictions at different anticipation times. The proposed method is evaluated on two benchmark datasets; the results show that our framework outperforms state-of-the-art approaches on the evaluation metrics and addresses the problem of poor long-term prediction.
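The frame-encoding step described in the abstract can be illustrated with a minimal sketch. It assumes the standard sine/cosine positional encoding used in Transformer models and a scaled dot-product self-attention with no learned projections; the abstract does not give the paper's exact formulation, so the function names, the constant 10000, and the toy feature values are all hypothetical.

```python
import math


def sinusoidal_positions(num_frames, dim):
    """Return one sine/cosine position vector per frame (assumed Transformer-style form)."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    table = []
    for pos in range(num_frames):
        row = []
        for i in range(0, dim, 2):
            # Each pair of dimensions shares one frequency.
            angle = pos / (10000.0 ** (i / dim))
            row.extend([math.sin(angle), math.cos(angle)])
        table.append(row)
    return table


def self_attention(seq):
    """Scaled dot-product self-attention with identity projections (no learned weights)."""
    dim = len(seq[0])
    out = []
    for query in seq:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in seq]
        peak = max(scores)  # subtract the maximum for a numerically stable softmax
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Each output frame is a weighted average of all input frames.
        out.append([sum(w * v[j] for w, v in zip(weights, seq)) for j in range(dim)])
    return out


# Hypothetical toy input: 3 frames with 4-dimensional features.
frames = [[0.1 * (t + 1)] * 4 for t in range(3)]
positions = sinusoidal_positions(3, 4)
encoded = self_attention(
    [[f + p for f, p in zip(frame, pos)] for frame, pos in zip(frames, positions)]
)
```

Because the position vectors are added before attention, two frames with identical visual features still produce different attention scores, which is how the module can become sensitive to frame order. In a full model, each modality (RGB, optical flow, object features) would presumably pass through such an encoder before the LSTMs.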