Exploiting Objects with LSTMs for Video Categorization

Yongqing Sun, Zuxuan Wu, Xi Wang, Hiroyuki Arai, Tetsuya Kinebuchi, Yu-Gang Jiang
DOI: https://doi.org/10.1145/2964284.2967199
2016-01-01
Abstract: Temporal dynamics play an important role in video classification. In this paper, we propose to leverage high-level semantic features to open the "black box" of the state-of-the-art temporal model, Long Short-Term Memory (LSTM), with the aim of understanding what it learns. More specifically, we first extract object features from a state-of-the-art CNN model that is trained to recognize 20K objects. We then feed the extracted features to an LSTM to capture the temporal dynamics in videos. In combination with spatial and motion information, we achieve improvements for supervised video categorization. Furthermore, by masking the inputs, we demonstrate what the LSTM learns, namely (i) which objects are crucial for recognizing a class of interest, and (ii) how the LSTM model can assist the temporal localization of these detected objects.
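The input-masking idea in the abstract can be illustrated with a small sketch. The abstract does not specify the masking procedure, so everything here is an assumption made for illustration: per-frame object scores are a toy array with a handful of dimensions instead of 20K, the LSTM is a hand-written single-layer cell with random weights (hypothetical helper `make_random_params`), and "masking" means zeroing one object dimension across all frames and measuring how much the target class probability drops.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def make_random_params(num_objects, hidden, num_classes, seed=0):
    """Hypothetical helper: random weights standing in for a trained model."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0, 0.1, (4 * hidden, num_objects))  # input-to-gates
    U = rng.normal(0, 0.1, (4 * hidden, hidden))       # hidden-to-gates
    b = np.zeros(4 * hidden)
    V = rng.normal(0, 0.1, (num_classes, hidden))      # hidden-to-classes
    c = np.zeros(num_classes)
    return W, U, b, V, c

def lstm_classify(frames, params):
    """Run a single-layer LSTM over per-frame object scores of shape
    (num_frames, num_objects) and return class probabilities."""
    W, U, b, V, c = params
    H = U.shape[1]
    h = np.zeros(H)
    cell = np.zeros(H)
    for x in frames:
        z = W @ x + U @ h + b
        i = sigmoid(z[:H])           # input gate
        f = sigmoid(z[H:2 * H])      # forget gate
        o = sigmoid(z[2 * H:3 * H])  # output gate
        g = np.tanh(z[3 * H:])       # candidate cell update
        cell = f * cell + i * g
        h = o * np.tanh(cell)
    logits = V @ h + c
    e = np.exp(logits - logits.max())  # numerically stable softmax
    return e / e.sum()

def object_importance(frames, params, target_class):
    """Assumed masking scheme: zero one object dimension at a time and
    record the drop in the target class probability."""
    base = lstm_classify(frames, params)[target_class]
    drops = np.empty(frames.shape[1])
    for j in range(frames.shape[1]):
        masked = frames.copy()
        masked[:, j] = 0.0
        drops[j] = base - lstm_classify(masked, params)[target_class]
    return drops
```

Under these assumptions, a larger drop for an object dimension suggests that the object matters more for the class of interest. Applying the same masking to a window of frames rather than to a feature dimension would give a rough handle on temporal localization, though that variant is likewise a guess at the approach and not the paper's stated procedure.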