Watch, Think and Attend: End-to-End Video Classification via Dynamic Knowledge Evolution Modeling.

Junyu Gao, Tianzhu Zhang, Changsheng Xu
DOI: https://doi.org/10.1145/3240508.3240566
2018-01-01
Abstract: Video classification has been achieved by automatically mining the underlying concepts (e.g., actions, events) in videos, which plays an essential role in intelligent video analysis. However, most existing algorithms exploit only the visual cues of these concepts and ignore external knowledge for modeling their relationships as videos evolve. In fact, humans have a remarkable ability to use acquired knowledge to reason about the dynamically changing world. To narrow the knowledge gap between existing methods and humans, we propose an end-to-end video classification framework based on a structured knowledge graph, which can model the dynamic knowledge evolution in videos over time. Here, we map the concepts of videos to the nodes of the knowledge graph. To effectively leverage the knowledge graph, we adopt a graph convLSTM model to not only identify local knowledge structures in each video shot but also model dynamic patterns of knowledge evolution across these shots. Furthermore, a novel knowledge-based attention model is designed by considering the importance of each video shot and the relationships between concepts. We show that by using knowledge graphs, our framework is able to improve the performance of various existing methods. Extensive experimental results on two video classification benchmarks, UCF101 and YouTube-8M, demonstrate the favorable performance of the proposed framework.