Stateful Human-Centered Visual Captioning System to Aid Video Surveillance

Summra Saleem, Aniqa Dilawari, Usman Ghani Khan, Razi Iqbal, Shaohua Wan, Tariq Umer
DOI: https://doi.org/10.1016/j.compeleceng.2019.07.009
IF: 4.152
2019-01-01
Computers & Electrical Engineering
Abstract: The study of Natural Language Generation (NLG), especially how human beings narrate the world, assists in understanding the visual world for surveillance. Our research proposes an effective technique to automatically generate multi-line textual descriptions of visual data by exploiting deep Convolutional Neural Networks (CNNs). Such descriptions provide textual tags for visual information, so a human can retrieve selected videos from a repository based on those tags. Videos contain more complex and detailed information than images and provide more language data. The proposed feature-rich model encodes the visual content into visual and facial features using a CNN architecture. The encoded features are passed to two-layer LSTM units with an attention mechanism, which reduces the number of parameters by attending to relevant details. Experimental results on the TRECVID 2016 and UET-Surveillance datasets show that the model outperforms state-of-the-art methods, achieving BLEU scores of 0.35 and 0.52, respectively. (C) 2019 Elsevier Ltd. All rights reserved.
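
The abstract does not specify how the attention mechanism is formulated, so the following is only a minimal sketch of generic dot-product soft attention, not the authors' implementation. The function name `soft_attention`, the toy `frame_features`, and the `decoder_state` vector are all hypothetical and chosen for illustration: each encoded feature vector is scored against the current decoder state, the scores are normalized with a softmax, and the weighted sum forms the context passed to the decoder.

```python
import math


def soft_attention(query, features):
    """Generic dot-product soft attention (illustrative assumption,
    not the formulation used in the paper).

    query:    decoder state, a list of floats
    features: list of encoded feature vectors, each the same length as query
    Returns (weights, context).
    """
    # Relevance score of each feature vector with respect to the query.
    scores = [sum(q * f for q, f in zip(query, feat)) for feat in features]

    # Numerically stable softmax over the scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Context vector: attention-weighted sum of the feature vectors.
    dim = len(features[0])
    context = [
        sum(w * feat[d] for w, feat in zip(weights, features))
        for d in range(dim)
    ]
    return weights, context


# Hypothetical toy inputs: three 2-dimensional "frame" features.
frame_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]

weights, context = soft_attention(decoder_state, frame_features)
print(weights)
print(context)
```

In this toy example the first and third features align equally well with the decoder state and therefore receive equal, larger weights than the second.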