Collaborative Detection and Caption Network

Tianyi Wang, Jiang Zhang, Zheng-Jun Zha
DOI: https://doi.org/10.1007/978-3-030-00776-8_10
2018-01-01
Abstract: Recent work has shown that deep recurrent neural networks can be used to train video captioning systems. However, existing approaches are often confounded by ambiguity among videos, which leads to captions that are grammatically correct but less relevant. In this paper, we propose an effective end-to-end network, called the Collaborative Detection and Caption network, which uses a video caption network as its video-to-sentence sub-network and a principal syntactic components detector as its video-to-words sub-network. The detector and the caption network warp spatially correlated attributes with a temporal attention model and are optimized jointly, so that each can facilitate the other. Experiments on the YouTube2Text, MPII Movie Description, and MVAD datasets consistently show that the proposed network can generate the crucial content needed to describe videos and thus improves caption and detection performance simultaneously. The metric scores reported on these benchmarks also outperform state-of-the-art methods.