A Hierarchical Multimodal Attention-based Neural Network for Image Captioning

Yong Cheng, Fei Huang, Lian Zhou, Cheng Jin, Yuejie Zhang, Tao Zhang
DOI: https://doi.org/10.1145/3077136.3080671
2017-01-01
Abstract: A novel hierarchical multimodal attention-based model is developed in this paper to generate more accurate and descriptive captions for images. Our model is an "end-to-end" neural network that contains three related sub-networks: a deep convolutional neural network to encode image contents, a recurrent neural network to identify the objects in images sequentially, and a multimodal attention-based recurrent neural network to generate image captions. The main contribution of our work is that the hierarchical structure and the multimodal attention mechanism are both applied, so that each caption word can be generated with multimodal attention over the intermediate semantic objects and the global visual content. Our experiments on two benchmark datasets have obtained very positive results.
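The idea of attending jointly over intermediate object representations and the global visual content can be illustrated with a minimal sketch of a single decoding step. The abstract does not specify the scoring function, so this sketch assumes additive (tanh-based) attention; the function name `multimodal_attention_step`, the parameter names `W_h`, `W_c`, `v`, and all dimensions are hypothetical and are not taken from the paper.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def multimodal_attention_step(h, objects, global_feat, W_h, W_c, v):
    """One decoding step of an assumed additive attention.

    h           : (m,)   current decoder hidden state
    objects     : (k, d) intermediate object feature vectors
    global_feat : (d,)   global visual feature from the CNN encoder
    W_h, W_c, v : hypothetical projection parameters, shapes (m, a), (d, a), (a,)
    """
    # Attention candidates: every object vector plus the global image vector.
    candidates = np.vstack([objects, global_feat[None, :]])   # (k + 1, d)
    # Score each candidate against the decoder state (assumed additive form).
    scores = np.tanh(candidates @ W_c + h @ W_h) @ v          # (k + 1,)
    weights = softmax(scores)                                  # sums to 1
    # Context vector: attention-weighted mix of objects and global content.
    context = weights @ candidates                             # (d,)
    return context, weights

# Usage with random placeholder values (not real image features).
rng = np.random.default_rng(0)
k, d, m, a = 3, 8, 6, 5
context, weights = multimodal_attention_step(
    rng.normal(size=m), rng.normal(size=(k, d)), rng.normal(size=d),
    rng.normal(size=(m, a)), rng.normal(size=(d, a)), rng.normal(size=a),
)
```

In a full captioning model, the returned context vector would be fed to the caption-generating recurrent network together with the previous word at each step; that wiring is omitted here.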