Image-To-Tree: A Tree-Structured Decoder For Image Captioning

Zhiming Ma, Chun Yuan, Yangyang Cheng, Xinrui Zhu
DOI: https://doi.org/10.1109/ICME.2019.00225
2019-01-01
Abstract: Automatically generating natural language descriptions of images is a fundamental problem in artificial intelligence that connects computer vision and natural language processing. In recent years, image captioning has seen tremendous success under the encoder-decoder framework, in which decoders are often chain-structured Recurrent Neural Networks (RNNs) that treat sentences as sequences. However, natural sentences are not inherently linear structures but hierarchical ones. In this paper, we propose, for the first time, a model with a tree-structured decoder for image captioning (Image-to-Tree), which does not generate sentences directly but instead explicitly generates their dependency trees in a top-down manner. Inspired by the success of attention mechanisms in image captioning, we also propose a corresponding attention-based model for Image-to-Tree. Experiments on the MSCOCO dataset demonstrate that our model achieves results comparable to those of chain-structured models on different language metrics.
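To make the difference between a chain and a tree output concrete, the following is a minimal sketch of a dependency tree for a caption, the order in which a top-down decoder might visit its words, and how the surface sentence can be read back from the tree. The abstract does not specify the tree representation or traversal order, so the `Node` class, the left/right split of dependents, the depth-first order, and the example caption are all illustrative assumptions rather than the authors' method. The sketch also assumes a projective tree, in which each subtree covers a contiguous span of the sentence.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One word in a dependency tree (hypothetical representation)."""
    word: str
    left: List["Node"] = field(default_factory=list)   # dependents that precede the word
    right: List["Node"] = field(default_factory=list)  # dependents that follow the word


def top_down_order(root: Node) -> List[str]:
    """Words in head-first, depth-first order: each head is emitted before its dependents."""
    order = [root.word]
    for child in root.left + root.right:
        order.extend(top_down_order(child))
    return order


def linearize(root: Node) -> List[str]:
    """Recover the surface sentence: left dependents, then the head, then right dependents."""
    words: List[str] = []
    for child in root.left:
        words.extend(linearize(child))
    words.append(root.word)
    for child in root.right:
        words.extend(linearize(child))
    return words


# Hypothetical caption "a dog chases a ball", rooted at its main verb.
tree = Node(
    "chases",
    left=[Node("dog", left=[Node("a")])],
    right=[Node("ball", left=[Node("a")])],
)

generation_order = top_down_order(tree)  # order in which a top-down decoder would emit words
caption = " ".join(linearize(tree))      # sentence read back from the finished tree
```

In this sketch the decoder would produce the head word "chases" first and the determiners last, whereas a chain-structured decoder would produce the words strictly left to right.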