S2TD - A Tree-Structured Decoder for Image Paragraph Captioning.

Yihui Shi, Yun Liu, Fangxiang Feng, Ruifan Li, Zhanyu Ma, Xiaojie Wang
DOI: https://doi.org/10.1145/3469877.3490585
2021-01-01
Abstract: Image paragraph captioning, the task of generating a paragraph description for a given image, usually requires mining and organizing linguistic counterparts from abundant visual clues. Limited by a sequential decoding perspective, previous methods have difficulty organizing the visual clues holistically or capturing the structural nature of linguistic descriptions. In this paper, we propose a novel tree-structured visual paragraph decoder network, called Splitting to Tree Decoder (S2TD), to address this problem. The key idea is to model the paragraph decoding process as a top-down binary tree expansion. S2TD consists of three modules: a split module, a score module, and a word-level RNN. The split module iteratively splits ancestral visual representations into two parts through a gating mechanism. To determine the tree topology, the score module uses cosine similarity to evaluate node splitting. A novel tree structure loss is proposed to enable end-to-end learning. After the tree expansion, the word-level RNN decodes the leaf nodes into sentences that form a coherent paragraph. Extensive experiments are conducted on the Stanford benchmark dataset. The experimental results show promising performance of our proposed S2TD.
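The top-down expansion described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the gate parameterization, the exact splitting rule, or the loss, so everything below is an assumption made for illustration. In particular, the function names (`gated_split`, `expand_tree`, `toy_gate_fn`), the complementary gate `g` / `1 - g`, the stopping rule "keep the node as a leaf when its two children are nearly identical", and the values of `sim_threshold` and `max_depth` are hypothetical. The word-level RNN and the tree structure loss are omitted.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def gated_split(parent, gate_logits):
    """Split a parent vector into two children with an elementwise gate.

    Assumption: the children are g * parent and (1 - g) * parent, so they
    sum back to the parent. The paper's actual gate may differ.
    """
    gates = [sigmoid(z) for z in gate_logits]
    left = [g * p for g, p in zip(gates, parent)]
    right = [(1.0 - g) * p for g, p in zip(gates, parent)]
    return left, right


def expand_tree(node, gate_fn, max_depth, sim_threshold=0.9, depth=0):
    """Expand a binary tree top-down and return the leaf vectors, left to right.

    Assumption: a split is rejected (the node stays a leaf) when the two
    children have cosine similarity >= sim_threshold, or when max_depth
    is reached.
    """
    if depth >= max_depth:
        return [node]
    left, right = gated_split(node, gate_fn(node, depth))
    if cosine_similarity(left, right) >= sim_threshold:
        return [node]
    return (expand_tree(left, gate_fn, max_depth, sim_threshold, depth + 1)
            + expand_tree(right, gate_fn, max_depth, sim_threshold, depth + 1))


def toy_gate_fn(node, depth):
    """Hypothetical stand-in for a learned gate network.

    At the root it routes the first half of the features left and the
    second half right; deeper down it returns zero logits (gate = 0.5),
    which produces identical children and therefore stops the expansion.
    """
    half = len(node) // 2
    if depth == 0:
        return [5.0] * half + [-5.0] * (len(node) - half)
    return [0.0] * len(node)


root = [1.0, 2.0, 3.0, 4.0]  # stand-in for a pooled visual representation
leaves = expand_tree(root, toy_gate_fn, max_depth=3)
```

In a full model, each vector in `leaves` would be passed to a sentence decoder, one sentence per leaf, and the gate logits would come from learned parameters rather than a fixed rule.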