Triangle-Reward Reinforcement Learning

Weizhi Nie, Jiesi Li, Ning Xu, An-An Liu, Xuanya Li, Yongdong Zhang
DOI: https://doi.org/10.1145/3474085.3475604
2021-01-01
Abstract: Image captioning aims to generate a sentence, a sequence of linguistic words, that describes the visual units (i.e., objects, relationships, and attributes) in a given image. Most existing methods rely on supervised learning with the cross-entropy (XE) loss to translate visual units into a sequence of words. However, we argue that the XE objective is not sensitive to visual-linguistic alignment, so it can neither penalize semantic inconsistency discriminately nor shrink the context gap. To solve these problems, we propose the Triangle-Reward Reinforcement Learning (TRRL) method. TRRL uses the scene graph (G), with objects as nodes and relationships as edges, to represent images, generated sentences, and ground-truth sentences individually, and mutually aligns them during training. Specifically, TRRL formulates image captioning as a task for two cooperative agents: the first agent extracts the visual scene graph (Gimg) from the image (I), and the second agent translates this graph into a sentence (S). To discriminately penalize visual-linguistic inconsistency, TRRL introduces a novel triangle-reward function: 1) the generated sentence and its corresponding ground truth are decomposed into the linguistic scene graph (Gsen) and the ground-truth scene graph (Ggt), respectively; 2) Gimg, Gsen, and Ggt are paired to calculate semantic similarity scores, which are proportionally assigned as rewards to each agent. Meanwhile, to make the training objective sensitive to context changes, we propose node-level and triplet-level scoring methods that jointly measure the visual-linguistic graph correlations. Extensive experiments on the MSCOCO dataset demonstrate the superiority of TRRL, and additional ablation studies further validate its effectiveness.
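The pairing of Gimg, Gsen, and Ggt described in the abstract can be illustrated with a small sketch. The abstract does not give the actual scoring formulas or the proportions used to split the reward, so everything below is an assumption made for illustration: the `SceneGraph` container, the use of Jaccard overlap for both node-level and triplet-level scores, the `node_weight` mixing parameter, and the equal averaging in `triangle_rewards` are hypothetical and are not the authors' method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneGraph:
    """Hypothetical container: object labels plus (subject, relation, object) triplets."""
    nodes: frozenset
    triplets: frozenset


def jaccard(a, b):
    """Set overlap in [0, 1]; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def graph_similarity(g1, g2, node_weight=0.5):
    """Mix a node-level and a triplet-level score (assumed: weighted Jaccard)."""
    node_score = jaccard(g1.nodes, g2.nodes)
    triplet_score = jaccard(g1.triplets, g2.triplets)
    return node_weight * node_score + (1.0 - node_weight) * triplet_score


def triangle_rewards(g_img, g_sen, g_gt):
    """Score the three graph pairs and split them between the two agents.

    Assumed split: each agent receives the mean of the two triangle edges
    that touch the graph it is responsible for.
    """
    s_img_gt = graph_similarity(g_img, g_gt)
    s_sen_gt = graph_similarity(g_sen, g_gt)
    s_img_sen = graph_similarity(g_img, g_sen)
    extractor_reward = 0.5 * (s_img_gt + s_img_sen)   # agent 1: image -> Gimg
    captioner_reward = 0.5 * (s_sen_gt + s_img_sen)   # agent 2: Gimg -> sentence
    return extractor_reward, captioner_reward


# Toy graphs: the generated sentence omits the field and its relation.
full_triplets = frozenset({("man", "riding", "horse"), ("horse", "in", "field")})
g_img = SceneGraph(frozenset({"man", "horse", "field"}), full_triplets)
g_gt = SceneGraph(frozenset({"man", "horse", "field"}), full_triplets)
g_sen = SceneGraph(frozenset({"man", "horse"}),
                   frozenset({("man", "riding", "horse")}))

extractor_reward, captioner_reward = triangle_rewards(g_img, g_sen, g_gt)
print(extractor_reward, captioner_reward)
```

In this toy case the extracted graph matches the ground truth exactly, while the generated sentence drops one node and one triplet, so the captioning agent receives the lower reward. A per-token cross-entropy loss would not single out that missing relationship in the same way, which is the sensitivity the abstract argues for.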