Emotion Aware Reinforcement Network for Visual Storytelling

Xin Li,Hanqing Cai,Tianling Jiang,Chunping Liu,Yi Ji
DOI: https://doi.org/10.1007/978-3-031-15931-2_3
2022-01-01
Abstract:Visual storytelling is the task of generating a sequence of human-like sentences (i.e. story) for an ordered stream of images. Unlike traditional image captioning, the story contains not only factual descriptions but also concepts and objects that do not explicitly appear in the input images. Recent works utilize either end-to-end or multi-stage frameworks to produce more relevant and coherent stories but usually ignore latent emotional information. In this work, to generate an affective story, we propose an Emotion Aware Reinforcement Network for VIsual StoryTelling (EARN-VIST). Specifically in our network, lexicon-based attention is leveraged to encourage the model to pay more attention to the emotional words. Then we apply two emotional consistency reinforcement learning rewards using an emotion classifier and commonsense transformer respectively to find the gap between generated story and human-labeled story so as to refine the generation process. Experimental results on the VIST dataset and human evaluation demonstrate that our model outperforms most of the cutting-edge models across multiple evaluation metrics.
What problem does this paper attempt to address?