Glocal Cascading Network for Topic Enhanced Visual Storytelling

Jiaqi Su, Weiran Chen, Yi Ji, Chunping Liu
DOI: https://doi.org/10.1109/icassp48485.2024.10447361
2024-01-01
Abstract: As a cross-modal task, visual storytelling aims to generate a semantically coherent story for an ordered image sequence. Despite the significant achievements of existing methods on this task, few works focus on improving the conception ability that humans usually rely on when writing stories. In this work, we propose a framework called GLocal Cascading Network for Topic Enhanced Visual Storytelling, which explores the conception ability by pre-modeling a latent topic for each image during storytelling. Inspired by the global-local (glocal) ideology, we first propose a hierarchical latent-topic decoder consisting of two levels of topic generators, each focusing on a different level of topic information. We then propose a topic-aware loss that encourages the model to focus on the topic information of the story. With these two novel modules, our framework can effectively utilize topic information and improve the informativeness and consistency of stories. Extensive experiments on the VIST dataset show that our model is highly competitive across multiple metrics.