Abstract: Visual narrating focuses on generating semantic descriptions that summarize the visual content of images or videos, e.g., visual captioning and visual storytelling. The main challenge lies in designing a decoder that generates accurate descriptions matching the visual content. Recent approaches often employ a recurrent neural network (RNN), e.g., Long Short-Term Memory (LSTM), as the decoder. However, an RNN is prone to diluting long-term information, which weakens its ability to capture long-term dependencies. Recent work has demonstrated that a memory network (MemNet) has the advantage of storing long-term information. However, it has not been well exploited as a decoder for visual narrating, partly because of the difficulty of multi-modal sequential decoding with MemNet. In this article, we devise a novel memory decoder for visual narrating. Concretely, to obtain a better multi-modal representation, we first design a new multi-modal fusion method to fully merge visual and lexical information. Then, based on the fusion result, we construct a MemNet-based decoder consisting of multiple memory layers. In each layer, we employ a memory set to store previous decoding information and utilize an attention mechanism to adaptively select the information related to the current output. We also employ another memory set to store the decoding output of each memory layer at the current time step and again utilize an attention mechanism to select the related information. Thus, this decoder alleviates the dilution of long-term information. Moreover, the hierarchical architecture leverages the latent information of each layer, which helps generate accurate descriptions. Experimental results on two visual narrating tasks, i.e., video captioning and visual storytelling, show that our decoder obtains superior results and outperforms conventional RNN-based decoders.
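To make the two attention steps in the abstract concrete, the following is a minimal sketch of one decoding step, not the authors' implementation. Everything in it is an assumption made for illustration: the names `attend` and `decode_step`, the scaled dot-product scoring, the `tanh` layer update, and the choice of the top layer's output as the query over layer outputs. The multi-modal fusion is not modeled; `fused_input` simply stands in for its result.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(memory, query):
    """Scaled dot-product attention over a memory set (assumed scoring rule).

    memory: list of vectors (the stored slots); query: one vector.
    Returns the attention-weighted sum of the slots and the weights.
    """
    scale = math.sqrt(len(query))
    scores = [sum(m * q for m, q in zip(slot, query)) / scale for slot in memory]
    weights = softmax(scores)
    context = [
        sum(w * slot[j] for w, slot in zip(weights, memory))
        for j in range(len(query))
    ]
    return context, weights


def decode_step(fused_input, layer_memories, layer_weights):
    """One time step through a stack of memory layers (hypothetical design).

    layer_memories[k] holds layer k's outputs from earlier time steps.
    layer_weights[k] is a d x 2d matrix mixing the layer input with its context.
    """
    h = list(fused_input)
    layer_outputs = []
    for memory, W in zip(layer_memories, layer_weights):
        # Attend over what this layer stored at previous time steps.
        if memory:
            context, _ = attend(memory, h)
        else:
            context = [0.0] * len(h)  # nothing stored yet at the first step
        x = h + context  # concatenate layer input and attended context
        h = [math.tanh(sum(w * v for w, v in zip(row, x))) for row in W]
        memory.append(h)  # store for future time steps
        layer_outputs.append(h)
    # Second memory set: the outputs of all layers at the current time step.
    return attend(layer_outputs, h)


# Usage: two layers, hidden size 4, three time steps with random inputs.
random.seed(0)
d, n_layers = 4, 2
weights = [
    [[random.uniform(-0.5, 0.5) for _ in range(2 * d)] for _ in range(d)]
    for _ in range(n_layers)
]
memories = [[] for _ in range(n_layers)]
for _ in range(3):
    fused = [random.uniform(-1.0, 1.0) for _ in range(d)]
    output, layer_attention = decode_step(fused, memories, weights)
```

Because each layer's memory keeps every earlier output as a separate slot, the attention step can reach back to any earlier time step directly instead of relying on a single recurrent state, which is the intuition behind the claim that this design alleviates the dilution of long-term information.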

AOG-LSTM: An Adaptive Attention Neural Network for Visual Storytelling

TARN-VIST: Topic Aware Reinforcement Network for Visual Storytelling

Hierarchical LSTMs with Adaptive Attention for Visual Captioning

Emotion Reinforced Visual Storytelling

Video Captioning With Attention-Based LSTM and Semantic Consistency

Hierarchical LSTM with Adjusted Temporal Attention for Video Captioning

TemporalStory: Enhancing Consistency in Story Visualization Using Spatial-Temporal Attention

Hierarchical Memory Decoder for Visual Narrating

CC-LSTM: Cross and Conditional Long-Short Time Memory for Video Captioning

Neural Storyboard Artist: Visualizing Stories with Coherent Image Sequences

Story-Adapter: A Training-free Iterative Framework for Long Story Visualization

Hierarchically-Attentive RNN for Album Summarization and Storytelling

Contextualize, Show and Tell: A Neural Visual Storyteller

Latent Memory-Augmented Graph Transformer for Visual Storytelling

Adaptively Aligned Image Captioning via Adaptive Attention Time

Semantic Representation and Attention Alignment for Graph Information Bottleneck in Video Summarization

Improving Visual Storytelling with Multimodal Large Language Models

Artificial-Spiking Hierarchical Networks for Vision-Language Representation Learning

StoryGPT-V: Large Language Models as Consistent Story Visualizers

Storytelling from an Image Stream Using Scene Graphs

Topic Adaptation and Prototype Encoding for Few-Shot Visual Storytelling