Abstract: Image captioning is a challenging task that generates a natural language description from a visual understanding of a given image. Representing images by their significant regions was a milestone in image captioning. Despite the great success of existing region-based works, they focus only on salient objects and encode these objects independently, and so still lack global contextual information and visual relationships. In fact, global contextual information and structured visual relationships are exactly the merits of traditional grid features and emerging scene graph features, respectively. In this paper, we present a Triple-Steam Feature Fusion Network (TSFNet) to leverage the complementary advantages of the grid, region, and scene graph triple-steam visual representations in image captioning. Concretely, in TSFNet, a novel Dual-level Attention (DA) mechanism is proposed to explore, simultaneously and in a uniform way, the intrinsic visual properties and the word-related attributes of the different features. The attention-enhanced features of the different modalities are then mapped into a joint representation to guide caption generation. Moreover, we design a new global-aware decoder, which leverages the concatenated representation of the triple-steam features and the joint attention representation to obtain global visual guidance information and further refine the complex multimodal reasoning. To verify the effectiveness of our feature fusion model, we perform extensive experiments on the highly competitive MSCOCO dataset, evaluating the model both quantitatively and qualitatively. The results show that the proposed framework outperforms many state-of-the-art image captioning approaches on various evaluation metrics and generates more accurate and richer captions.

FFGS: Feature Fusion with Gating Structure for Image Caption Generation

3G Structure for Image Caption Generation

Feature Fusion Based on Neural Image Captioning with Spatial Attention

Fine-Grained Features for Image Captioning

Image Caption Generation Using Contextual Information Fusion with Bi-LSTM-s

Llafn-Generator: Learnable Linear-Attention with Fast-Normalization for Large-Scale Image Captioning

Controllable Video Captioning with POS Sequence Guidance Based on Gated Fusion Network

Video Captioning with Aggregated Features Based on Dual Graphs and Gated Fusion

A Parallel-Fusion RNN-LSTM Architecture for Image Caption Generation

ON-AFN: Generating Image Caption Based on the Fusion of Residual Attention and Ordered Memory Module

Parallel-fusion LSTM with synchronous semantic and visual information for image captioning

Cascade Attention Fusion for Fine-Grained Image Captioning Based on Multi-Layer LSTM

Attention-gated LSTM for Image Captioning

TSFNet: Triple-Steam Image Captioning

Image Captioning with Local-Global Visual Interaction Network

GateCap: Gated Spatial and Semantic Attention Model for Image Captioning

Multi-modal gated recurrent units for image description

Layer-wise enhanced transformer with multi-modal fusion for image caption

Gated Object-Attribute Matching Network for Detailed Image Caption

Local-global Visual Interaction Attention for Image Captioning

Multi-Gate Attention Network for Image Captioning