Abstract: Semantic attention has been shown to be effective in improving the performance of image captioning. The core of semantic attention based methods is to drive the model to attend to semantically important words, or attributes. In previous works, the attribute detector and the captioning network are usually independent, leading to insufficient use of the semantic information. Also, all the detected attributes, whether or not they are appropriate for the linguistic context at the current step, are attended to throughout the whole caption generation process. This may sometimes mislead the captioning model into attending to incorrect visual concepts. To solve these problems, we introduce two end-to-end trainable modules that closely couple attribute detection with image captioning and promote the effective use of attributes by predicting appropriate attributes at each time step. The multimodal attribute detector (MAD) module improves attribute detection accuracy by using not only the image features but also the word embeddings of attributes, which already exist in most captioning models. MAD models the similarity between the semantics of attributes and the image object features to facilitate accurate detection. The subsequent attribute predictor (SAP) module dynamically predicts a concise attribute subset at each time step to mitigate the diversity of image attributes. Compared to previous attribute based methods, our approach enhances the explainability of how the attributes affect the generated words and achieves a state-of-the-art single model performance of 128.8 CIDEr-D on the MSCOCO dataset. Extensive experiments on the MSCOCO dataset show that our approach improves performance in both image captioning and attribute detection simultaneously. The code is available at: https://github.com/RubickH/Image-Captioning-with-MAD-and-SAP.
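The two ideas in the abstract, scoring attributes by their similarity to image object features and then keeping only a small context-appropriate subset at each step, can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`detect_attributes`, `select_for_step`), the cosine similarity, the max over regions, the `scale` factor, and the toy vectors are all assumptions made for illustration, and the attribute embeddings and region features are assumed to already share one vector space.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def detect_attributes(attr_embeddings, region_features, scale=5.0):
    """Hypothetical detector: score each attribute by its best-matching region."""
    probs = {}
    for name, emb in attr_embeddings.items():
        best = max(cosine(emb, region) for region in region_features)
        probs[name] = sigmoid(scale * best)  # scale is an assumed temperature
    return probs

def select_for_step(probs, attr_embeddings, context, k=2):
    """Hypothetical per-step selection: weight detection scores by context fit."""
    scored = {
        name: probs[name] * sigmoid(cosine(attr_embeddings[name], context))
        for name in probs
    }
    return sorted(scored, key=scored.get, reverse=True)[:k]

# Toy data (assumed): three attribute embeddings and two object region features.
attr_embeddings = {
    "dog": [1.0, 0.0, 0.0],
    "grass": [0.0, 1.0, 0.0],
    "car": [0.0, 0.0, 1.0],
}
region_features = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.0]]

probs = detect_attributes(attr_embeddings, region_features)
# The context vector stands in for the decoder's linguistic state at one time step.
step_attrs = select_for_step(probs, attr_embeddings, context=[1.0, 0.2, 0.0], k=2)
print(step_attrs)
```

In this toy setting, "car" matches no region and receives a low detection score, so it is excluded from the subset chosen for the step. In the described approach both modules are trained end to end with the captioning network rather than computed from fixed similarities.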

Attribute-driven Image Captioning Via Soft-Switch Pointer.

Image Captioning with End-to-End Attribute Detection and Subsequent Attributes Prediction

Show, Observe and Tell: Attribute-driven Attention Model for Image Captioning.

Image Captioning with Visual-Semantic Double Attention

Combining Object-Based Attention And Attributes For Image Captioning

Adaptively Attending to Visual Attributes and Linguistic Knowledge for Captioning

Show, Conceive and Tell: Image Captioning with Prospective Linguistic Information

Improving Image Captioning through Visual and Semantic Mutual Promotion

Stimulus-driven and Concept-Driven Analysis for Image Caption Generation

Image Captioning with an Intermediate Attributes Layer.

Region-Aware Image Captioning Via Interaction Learning

Recurrent Image Captioner: Describing Images with Spatial-Invariant Transformation and Attention Filtering

A Cooperative Approach Based on Self-Attention with Interactive Attribute for Image Caption

Positional Self-attention Based Hierarchical Image Captioning.

Remote Sensing Image Captioning Based on Multi-Level Feature Extraction and Adaptive Attention

Aligning Where to See and What to Tell: Image Caption with Region-Based Attention and Scene Factorization

Image Caption Generation with High-Level Image Features

Diverse and Controllable Image Captioning with Part-of-Speech Guidance.

Reference Based On Adaptive Attention Mechanism For Image Captioning

Attribute-Driven Filtering: A New Attributes Predicting Approach for Fine-Grained Image Captioning

Improving Image Captioning with Better Use of Caption