Abstract: The attention mechanism has been established as an effective method for generating caption words in image captioning; at each step it attends to one subregion of an image to predict the related caption word. However, even when the attention mechanism supplies accurate subregions during training, the learned captioner may still predict the wrong word, especially for visual concept words, which are the most important for understanding an image. To tackle this problem, in this article we propose Visual Concept Enhanced Captioner, which employs a joint attention mechanism with visual concept samples to strengthen the prediction of visual concepts in image captioning. Unlike traditional attention approaches, which adopt one LSTM to explore one attended subregion at a time, Visual Concept Enhanced Captioner introduces multiple virtual LSTMs in parallel to simultaneously receive multiple subregions from visual concept samples. The model then updates its parameters by jointly exploring these subregions according to a composite loss function. Technically, this joint learning helps find the common characteristics of a visual concept and thus improves the prediction accuracy for visual concepts. Moreover, by integrating diverse visual concept samples from different domains, our model can be extended to bridge visual bias in cross-domain learning for image captioning, which saves the cost of labeling captions. Extensive experiments have been conducted on two image datasets (MSCOCO and Flickr30K), and superior results are reported in comparison with state-of-the-art approaches. Notably, our approach significantly increases BLEU-1 and F1 scores, which demonstrates an accuracy improvement for visual concepts in image captioning.
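
The joint learning idea described in the abstract can be illustrated with a toy sketch. The abstract does not give the form of the composite loss, so everything below is an assumption made for illustration: each "virtual LSTM" is replaced by a single soft-attention step that shares the same parameters (`query`, `word_vectors`), and the composite loss is taken to be the caption loss plus a weighted mean of the losses on the visual concept samples. The function names and the `weight` parameter are hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(regions, query):
    """Soft attention: weight each region vector by its dot product with the query."""
    scores = [sum(r_d * q_d for r_d, q_d in zip(r, query)) for r in regions]
    weights = softmax(scores)
    dim = len(regions[0])
    return [sum(w * r[d] for w, r in zip(weights, regions)) for d in range(dim)]


def word_loss(context, word_vectors, target):
    """Cross-entropy of the target word index given the attended context."""
    logits = [sum(c * v for c, v in zip(context, vec)) for vec in word_vectors]
    return -math.log(softmax(logits)[target])


def composite_loss(caption_regions, concept_samples, query, word_vectors,
                   target, weight=0.5):
    """Assumed form: caption loss + weight * mean loss over concept samples.

    The same query and word_vectors are reused for every sample, standing in
    for the shared parameters of the parallel virtual LSTMs.
    """
    caption_term = word_loss(attend(caption_regions, query), word_vectors, target)
    sample_terms = [word_loss(attend(sample, query), word_vectors, target)
                    for sample in concept_samples]
    return caption_term + weight * sum(sample_terms) / len(sample_terms)


# Toy data (hypothetical): a 2-word vocabulary and 2-dimensional region features.
word_vectors = [[1.0, 0.0], [0.0, 1.0]]          # index 0 = "dog", index 1 = "cat"
query = [1.0, 0.0]
caption_regions = [[2.0, 0.0], [0.0, 1.0]]       # regions of the captioned image
concept_samples = [                              # extra images showing the same concept
    [[1.5, 0.2], [0.1, 0.3]],
    [[2.5, 0.0], [0.0, 0.5]],
]
loss = composite_loss(caption_regions, concept_samples, query, word_vectors, target=0)
print(loss)
```

Because every sample term is computed with the same parameters, a gradient step on this loss would push those parameters toward features that the concept samples have in common, which is the intuition the abstract gives for the improved accuracy on visual concept words.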

Image Caption with Endogenous–Exogenous Attention

Visual Attention Based on Long-Short Term Memory Model for Image Caption Generation

Image Captioning Based on Sentence-Level and Word-Level Attention

An Image Captioning Algorithm Based on Combination Attention Mechanism

Hybrid Attention Network for Image Captioning

Stimulus-driven and Concept-Driven Analysis for Image Caption Generation

An Image Captioning Approach Using Dynamical Attention

Looking Deeper and Transferring Attention for Image Captioning

Topic-Guided Attention for Image Captioning

Image captioning with weakly-supervised attention penalty

Instance-Aware Remote Sensing Image Captioning with Cross-Hierarchy Attention

Spatial-Temporal Attention for Image Captioning

Improve Image Captioning By Self-Attention

Learning visual relationship and context-aware attention for image captioning

A Multi-Level Attention Model For Remote Sensing Image Captions

A Hierarchical Multimodal Attention-based Neural Network for Image Captioning

Image Caption Generation with High-Level Image Features

Combining Object-Based Attention And Attributes For Image Captioning

Time-Dependent Pre-Attention Model For Image Captioning

Image Captioning with a Joint Attention Mechanism by Visual Concept Samples