Abstract: Image captioning, which automatically describes an image in natural language, is regarded as a fundamental challenge in computer vision. In recent years, significant advances have been made in image captioning by improving attention mechanisms. However, most existing methods construct attention mechanisms based on a single type of visual feature, such as patch features or object features, which limits the accuracy of the generated captions. In this article, we propose a Bidirectional Co-Attention Network (BCAN) that combines multiple visual features to provide information from different aspects. Different features are associated with predicting different words, and there are a priori relations between these visual features. Based on this, we further propose a bottom-up and top-down bi-directional co-attention mechanism to extract discriminative attention information. Furthermore, most existing methods do not exploit an effective multimodal integration strategy and generally use addition or concatenation to combine features. To solve this problem, we adopt the Multivariate Residual Module (MRM) to integrate multimodal attention features. We further propose a Vertical MRM to integrate features of the same category and a Horizontal MRM to combine features of different categories, which can balance the contributions of the bottom-up co-attention and the top-down co-attention. In contrast to existing methods, BCAN is able to obtain complementary information from multiple visual features via the bi-directional co-attention strategy and to integrate multimodal information via the improved multivariate residual strategy. We conduct a series of experiments on two benchmark datasets (MSCOCO and Flickr30k), and the results indicate that the proposed BCAN achieves superior performance.
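The general idea of attending in two directions between two visual feature sets, and then fusing the results with something richer than plain addition or concatenation, can be illustrated with a minimal sketch. This is not the BCAN architecture or the MRM described in the abstract, whose details are not given here. The scaled dot-product affinity, the mean pooling, the `fuse` formula, and all function names and toy feature values below are assumptions chosen only for illustration.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def affinity(a, b):
    """Scaled dot-product affinity between every row of a and every row of b."""
    scale = math.sqrt(len(a[0]))
    return [[sum(x * y for x, y in zip(ra, rb)) / scale for rb in b] for ra in a]

def attend(queries, keys):
    """Replace each query row with a softmax-weighted average of the key rows."""
    dim = len(keys[0])
    out = []
    for scores in affinity(queries, keys):
        weights = softmax(scores)
        out.append([sum(w * k[j] for w, k in zip(weights, keys)) for j in range(dim)])
    return out

def mean_pool(rows):
    """Average a list of equal-length vectors into one vector."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def co_attention(patches, objects):
    """Run attention in both directions and pool each result to one vector."""
    from_objects = mean_pool(attend(patches, objects))  # patches query objects
    from_patches = mean_pool(attend(objects, patches))  # objects query patches
    return from_objects, from_patches

def fuse(u, v):
    """Hypothetical fusion (an assumption, not the paper's MRM):
    elementwise product plus an additive residual of both inputs."""
    return [a * b + a + b for a, b in zip(u, v)]

# Toy inputs: 3 patch features and 2 object features, each of dimension 2.
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
objects = [[1.0, 0.0], [0.0, 2.0]]

from_objects, from_patches = co_attention(patches, objects)
fused = fuse(from_objects, from_patches)
print(fused)
```

The product term in `fuse` lets the two attended vectors modulate each other, while the additive terms keep each input's contribution even when the other is near zero. That is one common motivation for residual-style fusion over plain addition or concatenation.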

CSDNet: Cross-Sketch with Dual Gated Attention for Fine-Grained Image Captioning Network

Cascade Attention Fusion for Fine-Grained Image Captioning Based on Multi-Layer LSTM

GVA: guided visual attention approach for automatic image caption generation

Deliberate Attention Networks for Image Captioning

Dynamic-balanced Double-Attention Fusion for Image Captioning

Exploring Visual Relationship for Image Captioning

CMGNet: Collaborative multi-modal graph network for video captioning

CaptionNet: Automatic End-to-End Siamese Difference Captioning Model with Attention

A Cooperative Approach Based on Self-Attention with Interactive Attribute for Image Caption

Delving Into Precise Attention In Image Captioning

GateCap: Gated Spatial and Semantic Attention Model for Image Captioning

Image caption generation with dual attention mechanism

Dual-level Collaborative Transformer for Image Captioning

Exploring refined dual visual features cross-combination for image captioning

Bi-Directional Co-Attention Network for Image Captioning

Divided Caption Model with Global Attention

MAENet: A Novel Multi-Head Association Attention Enhancement Network for Completing Intra-Modal Interaction in Image Captioning

Learning visual relationship and context-aware attention for image captioning

Cross modification attention-based deliberation model for image captioning

Context-Aware Visual Policy Network for Fine-Grained Image Captioning

Local-global Visual Interaction Attention for Image Captioning