Abstract: Generating visually grounded image captions with specific linguistic styles using unpaired stylistic corpora is a challenging task, especially since we expect stylized captions with a wide variety of stylistic patterns. In this paper, we propose a novel framework to generate Accurate and Diverse Stylized Captions (ADS-Cap). Our ADS-Cap first uses a contrastive learning module to align the image and text features, which unifies paired factual and unpaired stylistic corpora during the training process. A conditional variational auto-encoder is then used to automatically memorize diverse stylistic patterns in latent space and enhance diversity through sampling. We also design a simple but effective recheck module to boost style accuracy by filtering style-specific captions. Experimental results on two widely used stylized image captioning datasets show that, in terms of consistency with the image, style accuracy, and diversity, ADS-Cap achieves outstanding performance compared to various baselines. Finally, we conduct extensive analyses to understand the effectiveness of our method. Our code is available at https://github.com/njucckevin/ADS-Cap.
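The sample-then-filter idea in the abstract (draw several latent variables, decode one caption per sample, then keep only the captions a recheck step judges to be stylized) can be illustrated with a minimal sketch. Everything named here is a hypothetical stand-in rather than the authors' code: `toy_decoder` replaces the trained caption decoder, `toy_style_score` replaces a trained style classifier, and the standard normal prior and the 0.5 threshold are assumptions made for illustration.

```python
import random
from typing import Callable, List, Sequence


def sample_latents(num_samples: int, latent_dim: int, seed: int = 0) -> List[List[float]]:
    """Draw latent vectors from a standard normal prior (an assumed prior)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(latent_dim)] for _ in range(num_samples)]


def recheck(
    candidates: Sequence[str],
    style_score: Callable[[str], float],
    threshold: float = 0.5,
) -> List[str]:
    """Keep only the candidates whose style score reaches the threshold."""
    return [caption for caption in candidates if style_score(caption) >= threshold]


def toy_decoder(image_id: str, z: Sequence[float]) -> str:
    """Hypothetical decoder: picks a phrase from the first latent coordinate."""
    phrases = ["to meet his lover", "chasing his dreams", "", "with all her heart"]
    index = int(abs(z[0]) * 10) % len(phrases)
    return f"a person walks on the beach {phrases[index]}".strip()


def toy_style_score(caption: str) -> float:
    """Hypothetical style classifier: 1.0 if a stylistic cue word is present."""
    cues = ("lover", "dreams", "heart")
    return 1.0 if any(cue in caption for cue in cues) else 0.0


# Different latent samples give different candidate captions for the same image.
latents = sample_latents(num_samples=8, latent_dim=4)
candidates = [toy_decoder("img_001", z) for z in latents]

# The recheck step discards candidates that came out purely factual.
stylized = recheck(candidates, toy_style_score)
```

In a real system the decoder would condition on image features as well as the latent variable, and the style scorer would be a learned classifier; only the control flow is meant to carry over.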

What problem does this paper attempt to address?

The paper addresses two key challenges in generating image captions with a specific linguistic style:

1. **Maintaining consistency with the image when training with unpaired stylistic corpora**: Most previous works pre-train the model on large-scale factual image-caption pairs and then fine-tune it on a small monolingual text corpus to incorporate a specific linguistic style. However, this practice causes the model to focus more on linguistic style, which reduces consistency with the image. Some efforts try to find an intermediate representation between vision and text, such as semantic terms or scene graphs, but these methods can degrade performance because of conversion errors.
2. **Generating diverse stylistic patterns**: Many previous works ignore the diversity of stylistic patterns. For example, given different images of similar scenes, the baseline model tends to generate the same generic stylistic phrases (such as "to meet his lover"), which deviates significantly from the purpose of the stylized image captioning task. Moreover, stylistic corpora are usually small, making it difficult for existing methods to generate multiple stylistic patterns.

To solve these problems, the paper proposes an end-to-end framework (ADS-Cap) that uses contrastive learning and a conditional variational auto-encoder to generate accurate and diverse stylized image captions. Specifically, the framework addresses the above challenges in the following ways:

- **Contrastive learning module**: Unifies training on paired factual image-caption data and unpaired stylistic corpora. Because image features and object-word features are aligned through contrastive learning, consistency with the image is not reduced when fine-tuning on unpaired stylistic corpora.
- **Conditional variational auto-encoder (CVAE)**: Automatically memorizes diverse stylistic patterns in latent space during training and increases the diversity of generated captions by sampling different latent variables at inference time.
- **Recheck module**: A simple but effective module that improves style accuracy by filtering the candidate set so that only captions in the target style are kept.

Experimental results on two widely used stylized image captioning datasets show that ADS-Cap outperforms various baseline methods in terms of image consistency, style accuracy, and diversity.
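The contrastive alignment between image features and object-word features can be sketched with an InfoNCE-style loss. This is only an illustration: the summary does not state the exact loss used in ADS-Cap, so the InfoNCE form, the cosine similarity, the temperature of 0.1, and the names `cosine` and `info_nce` are all assumptions.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(
    image_feats: Sequence[Vector],
    text_feats: Sequence[Vector],
    temperature: float = 0.1,
) -> float:
    """Mean cross-entropy of matching image i to text i within the batch.

    Row i of image_feats is assumed to be the positive pair of row i of
    text_feats; every other text in the batch acts as a negative.
    """
    n = len(image_feats)
    total = 0.0
    for i in range(n):
        logits = [cosine(image_feats[i], text_feats[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / n


# Toy features: each image vector points the same way as its own text vector.
image_feats = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[0.9, 0.1], [0.1, 0.9]]
shuffled_texts = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = info_nce(image_feats, aligned_texts)
loss_shuffled = info_nce(image_feats, shuffled_texts)
```

With these toy vectors, `loss_aligned` is much smaller than `loss_shuffled`, which is the behavior such an objective rewards: matched image and text features are pulled together, mismatched ones pushed apart.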

ADS-Cap: A Framework for Accurate and Diverse Stylized Captioning with Unpaired Stylistic Corpora
