ADS-Cap: A Framework for Accurate and Diverse Stylized Captioning with Unpaired Stylistic Corpora

Kanzhi Cheng, Zheng Ma, Shi Zong, Jianbing Zhang, Xinyu Dai, Jiajun Chen
2023-08-02
Abstract: Generating visually grounded image captions with specific linguistic styles using unpaired stylistic corpora is a challenging task, especially since we expect stylized captions with a wide variety of stylistic patterns. In this paper, we propose a novel framework to generate Accurate and Diverse Stylized Captions (ADS-Cap). Our ADS-Cap first uses a contrastive learning module to align the image and text features, which unifies paired factual and unpaired stylistic corpora during the training process. A conditional variational auto-encoder is then used to automatically memorize diverse stylistic patterns in latent space and enhance diversity through sampling. We also design a simple but effective recheck module to boost style accuracy by filtering style-specific captions. Experimental results on two widely used stylized image captioning datasets show that regarding consistency with the image, style accuracy and diversity, ADS-Cap achieves outstanding performances compared to various baselines. We finally conduct extensive analyses to understand the effectiveness of our method. Our code is available at https://github.com/njucckevin/ADS-Cap.
Computer Vision and Pattern Recognition, Computation and Language
What problem does this paper attempt to address?
The paper addresses two key challenges in generating image captions with a specific linguistic style:

1. **Maintaining consistency with the image when training on unpaired stylistic corpora**: Most previous work pre-trains the model on large-scale factual image-caption pairs and then fine-tunes it on a small monolingual text corpus to inject a specific linguistic style. This shifts the model's focus toward style and weakens consistency with the image. Some approaches introduce an intermediate representation between vision and text, such as semantic terms or scene graphs, but conversion errors can degrade their performance.
2. **Generating diverse stylistic patterns**: Many previous works ignore the diversity of stylistic patterns. For example, given different images of similar scenes, a baseline model tends to repeat the same stylistic phrase (such as "to meet his lover"), which defeats the purpose of stylized image captioning. Moreover, stylistic corpora are usually small, which makes it hard for existing methods to produce varied stylistic patterns.

To address these problems, the paper proposes an end-to-end framework, ADS-Cap, that combines contrastive learning with a conditional variational auto-encoder to generate accurate and diverse stylized image captions:

- **Contrastive learning module**: Unifies training on paired factual image-caption data and unpaired stylistic corpora. Because image features and object-word features are aligned through contrastive learning, fine-tuning on unpaired stylistic corpora does not reduce consistency with the image.
- **Conditional variational auto-encoder (CVAE)**: Automatically memorizes diverse stylistic patterns in latent space during training, and increases the diversity of generated captions by sampling different latent variables at inference time.
- **Recheck module**: A simple but effective module that improves style accuracy by filtering the candidate set for captions that carry the target style.

Experimental results on two widely used stylized image captioning datasets show that ADS-Cap outperforms various baseline methods in image consistency, style accuracy, and diversity.
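
The three components summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation (see their repository for that). Every name here (`info_nce`, `sample_latent`, `kl_to_standard_normal`, `recheck`) is hypothetical, and the specific choices are assumptions made for illustration: an InfoNCE-style loss over cosine similarities, a standard normal prior for the CVAE, and a thresholded style score for the recheck step.

```python
import math
import random


def info_nce(image_feats, text_feats, temperature=0.1):
    """InfoNCE-style contrastive loss (assumed form): the i-th image vector
    should be most similar to the i-th text vector in the batch."""
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / norm for x in v]

    imgs = [normalize(v) for v in image_feats]
    txts = [normalize(v) for v in text_feats]
    loss = 0.0
    for i, img in enumerate(imgs):
        # Cosine similarity of image i to every text, scaled by temperature.
        logits = [sum(a * b for a, b in zip(img, t)) / temperature for t in txts]
        # Numerically stable log-sum-exp over the batch.
        top = max(logits)
        log_z = top + math.log(sum(math.exp(l - top) for l in logits))
        loss += log_z - logits[i]  # cross-entropy with target index i
    return loss / len(imgs)


def sample_latent(mu, log_var, rng):
    """Reparameterized sample z = mu + sigma * eps with eps ~ N(0, 1).
    Drawing several z values for one image yields different captions."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, I)); the standard normal prior is an
    assumption of this sketch, not a detail taken from the paper."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def recheck(candidates, style_score, threshold=0.5):
    """Keep only candidates whose style score reaches the threshold.
    `style_score` stands in for a style classifier (hypothetical)."""
    return [c for c in candidates if style_score(c) >= threshold]
```

In use, one would sample several latent vectors per image, decode one candidate caption from each, and pass the candidates through `recheck` so that only captions judged to carry the target style remain. The decoder and the style classifier are left out because the summary does not describe them.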