Abstract: Recent advances in text-to-image generation have witnessed the rise of diffusion models, which act as powerful generative models. Nevertheless, it is not trivial to exploit such latent variable models to capture the dependency among discrete words while pursuing complex vision-language alignment in image captioning. In this paper, we break the deeply rooted conventions in learning Transformer-based encoder-decoders and propose a new diffusion-model-based paradigm tailored for image captioning, namely Semantic-Conditional Diffusion Networks (SCD-Net). Technically, for each input image, we first search for semantically relevant sentences via a cross-modal retrieval model to convey comprehensive semantic information. The rich semantics are then regarded as a semantic prior to trigger the learning of a Diffusion Transformer, which produces the output sentence in a diffusion process. In SCD-Net, multiple Diffusion Transformer structures are stacked to progressively strengthen the output sentence with better vision-language alignment and linguistic coherence in a cascaded manner. Furthermore, to stabilize the diffusion process, a new self-critical sequence training strategy is designed to guide the learning of SCD-Net with the knowledge of a standard autoregressive Transformer model. Extensive experiments on the COCO dataset demonstrate the promising potential of using diffusion models in the challenging image captioning task. Source code is available at https://github.com/YehLi/xmodaler/tree/master/configs/image_caption/scdnet.
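
The retrieval step described in the abstract (finding sentences semantically relevant to an input image) can be illustrated with a minimal sketch. This is not the SCD-Net implementation: it assumes image and sentence embeddings already live in a shared space and ranks sentences by cosine similarity, which is one common way to do cross-modal retrieval. The function names `cosine` and `retrieve_top_k`, the toy embeddings, and the value of `k` are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve_top_k(image_emb, sentence_embs, sentences, k=2):
    """Return the k sentences whose embeddings best match the image embedding."""
    scored = [
        (cosine(image_emb, emb), sent)
        for emb, sent in zip(sentence_embs, sentences)
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sent for _, sent in scored[:k]]


# Toy, hand-made embeddings (assumption: a real system would obtain these
# from a trained cross-modal encoder, not from hard-coded numbers).
sentences = ["a dog runs on grass", "a plate of food", "a puppy plays outside"]
sentence_embs = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9], [0.8, 0.3, 0.1]]
image_emb = [1.0, 0.2, 0.0]

# The retrieved sentences would serve as the semantic prior for captioning.
semantic_prior = retrieve_top_k(image_emb, sentence_embs, sentences, k=2)
print(semantic_prior)
```

In this toy setup the two dog-related sentences are retrieved, and in a pipeline like the one the abstract outlines they would be passed on as the semantic condition for the diffusion-based caption generator.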

Diffusion-Cap: A Diffusion Model for Image Captioning

DiffCap: Exploring Continuous Diffusion on Image Captioning

Exploring Discrete Diffusion Models for Image Captioning

CLIP-Diffusion-LM: Apply Diffusion Model on Image Captioning

LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation?

Semantic-Conditional Diffusion Networks for Image Captioning

Prefix-diffusion: A Lightweight Diffusion Model for Diverse Image Captioning

Towards Diverse and Efficient Audio Captioning via Diffusion Models

DECap: Towards Generalized Explicit Caption Editing Via Diffusion Mechanism

Multimodal Data Augmentation for Image Captioning using Diffusion Models

Versatile Diffusion: Text, Images and Variations All in One Diffusion Model

Out-of-Distribution with Text-to-Image Diffusion Models

Diffusion-RSCC: Diffusion Probabilistic Model for Change Captioning in Remote Sensing Images

Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs

Diffusion-Adapter: Text Guided Image Manipulation with Frozen Diffusion Models

Are Diffusion Models Vision-And-Language Reasoners?

Contextualized Diffusion Models for Text-Guided Image and Video Generation

Dual Diffusion for Unified Image Generation and Understanding

ECNet: Effective Controllable Text-to-Image Diffusion Models