Abstract: Over the years, state-of-the-art (SoTA) image captioning methods have achieved promising results on some evaluation metrics (e.g., CIDEr). However, recent findings show that the captions generated by these methods tend to be biased toward the "average" caption, which captures only the most general mode (a.k.a. language pattern) in the training corpus; this is the so-called mode collapse problem. As a result, the generated captions are limited in diversity and usually less informative than natural image descriptions written by humans. In this paper, we seek to avoid this problem by proposing a Discrete Mode Learning (DML) paradigm for image captioning. Our idea is to explore the rich modes in the training caption corpus to learn a set of "mode embeddings", and then use them to control the mode of the captions generated by existing image captioning models. Specifically, the proposed DML optimizes a dual architecture that consists of an image-conditioned discrete variational autoencoder (CdVAE) branch and a mode-conditioned image captioning (MIC) branch. The CdVAE branch maps each image caption to one of the mode embeddings stored in a learned codebook, and is trained with a pure non-autoregressive generation objective to make the modes distinct and representative. The MIC branch can be obtained by simply modifying an existing image captioning model, where the mode embedding is added to the original word embeddings as the control signal. In the experiments, we apply the proposed DML to two widely used image captioning models, Transformer and AoANet. The results show that the learned mode embeddings enable these models to generate high-quality image captions with different modes, leading to better performance in both diversity and quality on the MSCOCO dataset.
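The two operations the abstract names, mapping a caption to one entry of a codebook and adding that mode embedding to the word embeddings, can be illustrated with a toy sketch. This is not the paper's implementation: the nearest-neighbour (squared L2) lookup is an assumption borrowed from common vector-quantization practice, and the function names `nearest_mode` and `condition_on_mode`, the 2-dimensional vectors, and the codebook values are all hypothetical.

```python
# Toy sketch only. Assumption: the caption is already encoded as a vector and
# the codebook lookup is a nearest-neighbour search (not stated in the abstract).

def nearest_mode(caption_vec, codebook):
    """Return the index of the codebook entry closest to caption_vec (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(caption_vec, codebook[k]))

def condition_on_mode(word_embeddings, mode_embedding):
    """Add the mode embedding to every word embedding as the control signal."""
    return [[w + m for w, m in zip(word, mode_embedding)] for word in word_embeddings]

# Hypothetical codebook with three mode embeddings of dimension 2.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]

# Hypothetical encoded caption; it lies closest to the second codebook entry.
caption_vec = [0.9, 1.2]
k = nearest_mode(caption_vec, codebook)

# Hypothetical word embeddings for a two-token caption prefix.
words = [[0.1, 0.2], [0.3, 0.4]]
conditioned = condition_on_mode(words, codebook[k])
```

At inference time one would presumably pick a different index `k` for each desired caption, so that the same captioning model produces captions in different modes; how the mode is chosen is not specified in the abstract.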

Style-aware Two-Stage Learning Framework for Video Captioning

UATST: Towards Unpaired Arbitrary Text-Guided Style Transfer with Cross-Space Modulation

Learning Multimodal Attention LSTM Networks for Video Captioning

ADS-Cap: A Framework for Accurate and Diverse Stylized Captioning with Unpaired Stylistic Corpora

Adaptive Curriculum Learning for Video Captioning

Multi-scale features with temporal information guidance for video captioning

Stacked Multimodal Attention Network for Context-Aware Video Captioning

Learning Distinct and Representative Styles for Image Captioning

StyleDubber: Towards Multi-Scale Style Learning for Movie Dubbing

Learning Video-Text Aligned Representations for Video Captioning

StyleCrafter: Enhancing Stylized Text-to-Video Generation with Style Adapter

Discriminative Style Learning for Cross-Domain Image Captioning

StyleCap: Automatic Speaking-Style Captioning from Speech Based on Speech and Language Self-supervised Learning Models

Self Attention Re-encoding and Linguistic Ability Preserving for Context-Aware Video Captioning

Say Anything with Any Style

Delving Deeper into the Decoder for Video Captioning

Subject-Oriented Video Captioning

SpeechCaps: Advancing Instruction-Based Universal Speech Models with Multi-Talker Speaking Style Captioning

Weakly Supervised Dense Video Captioning

Non-Autoregressive Coarse-to-Fine Video Captioning

Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention