CIC-BART-SSA: Controllable Image Captioning with Structured Semantic Augmentation

Kalliopi Basioti, Mohamed A. Abdelsalam, Federico Fancellu, Vladimir Pavlovic, Afsaneh Fazly

2024-07-18

Abstract: Controllable Image Captioning (CIC) aims at generating natural language descriptions for an image, conditioned on information provided by end users, e.g., regions, entities or events of interest. However, available image-language datasets mainly contain captions that describe the entirety of an image, making them ineffective for training CIC models that can potentially attend to any subset of regions or relationships. To tackle this challenge, we propose a novel, fully automatic method to sample additional focused and visually grounded captions using a unified structured semantic representation built on top of the existing set of captions associated with an image. We leverage Abstract Meaning Representation (AMR), a cross-lingual graph-based semantic formalism, to encode all possible spatio-semantic relations between entities, beyond the typical spatial-relations-only focus of current methods. We use this Structured Semantic Augmentation (SSA) framework to augment existing image-caption datasets with the grounded controlled captions, increasing their spatial and semantic diversity and focal coverage. We then develop a new model, CIC-BART-SSA, specifically tailored for the CIC task, that sources its control signals from SSA-diversified datasets. We empirically show that, compared to SOTA CIC models, CIC-BART-SSA generates captions that are superior in diversity and text quality, are competitive in controllability, and, importantly, minimize the gap between broad and highly focused controlled captioning performance by efficiently generalizing to the challenging highly focused scenarios. Code is available at https://github.com/SamsungLabs/CIC-BART-SSA.

Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language, Machine Learning

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper primarily addresses several key issues in Controllable Image Captioning (CIC):

1. **Diversity and Controllability**:
   - Existing image captioning datasets typically contain captions that describe the entire image, lacking diversity and controllability.
   - The proposed method aims to generate diverse and controllable captions that can produce different descriptions based on user-specified signals (such as areas of interest, length, etc.).
2. **Data Augmentation Techniques**:
   - The Structured Semantic Augmentation (SSA) method is proposed, which uses Abstract Meaning Representation (AMR) to automatically generate captions with different spatial semantic focuses.
   - This method can increase the spatial and semantic diversity of existing datasets and improve the coverage of specific scenes.
3. **Model Design**:
   - A new CIC-BART-SSA model is designed specifically for the task of controllable image captioning.
   - The model can generate high-quality and diverse captions while maintaining good responsiveness to control signals.
4. **Simplified Control Signals**:
   - It does not rely on complex control signals (such as detailed syntactic structures) but uses simple control signals (such as areas of interest and expected caption length) to achieve a balance between practicality and performance.

Through these improvements, the proposed method demonstrates higher diversity and text quality in the task of controllable image captioning while maintaining good controllability.
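To make the idea of sampling focused captions from a structured representation more concrete, here is a toy sketch. It is not the authors' implementation: real AMR graphs are much richer, and the paper's sampling and text-generation steps are not specified in this summary. The triples, the function names `is_connected` and `focused_subgraphs`, and the `max_edges` parameter are all hypothetical. The sketch only shows how connected sub-graphs that mention a chosen set of entities could be enumerated as candidate caption focuses.

```python
from itertools import combinations

# Hypothetical toy graph: (source, relation, target) triples, standing in for
# a semantic graph merged from several captions of one image.
TRIPLES = [
    ("man", "ride", "horse"),
    ("man", "wear", "hat"),
    ("horse", "on", "beach"),
    ("dog", "run-beside", "horse"),
]


def is_connected(triples):
    """Return True if the triples form a single connected component.

    Edges are treated as undirected for the connectivity check.
    """
    if not triples:
        return False
    adjacency = {}
    for src, _, dst in triples:
        adjacency.setdefault(src, set()).add(dst)
        adjacency.setdefault(dst, set()).add(src)
    start = next(iter(adjacency))
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for neighbour in adjacency[node] - seen:
            seen.add(neighbour)
            stack.append(neighbour)
    return seen == set(adjacency)


def focused_subgraphs(triples, focus_entities, max_edges=2):
    """Enumerate connected sub-graphs (as lists of triples) with at most
    `max_edges` edges that mention every entity in `focus_entities`."""
    focus = set(focus_entities)
    results = []
    for size in range(1, max_edges + 1):
        for subset in combinations(triples, size):
            nodes = {n for src, _, dst in subset for n in (src, dst)}
            if focus <= nodes and is_connected(subset):
                results.append(list(subset))
    return results


if __name__ == "__main__":
    # Each sub-graph is one candidate "focus" for a controlled caption.
    for subgraph in focused_subgraphs(TRIPLES, ["dog"]):
        print(subgraph)
```

In this toy setting, each returned sub-graph would then be handed to some graph-to-text step to produce a focused caption; that step is left out here because its details are outside what this page describes.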

Human-like Controllable Image Captioning with Verb-specific Semantic Roles

IC3: Image Captioning by Committee Consensus

Unpaired Image Captioning With Semantic-Constrained Self-Learning

Structural Semantic Adversarial Active Learning for Image Captioning

Semantic-CC: Boosting Remote Sensing Image Change Captioning via Foundational Knowledge and Semantic Guidance

Caption Anything: Interactive Image Description with Diverse Multimodal Controls

Multi-Source Interactive Stair Attention for Remote Sensing Image Captioning

CIC: A framework for Culturally-aware Image Captioning

Controllable Contextualized Image Captioning: Directing the Visual Narrative through User-Defined Highlights

Improving Image Captioning with Better Use of Captions

Controllable image caption with an encoder-decoder optimization structure

ADS-Cap: A Framework for Accurate and Diverse Stylized Captioning with Unpaired Stylistic Corpora

Improving Image Captioning Descriptiveness by Ranking and LLM-based Fusion

Improving Multimodal Datasets with Image Captioning

Semantic-Driven Saliency-Context Separation for Video Captioning

Aesthetic Image Captioning From Weakly-Labelled Photographs

Adversarial Semantic Alignment for Improved Image Captions

Exploring Overall Contextual Information for Image Captioning in Human-Like Cognitive Style

OSIC: A New One-Stage Image Captioner Coined