Abstract: Recent artificial intelligence research has shown great interest in automatically generating text descriptions of images, a task known as image captioning. Remarkable success has been achieved in domains where large amounts of paired multimedia data are available. Nevertheless, annotating sufficient data is labor-intensive and time-consuming, which creates significant barriers to adapting image captioning systems to new domains. In this study, we introduce a novel Multitask Learning Algorithm for cross-Domain Image Captioning (MLADIC). MLADIC is a multitask system that simultaneously optimizes two coupled objectives via a dual learning mechanism: image captioning and text-to-image synthesis. The aim is to enhance image captioning performance in the target domain by leveraging the correlation between the two dual tasks. Concretely, the image captioning task is trained with an encoder–decoder model (i.e., CNN-LSTM) to generate textual descriptions of the input images. The image synthesis task employs a conditional generative adversarial network (C-GAN) to synthesize plausible images based on text descriptions. In C-GAN, a generative model $G$ synthesizes plausible images given text descriptions, and a discriminative model $D$ tries to distinguish the images in the training data from the images generated by $G$. The adversarial process can eventually guide $G$ to generate plausible, high-quality images. To bridge the gap between different domains, a two-step strategy is adopted to transfer knowledge from source domains to target domains. First, we pre-train the model on the abundant labeled source-domain data to learn the alignment between the neural representations of images and those of text. Second, we fine-tune the learned model by leveraging the limited image–text pairs and the unpaired data in the target domain. We conduct extensive experiments to evaluate MLADIC, using MSCOCO as the source domain data and Flickr30k and Oxford-102 as the target domain data. The results demonstrate that MLADIC achieves substantially better performance than strong competitors on the cross-domain image captioning task.
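
The adversarial interplay between the generator and the discriminator described in the abstract can be illustrated with the standard conditional GAN losses. The following is a minimal sketch, assuming the usual binary cross-entropy formulation with a non-saturating generator loss; the abstract does not state which loss variant MLADIC actually uses. The function names and the scalar scores `d_real` and `d_fake` are hypothetical stand-ins for the discriminator's output probabilities on a real image and on a generated image, each conditioned on the same text description.

```python
import math


def discriminator_loss(d_real: float, d_fake: float, eps: float = 1e-12) -> float:
    """Cross-entropy loss for D (assumed formulation, not taken from the paper).

    d_real: D's probability that a real (image, text) pair is real.
    d_fake: D's probability that a generated (image, text) pair is real.
    D is penalized for scoring real pairs low or generated pairs high.
    """
    return -(math.log(d_real + eps) + math.log(1.0 - d_fake + eps))


def generator_loss(d_fake: float, eps: float = 1e-12) -> float:
    """Non-saturating loss for G (assumed formulation, not taken from the paper).

    G is penalized when D assigns its generated image a low probability
    of being real, so minimizing this pushes G toward fooling D.
    """
    return -math.log(d_fake + eps)


if __name__ == "__main__":
    # Case 1: D separates real from generated images well.
    print("confident D:", discriminator_loss(0.9, 0.1), generator_loss(0.1))
    # Case 2: D cannot tell them apart and outputs 0.5 for both.
    print("fooled D:   ", discriminator_loss(0.5, 0.5), generator_loss(0.5))
```

In this sketch, the discriminator loss is lower when D separates real from generated images, while the generator loss is lower when D is fooled. This opposition is what the adversarial process relies on.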

CODP-1200: An AIGC based benchmark for assisting in child language acquisition

PCQA: A Strong Baseline for AIGC Quality Assessment Based on Prompt Condition

Cultural Self-Adaptive Multimodal Gesture Generation Based on Multiple Culture Gesture Dataset

MultiCapCLIP: Auto-Encoding Prompts for Zero-Shot Multilingual Visual Captioning

Image Captioning in Chinese and Its Application for Children with Autism Spectrum Disorder

Interactive Data Synthesis for Systematic Vision Adaptation via LLMs-AIGCs Collaboration

PSCR: Patches Sampling-based Contrastive Regression for AIGC Image Quality Assessment

AIGCBench: Comprehensive Evaluation of Image-to-Video Content Generated by AI

COCO-CN for Cross-Lingual Image Tagging, Captioning and Retrieval

Enhancing Vision-Language Pre-Training with Jointly Learned Questioner and Dense Captioner

PKU-AIGI-500K: A Neural Compression Benchmark and Model for AI-Generated Images

Exploring Discrete Diffusion Models for Image Captioning

AI Challenger : A Large-scale Dataset for Going Deeper in Image Understanding

AI-Generated Content (AIGC) for Various Data Modalities: A Survey

Multitask Learning for Cross-Domain Image Captioning

Auto-Encoding and Distilling Scene Graphs for Image Captioning

Enhancing Image Captioning Using Deep Convolutional Generative Adversarial Networks

Exploring AIGC Video Quality: A Focus on Visual Harmony, Video-Text Consistency and Domain Distribution Gap

Image Captioning with Adaptive Incremental Global Context Attention

AICoderEval: Improving AI Domain Code Generation of Large Language Models

Evaluating AIGC Detectors on Code Content