Abstract: In traditional audio captioning methods, a model is usually trained in a fully supervised manner on a human-annotated dataset of audio-text pairs and then evaluated on the test set of the same dataset. Such methods have two limitations. First, they are data-hungry and require time-consuming, expensive human annotation to obtain audio-text pairs. Second, they often suffer from performance degradation in cross-domain scenarios, i.e., when the input audio comes from a different domain than the training set, a problem that has received little attention. We propose an effective audio captioning method based on the contrastive language-audio pre-training (CLAP) model to address these issues. Our proposed method requires only textual data for training, enabling the model to generate text from the textual feature in the cross-modal semantic space. In the inference stage, the model generates the descriptive text for the given audio from the audio feature by leveraging the audio-text alignment from CLAP. We devise two strategies to mitigate the discrepancy between text and audio embeddings: a mixed-augmentation-based soft prompt and a retrieval-based acoustic-aware hard prompt. These approaches are designed to enhance the generalization performance of our proposed model, helping it generate captions more robustly and accurately. Extensive experiments on the AudioCaps and Clotho benchmarks show the effectiveness of our proposed method, which outperforms other zero-shot audio captioning approaches in in-domain scenarios and outperforms the compared methods in cross-domain scenarios, underscoring the generalization ability of our method.
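To make the retrieval-based hard prompt more concrete, the following is a minimal sketch of one way such a prompt could be built. It is not the authors' implementation: the function names, the toy three-dimensional embeddings, the label vocabulary, and the prompt template are all assumptions made for illustration. In practice the embeddings would come from the audio and text encoders of a CLAP model, which share one semantic space.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve_hard_prompt(audio_emb, label_embs, k=2):
    """Rank candidate sound-event labels by similarity to the audio
    embedding and place the top-k labels in a textual prompt.
    The template below is hypothetical, not taken from the paper."""
    ranked = sorted(
        label_embs,
        key=lambda name: cosine(audio_emb, label_embs[name]),
        reverse=True,
    )
    return "There are " + ", ".join(ranked[:k]) + " in the audio."

# Toy embeddings standing in for CLAP text embeddings (assumed values).
label_embs = {
    "dog barking": [0.9, 0.1, 0.0],
    "rain": [0.0, 1.0, 0.1],
    "car engine": [0.1, 0.0, 1.0],
}
# Toy embedding standing in for a CLAP audio embedding (assumed values).
audio_emb = [0.8, 0.3, 0.05]

prompt = retrieve_hard_prompt(audio_emb, label_embs, k=2)
print(prompt)
```

Because the retrieved labels are plain text, a prompt of this kind can be passed to a text decoder that never saw audio during training, which is one way the audio-text alignment could be exploited at inference time.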

Zero-TextCap: Zero-shot Framework for Text-based Image Captioning

Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training

MeaCap: Memory-Augmented Zero-shot Image Captioning

ZeroCap: Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic

A Simple Framework for Open-Vocabulary Zero-Shot Segmentation

MultiCapCLIP: Auto-Encoding Prompts for Zero-Shot Multilingual Visual Captioning

DeCap: Decoding CLIP Latents for Zero-Shot Captioning via Text-Only Training

Zero-shot audio captioning with audio-language model guidance and audio context keywords

Accurate and Complete Captions for Question-controlled Text-aware Image Captioning

LCM-Captioner: A lightweight text-based image captioning method with collaborative mechanism between vision and text

Zero-Shot Audio Captioning Using Soft and Hard Prompts

ConZIC: Controllable Zero-shot Image Captioning by Sampling-Based Polishing

CapS-Adapter: Caption-based MultiModal Adapter in Zero-Shot Classification

Text-only Synthesis for Image Captioning

Cap2Seg: Inferring Semantic and Spatial Context from Captions for Zero-Shot Image Segmentation

ZeroGen: Zero-shot Multimodal Controllable Text Generation with Multiple Oracles

Zero-Shot Dense Video Captioning by Jointly Optimizing Text and Moment

Image2Text: A multimodal caption generator

Simple is not Easy: A Simple Strong Baseline for TextVQA and TextCaps

Text Augmented Spatial-aware Zero-shot Referring Image Segmentation