Abstract: In traditional audio captioning methods, a model is usually trained in a fully supervised manner on a human-annotated dataset of audio-text pairs and then evaluated on the test set of the same dataset. Such methods have two limitations. First, they are data-hungry and require time-consuming, expensive human annotation to obtain audio-text pairs. Second, they often suffer from performance degradation in cross-domain scenarios, i.e., when the input audio comes from a different domain than the training set, a problem that has received little attention. To address these issues, we propose an effective audio captioning method based on the contrastive language-audio pre-training (CLAP) model. Our proposed method requires only textual data for training, enabling the model to generate text from the textual feature in the cross-modal semantic space. In the inference stage, the model generates the descriptive text for the given audio from the audio feature by leveraging the audio-text alignment from CLAP. We devise two strategies to mitigate the discrepancy between text and audio embeddings: a mixed-augmentation-based soft prompt and a retrieval-based acoustic-aware hard prompt. These approaches are designed to enhance the generalization performance of our proposed model, helping it generate captions more robustly and accurately. Extensive experiments on the AudioCaps and Clotho benchmarks show the effectiveness of our proposed method, which outperforms other zero-shot audio captioning approaches in in-domain scenarios and outperforms the compared methods in cross-domain scenarios, underscoring the generalization ability of our method.
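
As a rough illustration of what a retrieval-based hard prompt could look like, the sketch below ranks a small bank of sound-event labels by cosine similarity to an audio embedding and writes the top matches into a text prompt. This is not the authors' implementation: the function names, the toy 3-dimensional vectors, the label bank, and the prompt template are all assumptions made for illustration. In practice the embeddings would come from a shared audio-text encoder such as CLAP.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve_hard_prompt(audio_emb, label_embs, labels, k=2):
    """Build a text prompt from the k labels closest to the audio embedding.

    Hypothetical helper: the template string is an assumption,
    not taken from the paper.
    """
    ranked = sorted(
        zip(labels, label_embs),
        key=lambda pair: cosine(audio_emb, pair[1]),
        reverse=True,
    )
    picked = [label for label, _ in ranked[:k]]
    return "This audio contains: " + ", ".join(picked) + "."

# Toy stand-ins for embeddings in a shared audio-text space (assumed values).
labels = ["dog barking", "rain", "car engine"]
label_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
audio_emb = [0.9, 0.4, 0.1]

prompt = retrieve_hard_prompt(audio_emb, label_embs, labels, k=2)
print(prompt)
```

A prompt built this way would be passed to the text decoder alongside the soft prompt. Because retrieval only needs the audio embedding and a text bank, it is compatible with training on text alone.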

An investigation on selecting audio pre-trained models for audio captioning

Exploring the Role of Audio in Video Captioning

Audio Captioning using Pre-Trained Large-Scale Language Model Guided by Audio-based Similar Caption Retrieval

Evaluating Off-the-Shelf Machine Listening and Natural Language Models for Automated Audio Captioning

Caption Feature Space Regularization for Audio Captioning

Revisiting Pre-training in Audio-Visual Learning

Interpreting Pretrained Speech Models for Automatic Speech Assessment of Voice Disorders

Zero-Shot Audio Captioning Using Soft and Hard Prompts

BLAT: Bootstrapping Language-Audio Pre-training based on AudioSet Tag-guided Synthetic Data

Tuning In: Analysis of Audio Classifier Performance in Clinical Settings with Limited Data

Investigations in Audio Captioning: Addressing Vocabulary Imbalance and Evaluating Suitability of Language-Centric Performance Metrics

Diversity-Controllable and Accurate Audio Captioning Based on Neural Condition

Measuring Sound Symbolism in Audio-visual Models

Parameter Efficient Audio Captioning With Faithful Guidance Using Audio-text Shared Latent Representation

Improving the Performance of Automated Audio Captioning via Integrating the Acoustic and Semantic Information

Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models

Joint Training or Not: An Exploration of Pre-trained Speech Models in Audio-Visual Speaker Diarization

Utilizing Self-supervised Representations for MOS Prediction

Audio Caption: Listen And Tell

Improving Audio Captioning Models with Fine-grained Audio Features, Text Embedding Supervision, and LLM Mix-up Augmentation