Abstract: Automated Audio Captioning (AAC) involves generating natural language descriptions of audio content using encoder-decoder architectures. An audio encoder produces audio embeddings that are fed to a decoder, usually a Transformer decoder, for caption generation. In this work, we describe our model, whose novelty, compared to existing models, lies in the use of a ConvNeXt architecture as audio encoder, adapted from the vision domain to audio classification. This model, called CNext-trans, achieved state-of-the-art scores on the AudioCaps (AC) dataset and performed competitively on Clotho (CL), while using four to forty times fewer parameters than existing models. We examine potential biases in the AC dataset due to its origin from AudioSet by investigating the impact of an unbiased encoder on performance. Using the well-known PANN's CNN14, for instance, as an unbiased encoder, we observed a 1.7% absolute reduction in SPIDEr score (where higher scores indicate better performance). To improve cross-dataset performance, we conducted experiments combining multiple AAC datasets (AC, CL, MACS, WavCaps) for training. Although this strategy enhanced overall model performance across datasets, it still fell short of models trained specifically on a single target dataset, indicating the absence of a one-size-fits-all model. To mitigate performance gaps between datasets, we introduced a Task Embedding (TE) token, allowing the model to identify the source dataset for each input sample. We provide insights into the impact of these TEs on both the form (words) and content (sound event types) of the generated captions. The resulting model, named CoNeTTE, an unbiased CNext-trans model enriched with dataset-specific Task Embeddings, achieved SPIDEr scores of 44.1% and 30.5% on AC and CL, respectively. Code available: https://github.com/Labbeti/conette-audio-captioning.
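
The abstract reports results as SPIDEr scores and as an absolute reduction. As a minimal sketch: SPIDEr is the arithmetic mean of the CIDEr-D and SPICE caption metrics, and an absolute change is a plain difference in score points rather than a ratio. The function names and all numeric values below are illustrative assumptions, not figures from the paper.

```python
def spider(cider_d: float, spice: float) -> float:
    """SPIDEr: arithmetic mean of CIDEr-D and SPICE (higher is better)."""
    return (cider_d + spice) / 2.0


def absolute_change(new: float, old: float) -> float:
    """Difference in score points, e.g. -0.017 means a 1.7-point drop."""
    return new - old


def relative_change(new: float, old: float) -> float:
    """Change expressed as a fraction of the old score."""
    return (new - old) / old


if __name__ == "__main__":
    # Hypothetical metric values, chosen only to illustrate the arithmetic.
    baseline = spider(cider_d=0.70, spice=0.16)
    variant = spider(cider_d=0.68, spice=0.15)
    print(f"baseline SPIDEr: {baseline:.3f}")
    print(f"variant SPIDEr:  {variant:.3f}")
    print(f"absolute change: {absolute_change(variant, baseline):+.3f}")
    print(f"relative change: {relative_change(variant, baseline):+.1%}")
```

The same drop reads differently in the two conventions, which is why the abstract states explicitly that its 1.7% figure is absolute.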

What problem does this paper attempt to address?

The paper addresses several key challenges in Automated Audio Captioning (AAC):

1. **Balance between model performance and parameter count**: Existing AAC models perform well on some datasets but often require a large number of parameters, which increases the demand for computational resources and may also lead to overfitting. The paper proposes a new architecture, CNext-trans, which uses a ConvNeXt adapted from the vision domain as the audio encoder and pairs it with a simple Transformer decoder, maintaining or improving performance while reducing the number of parameters.
2. **Cross-dataset generalization**: Existing AAC models are usually trained and tested on the same dataset, and they perform poorly on other datasets. The paper introduces a Task Embedding (TE) token that tells the model which dataset an input sample comes from, so that it can generate descriptions that better match the style of that dataset.
3. **Dataset bias**: Because the AudioCaps (AC) dataset is drawn from AudioSet (AS), an audio encoder pre-trained on AS may bias the model's results on AC. The paper measures the impact of using an unbiased encoder and reports, for instance, a 1.7% absolute reduction in SPIDEr score with CNN14 as the unbiased encoder.
4. **Effectiveness of multi-dataset joint training**: The paper studies whether combining multiple AAC datasets (AC, Clotho, MACS, and WavCaps) for training improves overall performance. This strategy improves performance across datasets, but it still falls short of a model trained on a single target dataset, indicating that there is currently no universal model that fits all datasets.

In summary, the main objective of the paper is to address the limitations of existing AAC systems in performance, generalization, and dataset bias by improving the model architecture, introducing task embeddings, and adapting the training strategy.
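
The Task Embedding and multi-dataset training ideas above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the token strings, the whitespace tokenizer, the choice to place the TE token at the start of the decoder input, and the helper names `build_decoder_input` and `mix_datasets` are all hypothetical.

```python
from __future__ import annotations

import random

# Hypothetical special tokens: one Task Embedding (TE) token per source dataset.
TASK_TOKENS = {
    "audiocaps": "<te_ac>",
    "clotho": "<te_cl>",
    "macs": "<te_macs>",
    "wavcaps": "<te_wc>",
}
EOS = "<eos>"


def build_decoder_input(caption: str, dataset: str) -> list[str]:
    """Prefix a tokenized caption with the TE token of its source dataset.

    Assumption: the TE token sits where a start-of-sequence token would,
    so the decoder is conditioned on the dataset from the first step.
    """
    if dataset not in TASK_TOKENS:
        raise KeyError(f"unknown dataset: {dataset!r}")
    tokens = caption.lower().split()  # toy whitespace tokenizer
    return [TASK_TOKENS[dataset]] + tokens + [EOS]


def mix_datasets(
    datasets: dict[str, list[str]], seed: int = 0
) -> list[tuple[str, str]]:
    """Pool (dataset, caption) pairs from all datasets and shuffle them."""
    pooled = [(name, cap) for name, caps in datasets.items() for cap in caps]
    random.Random(seed).shuffle(pooled)
    return pooled


if __name__ == "__main__":
    # Made-up captions, used only to show the data flow.
    toy = {
        "audiocaps": ["A dog barks", "Rain falls on a roof"],
        "clotho": ["Birds are singing in the distance"],
    }
    for name, cap in mix_datasets(toy):
        print(build_decoder_input(cap, name))
```

At inference time, choosing which TE token to feed selects the caption style to imitate; this sketch covers only the input-construction side of that idea.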

CoNeTTE: An efficient Audio Captioning system leveraging multiple datasets with Task Embedding
