Abstract: Recent work in video captioning has focused on designing architectures that can consume both video and text modalities, and on pre-training with large-scale video datasets that include text transcripts, such as HowTo100M. Though these approaches have achieved significant improvements, the audio modality is often ignored in video captioning. In this work, we present an audio-visual framework that aims to fully exploit the potential of the audio modality for captioning. Instead of relying on text transcripts extracted via automatic speech recognition (ASR), we argue that learning from raw audio signals can be more beneficial, as audio carries additional information such as acoustic events and speaker identity. Our contributions are twofold. First, we observe that the model overspecializes to the audio modality when pre-trained on both the video and audio modalities, since the ground truth (i.e., text transcripts) can be predicted from audio alone. We propose a Modality Balanced Pre-training (MBP) loss to mitigate this issue and significantly improve performance on downstream tasks. Second, we systematically examine different design choices for the cross-modal module, which can become an information bottleneck and degrade results. We propose new local-global fusion mechanisms to improve information exchange between audio and video. We demonstrate significant improvements from leveraging the audio modality on four datasets, and even outperform the state of the art on some metrics without relying on the text modality as input.

Improving the Performance of Automated Audio Captioning via Integrating the Acoustic and Semantic Information

Improving Audio Captioning Models with Fine-grained Audio Features, Text Embedding Supervision, and LLM Mix-up Augmentation

Enhancing Automated Audio Captioning via Large Language Models with Optimized Audio Encoding

Beyond the Status Quo: A Contemporary Survey of Advances and Challenges in Audio Captioning

Evaluating Off-the-Shelf Machine Listening and Natural Language Models for Automated Audio Captioning

Icnn-Transformer: an Improved CNN-Transformer with Channel-spatial Attention and Keyword Prediction for Automated Audio Captioning

An Encoder-Decoder Based Audio Captioning System with Transfer and Reinforcement Learning

Improving Audio Caption Fluency with Automatic Error Correction

Exploring the Role of Audio in Video Captioning

Audio Captioning Based on Transformer and Pre-Training for 2020 DCASE Audio Captioning Challenge Technical Report

SLAM-AAC: Enhancing Audio Captioning with Paraphrasing Augmentation and CLAP-Refine through LLMs

Efficient Audio Captioning with Encoder-Level Knowledge Distillation

Investigating Local and Global Information for Automated Audio Captioning with Transfer Learning

Synth-AC: Enhancing Audio Captioning with Synthetic Supervision

Audio Captioning Based on Transformer and Pre-Trained CNN

QAVidCap: Enhancing Video Captioning Through Question Answering Techniques

Dual Transformer Decoder based Features Fusion Network for Automated Audio Captioning

Performance Improvement of Language-Queried Audio Source Separation Based on Caption Augmentation From Large Language Models for DCASE Challenge 2024 Task 9

AVCap: Leveraging Audio-Visual Features as Text Tokens for Captioning

Improving Image Captioning with Better Use of Caption

Adaptive semantic guidance network for video captioning