Abstract: Over the past several years, the synchronization between audio and visual signals has been leveraged to learn richer audio-visual representations. Aided by the wide availability of unlabeled videos, many unsupervised training frameworks have demonstrated impressive results in various downstream audio and video tasks. Recently, Masked Audio-Video Learners (MAViL) has emerged as a state-of-the-art audio-video pre-training framework. MAViL couples contrastive learning with masked autoencoding to jointly reconstruct audio spectrograms and video frames by fusing information from both modalities. In this paper, we study the potential synergy between diffusion models and MAViL, seeking to derive mutual benefits from these two frameworks. Incorporating diffusion into MAViL, together with training-efficiency techniques that include a masking ratio curriculum and adaptive batch sizing, yields a 32% reduction in pre-training Floating-Point Operations (FLOPS) and an 18% decrease in pre-training wall clock time. Crucially, this efficiency gain does not compromise performance on downstream audio-classification tasks relative to MAViL.
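The abstract names two efficiency techniques, a masking ratio curriculum and adaptive batch sizing, without giving their schedules. The following is a minimal sketch of how such a pairing could work, not the authors' method: the linear schedule, its direction, the start and end ratios, and the fixed visible-token budget are all assumptions, and the function names `masking_ratio` and `adaptive_batch_size` are hypothetical.

```python
def masking_ratio(step, total_steps, start=0.9, end=0.8):
    """Linearly interpolate the masking ratio over training.

    ASSUMPTION: the linear shape and the start/end values are illustrative;
    the source does not specify the curriculum.
    """
    frac = step / max(total_steps - 1, 1)
    frac = min(max(frac, 0.0), 1.0)  # clamp to [0, 1]
    return start + (end - start) * frac


def adaptive_batch_size(ratio, token_budget, tokens_per_sample):
    """Pick a batch size that keeps the number of visible tokens roughly fixed.

    ASSUMPTION: the encoder only processes unmasked tokens, so a higher
    masking ratio leaves room for more samples per batch.
    """
    visible = max(1, round(tokens_per_sample * (1.0 - ratio)))
    return max(1, token_budget // visible)


if __name__ == "__main__":
    total = 5
    for step in range(total):
        r = masking_ratio(step, total)
        b = adaptive_batch_size(r, token_budget=10_000, tokens_per_sample=1_000)
        print(f"step={step} mask_ratio={r:.3f} batch_size={b}")
```

Under these assumptions, the per-step encoder cost stays roughly constant while the batch size shrinks as more tokens become visible.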

What problem does this paper attempt to address?

The paper aims to improve the efficiency of audio-video pre-training while maintaining or improving performance on downstream tasks. Specifically, it explores how to combine diffusion models with the existing Masked Audio-Video Learners (MAViL) framework to make pre-training more efficient. By introducing diffusion techniques and adopting strategies such as a dynamic masking ratio and an adaptive batch size, the paper aims to reduce both the floating-point operations (FLOPS) required for pre-training and the wall clock training time, without sacrificing performance on downstream audio classification tasks.

### Main Contributions

1. **Diffusion-enhanced audio-video pre-training**: The paper shows that diffusion-based masked audio-video pre-training learns rich audio-video representations without sacrificing performance on downstream audio classification tasks.
2. **Improved pre-training efficiency**: By using cross-attention in place of self-attention and by studying a masking ratio curriculum and dynamic batch sizing, the paper reduces pre-training FLOPS by 32% and wall clock training time by 18% while maintaining model accuracy.

### Experimental Results

- **Performance comparison**: Experiments on multiple datasets show that DiffMAViL improves training efficiency while maintaining performance comparable to MAViL.
- **Ablation experiments**: By replacing self-attention with cross-attention, using a dynamic masking ratio, and introducing an adaptive batch size, the paper analyzes the impact of each strategy on pre-training efficiency.

### Conclusion

By integrating diffusion techniques into the MAViL framework and combining several training-efficiency strategies, the paper improves pre-training efficiency while maintaining downstream performance. This matters for large-scale audio-video pre-training, where it lowers the demand for computational resources and can shorten model development.
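To give a feel for why swapping self-attention for cross-attention can cut compute, here is a rough back-of-the-envelope sketch. It assumes a decoder in which masked tokens act as queries over visible tokens only; the source does not describe the exact layout, so this arrangement, the token counts, and the helper names are all hypothetical. It counts only the attention score and value-aggregation multiply-accumulates, so it does not reproduce the reported 32% figure.

```python
def attention_macs(n_queries, n_keys, dim):
    """Multiply-accumulates for QK^T scores plus the weighted sum over values."""
    return 2 * n_queries * n_keys * dim


def decoder_attention_macs(n_tokens, mask_ratio, dim):
    """Compare one attention layer under two layouts.

    ASSUMPTION: self-attention runs over all tokens, while cross-attention
    uses masked tokens as queries and visible tokens as keys/values.
    """
    n_masked = round(n_tokens * mask_ratio)
    n_visible = n_tokens - n_masked
    self_attn = attention_macs(n_tokens, n_tokens, dim)
    cross_attn = attention_macs(n_masked, n_visible, dim)
    return self_attn, cross_attn


if __name__ == "__main__":
    self_cost, cross_cost = decoder_attention_macs(n_tokens=1000, mask_ratio=0.8, dim=512)
    print(f"self-attention MACs:  {self_cost:,}")
    print(f"cross-attention MACs: {cross_cost:,}")
    print(f"ratio: {cross_cost / self_cost:.2f}")
```

The estimate ignores projections, feed-forward layers, and the encoder, so it only indicates the direction of the saving, not its size in a full model.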

Diffusion Models as Masked Audio-Video Learners
