Diffusion Models as Masked Audio-Video Learners

Elvis Nunez, Yanzi Jin, Mohammad Rastegari, Sachin Mehta, Maxwell Horton
2024-01-05
Abstract: Over the past several years, the synchronization between audio and visual signals has been leveraged to learn richer audio-visual representations. Aided by the large availability of unlabeled videos, many unsupervised training frameworks have demonstrated impressive results in various downstream audio and video tasks. Recently, Masked Audio-Video Learners (MAViL) has emerged as a state-of-the-art audio-video pre-training framework. MAViL couples contrastive learning with masked autoencoding to jointly reconstruct audio spectrograms and video frames by fusing information from both modalities. In this paper, we study the potential synergy between diffusion models and MAViL, seeking to derive mutual benefits from these two frameworks. The incorporation of diffusion into MAViL, combined with various training efficiency methodologies that include the utilization of a masking ratio curriculum and adaptive batch sizing, results in a notable 32% reduction in pre-training Floating-Point Operations (FLOPS) and an 18% decrease in pre-training wall clock time. Crucially, this enhanced efficiency does not compromise the model's performance in downstream audio-classification tasks when compared to MAViL's performance.
Sound, Computer Vision and Pattern Recognition, Multimedia, Audio and Speech Processing
What problem does this paper attempt to address?
The paper aims to make audio-video pre-training more efficient while maintaining or improving performance on downstream tasks. Specifically, it explores how to combine diffusion models with the existing Masked Audio-Video Learners (MAViL) framework to obtain a cheaper pre-training process. By introducing diffusion and adopting strategies such as a dynamic masking ratio and an adaptive batch size, the paper seeks to reduce both the floating-point operations (FLOPS) required for pre-training and the wall-clock training time, without sacrificing the model's performance on downstream audio classification tasks.

### Main Contributions

1. **Diffusion-enhanced Audio-video Pre-training**: The paper shows that diffusion-based masked audio-video pre-training learns rich audio-video representations without sacrificing performance on downstream audio classification tasks.
2. **Improving Pre-training Efficiency**: By using cross-attention in place of self-attention and by studying a masking ratio curriculum and a dynamic batch size, the paper reduces pre-training FLOPS by 32% and wall-clock training time by 18% while maintaining the model's accuracy.

### Experimental Results

- **Performance Comparison**: Experiments on multiple datasets show that DiffMAViL significantly improves training efficiency while maintaining performance comparable to that of MAViL.
- **Ablation Experiments**: The paper analyzes in detail how replacing self-attention with cross-attention, using a dynamic masking ratio, and introducing an adaptive batch size each affect pre-training efficiency.

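
The interaction between a masking ratio curriculum and an adaptive batch size can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the linear schedule, its endpoints (0.5 and 0.8), the token counts, and the rule of sizing the batch so that the number of visible (unmasked) tokens stays within a fixed budget. The function names `masking_ratio` and `adaptive_batch_size` are hypothetical.

```python
def masking_ratio(step: int, total_steps: int,
                  start: float = 0.5, end: float = 0.8) -> float:
    """Hypothetical linear curriculum: interpolate the ratio from start to end."""
    if total_steps <= 1:
        return end
    # Clamp progress to [0, 1] so out-of-range steps stay at the endpoints.
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return start + frac * (end - start)


def adaptive_batch_size(ratio: float, tokens_per_sample: int,
                        visible_token_budget: int) -> int:
    """Assumed rule: largest batch whose visible tokens fit a fixed budget."""
    # With a higher masking ratio, fewer tokens per sample reach the encoder.
    visible_per_sample = max(1, round(tokens_per_sample * (1.0 - ratio)))
    return max(1, visible_token_budget // visible_per_sample)


if __name__ == "__main__":
    total = 5  # illustrative number of curriculum steps
    for step in range(total):
        r = masking_ratio(step, total)
        b = adaptive_batch_size(r, tokens_per_sample=1000,
                                visible_token_budget=64_000)
        print(f"step={step} ratio={r:.3f} batch={b}")
```

Under this assumed rule, a higher masking ratio leaves fewer visible tokens per sample, so a larger batch fits in the same per-step token budget. The schedule and batch-sizing rule actually used in the paper may differ.
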
### Conclusion

By integrating diffusion techniques into the MAViL framework and combining multiple strategies to improve training efficiency, the paper achieves a significant improvement in pre-training efficiency while maintaining the model's performance on downstream tasks. This result matters for large-scale audio-video pre-training: it reduces the demand for computational resources and can accelerate the development and application of models.