Abstract: Video and audio content creation serves as the core technique for the movie industry and professional users. Recently, existing diffusion-based methods tackle video and audio generation separately, which hinders the technique transfer from academia to industry. In this work, we aim at filling the gap, with a carefully designed optimization-based framework for cross-visual-audio and joint-visual-audio generation. We observe the powerful generation ability of off-the-shelf video or audio generation models. Thus, instead of training the giant models from scratch, we propose to bridge the existing strong models with a shared latent representation space. Specifically, we propose a multimodality latent aligner with the pre-trained ImageBind model. Our latent aligner shares a similar core as the classifier guidance that guides the diffusion denoising process during inference time. Through carefully designed optimization strategy and loss functions, we show the superior performance of our method on joint video-audio generation, visual-steered audio generation, and audio-steered visual generation tasks. The project website can be found at

What problem does this paper attempt to address?

The paper primarily aims to address the following issues:

1. **Integration of Cross-Modal Generation Techniques**: Video generation and audio generation have each made progress separately, but no method effectively combines visual and auditory content, in particular for high-quality joint visual-audio generation in open-domain scenarios.
2. **Consistency and Relevance of Multimodal Content**: Content produced by existing single-modal generation models (such as text-to-video or text-to-audio) often lacks a consistent counterpart in the other modality. For example, a generated video may have no corresponding audio, or generated audio may lack synchronized visuals.
3. **Efficient Utilization of Pre-trained Models**: The paper leverages existing high-performance single-modal generation models and connects them through a novel alignment strategy, rather than training large models from scratch, thereby reducing resource consumption.

Specifically, the authors propose a new framework called the "Diffusion Latent Aligner," which can:

- Use a pre-trained multimodal model (such as ImageBind) as a bridge to connect different single-modal generation models in a shared semantic space.
- Handle multiple tasks: video-to-audio (V2A), image-to-audio (I2A), audio-to-video (A2V), and joint video-audio generation (Joint-VA).
- Improve the quality and consistency of generated content by progressively aligning the modalities through its optimization strategy and loss functions, without requiring large-scale training datasets.

The experimental section validates the effectiveness and superiority of the proposed method, including comparisons with baseline methods and performance evaluations on various tasks.
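The abstract likens the latent aligner to classifier guidance: during inference, a latent is nudged along the gradient of a loss measured in a shared embedding space. The toy sketch below illustrates only that idea and is not the authors' implementation. It assumes a fixed random linear map `W` as a hypothetical stand-in for the ImageBind encoder, a plain cosine-distance loss, and no diffusion model or denoising schedule at all; every name and parameter here (`cosine_loss_and_grad`, `align_latent`, `step_size`, `n_steps`) is invented for illustration.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def cosine_loss_and_grad(z, W, c):
    """Return 1 - cos(W z, c) and its gradient with respect to z.

    W is a hypothetical linear stand-in for a shared-space encoder,
    c is the embedding of the conditioning modality.
    """
    e = matvec(W, z)
    e_norm = math.sqrt(sum(v * v for v in e))
    c_norm = math.sqrt(sum(v * v for v in c))
    cos = sum(a * b for a, b in zip(e, c)) / (e_norm * c_norm)
    # Derivative of the cosine similarity with respect to e.
    dcos_de = [ci / (e_norm * c_norm) - cos * ei / e_norm ** 2
               for ei, ci in zip(e, c)]
    # Chain rule through e = W z; the loss is 1 - cos, hence the minus sign.
    grad = [-sum(W[i][j] * dcos_de[i] for i in range(len(W)))
            for j in range(len(z))]
    return 1.0 - cos, grad


def align_latent(z, W, c, step_size=0.1, n_steps=200):
    """Plain gradient descent on the latent (assumed update rule)."""
    z = list(z)
    for _ in range(n_steps):
        _, grad = cosine_loss_and_grad(z, W, c)
        z = [zi - step_size * gi for zi, gi in zip(z, grad)]
    return z


rng = random.Random(0)
W = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(8)]
z0 = [rng.gauss(0.0, 1.0) for _ in range(16)]   # toy "generated" latent
c = [rng.gauss(0.0, 1.0) for _ in range(8)]     # toy condition embedding

loss_before, _ = cosine_loss_and_grad(z0, W, c)
z1 = align_latent(z0, W, c)
loss_after, _ = cosine_loss_and_grad(z1, W, c)
print(f"loss before: {loss_before:.4f}, after: {loss_after:.4f}")
```

In a guidance-style sampler, an update of this kind would be interleaved with the denoising steps of a pre-trained generator rather than run in isolation as it is here; the sketch only shows that descending the shared-space distance pulls the latent's embedding toward the condition.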

Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners

AudioVSR: Enhancing Video Speech Recognition with Audio Data

MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation

Diff-Foley: Synchronized Video-to-Audio Synthesis with Latent Diffusion Models

Video-to-Audio Generation with Hidden Alignment

DiffAVA: Personalized Text-to-Audio Generation with Visual Alignment

Discriminator-Guided Cooperative Diffusion for Joint Audio and Video Generation

Sound2Vision: Generating Diverse Visuals from Audio through Cross-Modal Latent Alignment

A Simple but Strong Baseline for Sounding Video Generation: Effective Adaptation of Audio and Video Diffusion Models for Joint Generation

Taming Diffusion Models for Audio-Driven Co-Speech Gesture Generation

Video-to-Audio Generation with Fine-grained Temporal Semantics

Align, Adapt and Inject: Sound-guided Unified Image Generation

Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation

Auffusion: Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation

DiffTalk: Crafting Diffusion Models for Generalized Audio-Driven Portraits Animation

Language-Guided Joint Audio-Visual Editing via One-Shot Adaptation

Contrastive Conditional Latent Diffusion for Audio-visual Segmentation

MagDiff: Multi-Alignment Diffusion for High-Fidelity Video Generation and Editing

A Versatile Diffusion Transformer with Mixture of Noise Levels for Audiovisual Generation

STA-V2A: Video-to-Audio Generation with Semantic and Temporal Alignment