Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners

Yazhou Xing, Yingqing He, Zeyue Tian, Xintao Wang, Qifeng Chen
2024-02-28
Abstract: Video and audio content creation serves as the core technique for the movie industry and professional users. Recently, existing diffusion-based methods tackle video and audio generation separately, which hinders the technique transfer from academia to industry. In this work, we aim at filling the gap, with a carefully designed optimization-based framework for cross-visual-audio and joint-visual-audio generation. We observe the powerful generation ability of off-the-shelf video or audio generation models. Thus, instead of training the giant models from scratch, we propose to bridge the existing strong models with a shared latent representation space. Specifically, we propose a multimodality latent aligner with the pre-trained ImageBind model. Our latent aligner shares a similar core as the classifier guidance that guides the diffusion denoising process during inference time. Through carefully designed optimization strategy and loss functions, we show the superior performance of our method on joint video-audio generation, visual-steered audio generation, and audio-steered visual generation tasks. The project website can be found at
Computer Vision and Pattern Recognition, Multimedia, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
The paper primarily aims to address the following issues:

1. **Integration of Cross-Modal Generation Techniques**: Video generation and audio generation have each advanced separately, but no existing method effectively combines visual and auditory content, particularly for high-quality joint visual-audio generation in open-domain scenarios.
2. **Consistency and Relevance of Multimodal Content**: Single-modal generation models (such as text-to-video or text-to-audio) produce only one modality, so their outputs have no aligned counterpart in the other. For example, a generated video has no corresponding audio, and generated audio has no synchronized visuals.
3. **Efficient Utilization of Pre-trained Models**: Rather than training large models from scratch, the paper connects existing high-performance single-modal generation models through a novel alignment strategy, thereby reducing resource consumption.

Specifically, the authors propose a new framework called the "Diffusion Latent Aligner," which can:

- Use a pre-trained multimodal model (ImageBind) as a bridge that connects different single-modal generation models in a shared semantic space.
- Handle multiple tasks: video-to-audio (V2A), image-to-audio (I2A), audio-to-video (A2V), and joint video-audio generation (Joint-VA).
- Improve the quality and cross-modal consistency of generated content by progressively aligning the modalities through inference-time optimization strategies and loss functions, without requiring large-scale training datasets.

The experiments compare the method against baselines on these tasks and report superior performance.
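The alignment idea summarized above, in which a pre-trained shared embedding model steers a frozen generator during denoising, in the spirit of classifier guidance, can be illustrated with a toy sketch. This is not the authors' implementation. Everything in it is an assumption made for illustration: the linear encoder `W` stands in for ImageBind, `toy_denoise_step` stands in for one step of a pre-trained diffusion model, and the cosine distance, step size `lr`, and step count are arbitrary choices. It only shows the general pattern: at each denoising step, the latent is nudged along the gradient that reduces its embedding distance to the condition.

```python
import numpy as np

def cosine_distance_and_grad(W, z, c):
    """Return 1 - cos(W z, c) and its gradient with respect to z.

    W is a toy linear encoder (hypothetical stand-in for a shared
    multimodal embedding model); c is the condition embedding.
    """
    e = W @ z
    ne, nc = np.linalg.norm(e), np.linalg.norm(c)
    cos = float(e @ c) / (ne * nc)
    # d cos / d e = c / (|e||c|) - cos * e / |e|^2
    dcos_de = c / (ne * nc) - cos * e / (ne ** 2)
    # Chain rule through e = W z; minus sign because loss = 1 - cos.
    grad_z = -(W.T @ dcos_de)
    return 1.0 - cos, grad_z

def toy_denoise_step(z, t):
    """Hypothetical stand-in for one reverse step of a frozen generator."""
    return 0.99 * z

def guided_sampling(W, c, z_init, steps=30, lr=0.2, guide=True):
    """Run the toy denoising loop, optionally with alignment guidance."""
    z = z_init.copy()
    for t in range(steps, 0, -1):
        if guide:
            # Guidance: move the latent toward the condition in embedding space.
            _, g = cosine_distance_and_grad(W, z, c)
            z = z - lr * g
        z = toy_denoise_step(z, t)
    loss, _ = cosine_distance_and_grad(W, z, c)
    return z, loss

rng = np.random.default_rng(0)
# Toy encoder with orthonormal rows, mapping a 16-d latent to an 8-d embedding.
Q, _ = np.linalg.qr(rng.normal(size=(16, 8)))
W = Q.T
c = rng.normal(size=8)    # condition embedding (e.g. from the other modality)
z0 = rng.normal(size=16)  # initial noisy latent

_, loss_plain = guided_sampling(W, c, z0, guide=False)
_, loss_guided = guided_sampling(W, c, z0, guide=True)
print(f"embedding distance without guidance: {loss_plain:.4f}")
print(f"embedding distance with guidance:    {loss_guided:.4f}")
```

In this sketch the generator itself is never updated, only the latent is, which mirrors the summary's point that the single-modal models are reused rather than retrained. A real system would replace the linear encoder with a learned embedding model and obtain the gradient by automatic differentiation.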