Abstract: Text-guided image generation has progressed rapidly with the development of diffusion models. Beyond text and images, sound is a vital element of human perception: it offers vivid representations and naturally coincides with the scenes it accompanies. Exploiting sound is therefore a promising direction for image generation research. However, the relationship between audio and image supervision remains underexplored, and the scarcity of related, high-quality datasets is a further obstacle. In this paper, we propose a unified framework, 'Align, Adapt, and Inject' (AAI), for sound-guided image generation, editing, and stylization. Our method adapts an input sound into a sound token that behaves like an ordinary word and can be used in a plug-and-play fashion with existing powerful diffusion-based Text-to-Image (T2I) models. Specifically, we first train a multi-modal encoder to align the audio representation with the pre-trained textual manifold and the pre-trained visual manifold, respectively. We then propose an audio adapter that converts the audio representation into an audio token enriched with specific semantics, which can be flexibly injected into a frozen T2I model. In this way, we extract the dynamic information of varied sounds while relying on the capability of existing T2I models to perform sound-guided image generation, editing, and stylization in a convenient and cost-effective manner. Experimental results confirm that the proposed AAI outperforms other state-of-the-art text-guided and sound-guided methods. Our aligned multi-modal encoder is also competitive with other approaches on audio-visual retrieval and audio-text retrieval tasks.
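The "adapt, then inject" idea can be illustrated with a toy sketch. It is not the authors' implementation: the function names (`make_adapter`, `adapt`, `inject`), the `<sound>` placeholder, the single linear layer, and the random weights are all assumptions made for illustration. In the described framework the audio embedding would come from the aligned multi-modal encoder, the adapter would be trained, and the resulting embedding sequence would condition a frozen T2I model.

```python
import random


def make_adapter(audio_dim, token_dim, seed=0):
    """Build a hypothetical linear adapter (random weights stand in for learned ones)."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 0.02) for _ in range(audio_dim)]
               for _ in range(token_dim)]
    bias = [0.0] * token_dim
    return weights, bias


def adapt(audio_embedding, adapter):
    """Project an audio embedding into the token-embedding space of the text encoder."""
    weights, bias = adapter
    return [sum(w * x for w, x in zip(row, audio_embedding)) + b
            for row, b in zip(weights, bias)]


def inject(prompt_tokens, token_embeddings, audio_token, placeholder="<sound>"):
    """Replace the placeholder's embedding with the audio token; other tokens are unchanged."""
    if placeholder not in prompt_tokens:
        raise ValueError("prompt has no placeholder for the sound token")
    return [list(audio_token) if tok == placeholder else list(emb)
            for tok, emb in zip(prompt_tokens, token_embeddings)]


# Toy usage: a 3-dimensional audio embedding and 4-dimensional token embeddings.
prompt_tokens = ["a", "photo", "of", "<sound>"]
token_embeddings = [[0.0] * 4 for _ in prompt_tokens]  # stand-in for text embeddings
audio_embedding = [0.1, 0.2, 0.3]                      # stand-in for encoder output

adapter = make_adapter(audio_dim=3, token_dim=4)
audio_token = adapt(audio_embedding, adapter)
conditioned = inject(prompt_tokens, token_embeddings, audio_token)
```

Because only the placeholder position changes, the rest of the prompt keeps its ordinary text embeddings, which is what lets a sound token sit alongside words in one conditioning sequence without modifying the generator.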
