Abstract: Music is a universal language that can communicate emotions and feelings. It forms an essential part of the whole spectrum of creative media, ranging from movies to social media posts. Machine learning models that can synthesize music are predominantly conditioned on textual descriptions of it. Inspired by how musicians compose music not just from a movie script, but also through visualizations, we propose MeLFusion, a model that can effectively use cues from a textual description and the corresponding image to synthesize music. MeLFusion is a text-to-music diffusion model with a novel "visual synapse", which effectively infuses the semantics from the visual modality into the generated music. To facilitate research in this area, we introduce a new dataset MeLBench, and propose a new evaluation metric IMSM. Our exhaustive experimental evaluation suggests that adding visual information to the music synthesis pipeline significantly improves the quality of generated music, measured both objectively and subjectively, with a relative gain of up to 67.98% on the FAD score. We hope that our work will gather attention to this pragmatic, yet relatively under-explored research area.

What problem does this paper attempt to address?

The paper addresses how to synthesize music from an image together with a text prompt. Existing music generation models are conditioned mainly on text descriptions and ignore the visual information that an image carries. The paper proposes MeLFusion, which uses the semantic information of both text and images to generate higher-quality, more context-appropriate music.

### Main problems:

1. **Limitations of existing methods**: Most current music generation models are conditioned only on text descriptions and ignore the visual information in images.
2. **Need for multimodal fusion**: Capturing the emotion and atmosphere of a scene calls for a method that combines image and text information.
3. **Lack of datasets and evaluation criteria**: There are few datasets that pair images, text, and music, and no established metrics for evaluating this kind of multimodal generation task.

### Solutions:

- **MeLFusion model**: The model introduces a "visual synapse" mechanism that incorporates the semantic information of images into the music generation process.
- **MeLBench dataset**: To support research, the authors created a new dataset, MeLBench, which contains 11,250 image-text-music triplets.
- **IMSM evaluation metric**: A new evaluation metric, IMSM (Image Music Similarity Metric), measures the consistency between the generated music and the image.

### Main contributions of the paper:

1. **Task definition**: Formally defines a new task of generating music from a combination of images and text.
2. **Innovative model**: Proposes MeLFusion, a diffusion model conditioned on both images and text.
3. **New dataset**: Releases the MeLBench dataset of image-text-music triplets.
4. **New evaluation metric**: Introduces IMSM for quantitatively evaluating the consistency between images and music.
5. **Experimental results**: In extensive experiments, MeLFusion significantly outperforms existing methods in both subjective and objective evaluations, with a relative gain of up to 67.98% on the FAD score.

### Formula summary:

- **Attention mechanism formula**:
  \[ \text{Attention}(Q, K, V) = \text{Softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \]
- **Forward diffusion process** (single step):
  \[ q(z_M^t | z_M^{t - 1}) = \mathcal{N}(z_M^t; \sqrt{1 - \beta_t} z_M^{t - 1}, \beta_t I) \]
- **Forward diffusion process** (closed form, sampling step \(t\) directly from the initial latent \(z_M^1\)):
  \[ q(z_M^t | z_M^1) = \mathcal{N}(z_M^t; \sqrt{\bar{\gamma}_t} z_M^1, (1 - \bar{\gamma}_t)I) \]
  where \(\bar{\gamma}_t=\prod_{r = 0}^t\gamma_r\), \(\gamma_t = 1-\beta_t\).
- **Feature fusion formula**:
  \[ K_M^l=\alpha_l K_I^l+(1 - \alpha_l)K_M^l \]
  \[ V_M^l=\alpha_l V_I^l+(1 - \alpha_l)V_M^l \]
  These are in-place updates: the \(K_M^l\) and \(V_M^l\) on the right-hand side are the music-branch keys and values before fusion, \(K_I^l\) and \(V_I^l\) are their image-branch counterparts, and \(\alpha_l\) is the mixing weight at layer \(l\).

Through these components, MeLFusion draws on the complementary information in images and text, producing higher-quality music that better fits the context.
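
The formulas in the summary above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`attention`, `fuse`, `forward_noise`), the tensor shapes, the value of `alpha`, and the linear `betas` schedule are all hypothetical, and the image and music keys/values are assumed to already share the same shape so that they can be mixed element-wise.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    # Scaled dot-product attention: Softmax(Q K^T / sqrt(d_k)) V.
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ v


def fuse(image_feat, music_feat, alpha):
    # Convex mix of image and music features:
    # alpha * K_I + (1 - alpha) * K_M (the same rule is used for V).
    return alpha * image_feat + (1.0 - alpha) * music_feat


def forward_noise(z_start, betas, t, rng):
    # Closed-form forward step: sample z^t directly from the initial latent.
    # gamma_bar_t is the product of (1 - beta_r) for r = 0..t.
    gamma_bar = np.prod(1.0 - betas[: t + 1])
    noise = rng.standard_normal(z_start.shape)
    z_t = np.sqrt(gamma_bar) * z_start + np.sqrt(1.0 - gamma_bar) * noise
    return z_t, gamma_bar


# Toy data; all sizes and values below are hypothetical.
rng = np.random.default_rng(0)
n_tokens, d = 4, 8
q = rng.standard_normal((n_tokens, d))
k_music = rng.standard_normal((n_tokens, d))
v_music = rng.standard_normal((n_tokens, d))
k_image = rng.standard_normal((n_tokens, d))
v_image = rng.standard_normal((n_tokens, d))

alpha = 0.3  # hypothetical mixing weight for one layer
k_fused = fuse(k_image, k_music, alpha)
v_fused = fuse(v_image, v_music, alpha)
out = attention(q, k_fused, v_fused)

betas = np.linspace(1e-4, 0.02, 1000)  # hypothetical noise schedule
z_start = rng.standard_normal((n_tokens, d))
z_t, gamma_bar = forward_noise(z_start, betas, t=500, rng=rng)
```

With `alpha = 0` the fused keys and values reduce to the music-only ones, and with `alpha = 1` attention reads purely from the image features, so in this sketch the weight controls how much visual information enters each layer.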

MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models
