MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models

Sanjoy Chowdhury, Sayan Nag, K J Joseph, Balaji Vasan Srinivasan, Dinesh Manocha
2024-06-07
Abstract: Music is a universal language that can communicate emotions and feelings. It forms an essential part of the whole spectrum of creative media, ranging from movies to social media posts. Machine learning models that can synthesize music are predominantly conditioned on textual descriptions of it. Inspired by how musicians compose music not just from a movie script, but also through visualizations, we propose MeLFusion, a model that can effectively use cues from a textual description and the corresponding image to synthesize music. MeLFusion is a text-to-music diffusion model with a novel "visual synapse", which effectively infuses the semantics from the visual modality into the generated music. To facilitate research in this area, we introduce a new dataset MeLBench, and propose a new evaluation metric IMSM. Our exhaustive experimental evaluation suggests that adding visual information to the music synthesis pipeline significantly improves the quality of generated music, measured both objectively and subjectively, with a relative gain of up to 67.98% on the FAD score. We hope that our work will gather attention to this pragmatic, yet relatively under-explored research area.
Computer Vision and Pattern Recognition, Artificial Intelligence, Multimedia, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses how to synthesize music from a combination of image and text prompts. Existing music generation models are conditioned mainly on text descriptions and ignore the rich visual information in images. The paper proposes MeLFusion, a method that uses the semantic information of both text and images to generate higher-quality, more context-appropriate music.

### Main problems:

1. **Limitations of existing methods**: Most current music generation models are conditioned only on text descriptions and ignore the visual information in images.
2. **Need for multimodal fusion**: Capturing the emotion and atmosphere of a scene requires a method that combines image and text information.
3. **Lack of datasets and evaluation criteria**: There is a lack of datasets that pair images, text, and music, and of effective metrics for evaluating this kind of multimodal generation task.

### Solutions:

- **MeLFusion model**: The model introduces a "visual synapse" mechanism that incorporates the semantic information of images into the music generation process.
- **MeLBench dataset**: To promote research, the authors created a new dataset, MeLBench, which contains 11,250 image-text-music triplets.
- **IMSM evaluation metric**: A new evaluation metric, IMSM (Image Music Similarity Metric), measures the consistency between the generated music and the image.

### Main contributions of the paper:

1. **Task definition**: Formally defines a new task of generating music from combined image and text inputs.
2. **Innovative model**: Proposes MeLFusion, a diffusion model conditioned on both images and text.
3. **New dataset**: Releases the MeLBench dataset of image-text-music triplets.
4. **New evaluation metric**: Introduces IMSM for quantitatively evaluating the consistency between images and music.
5. **Experimental results**: In extensive experiments, MeLFusion significantly outperforms existing methods in both subjective and objective evaluations, with a relative improvement of up to 67.98% in the FAD score.

### Formula summary:

- **Attention mechanism**:
\[ \text{Attention}(Q, K, V) = \text{Softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \]
- **Forward diffusion process (single step)**:
\[ q(z_M^t \mid z_M^{t-1}) = \mathcal{N}\left(z_M^t; \sqrt{1 - \beta_t}\, z_M^{t-1}, \beta_t I\right) \]
- **Forward diffusion process (closed form)**: composing the single steps gives the noised latent at step \(t\) directly from the initial latent \(z_M^1\). This is still a forward (noising) distribution, not the learned reverse process:
\[ q(z_M^t \mid z_M^1) = \mathcal{N}\left(z_M^t; \sqrt{\bar{\gamma}_t}\, z_M^1, (1 - \bar{\gamma}_t) I\right) \]
where \(\bar{\gamma}_t = \prod_{r=0}^{t} \gamma_r\) and \(\gamma_t = 1 - \beta_t\).
- **Feature fusion**: the music-branch keys and values are overwritten with a blend of image and music features, weighted by \(\alpha_l\):
\[ K_M^l = \alpha_l K_I^l + (1 - \alpha_l) K_M^l \]
\[ V_M^l = \alpha_l V_I^l + (1 - \alpha_l) V_M^l \]

Through these innovations, MeLFusion can better exploit the complementary information in images and text, generating higher-quality music that fits the context more closely.
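The two Gaussian expressions for \(q\) above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the linear \(\beta\) schedule, its endpoints, and the function names `make_schedule`, `forward_step`, and `forward_jump` are assumptions made for the example, and the arrays are zero-indexed rather than following the summary's step indexing.

```python
import numpy as np


def make_schedule(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Return (betas, gamma_bars) for a linear beta schedule.

    The linear schedule and its endpoints are assumptions for this sketch;
    the summary does not say which schedule the paper uses.
    """
    betas = np.linspace(beta_start, beta_end, num_steps)
    gammas = 1.0 - betas              # gamma_t = 1 - beta_t
    gamma_bars = np.cumprod(gammas)   # running product of gamma_r up to step t
    return betas, gamma_bars


def forward_step(z_prev, beta_t, rng):
    """Sample z^t ~ N(sqrt(1 - beta_t) * z^{t-1}, beta_t * I)."""
    noise = rng.standard_normal(z_prev.shape)
    return np.sqrt(1.0 - beta_t) * z_prev + np.sqrt(beta_t) * noise


def forward_jump(z_start, gamma_bar_t, rng):
    """Sample z^t directly from the initial latent:
    z^t ~ N(sqrt(gamma_bar_t) * z_start, (1 - gamma_bar_t) * I)."""
    noise = rng.standard_normal(z_start.shape)
    return np.sqrt(gamma_bar_t) * z_start + np.sqrt(1.0 - gamma_bar_t) * noise


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    betas, gamma_bars = make_schedule(num_steps=1000)
    z_start = rng.standard_normal((4, 16))   # stand-in for a music latent

    # Step-by-step noising and the closed-form jump target the same distribution.
    z = z_start
    for beta_t in betas:
        z = forward_step(z, beta_t, rng)
    z_jump = forward_jump(z_start, gamma_bars[-1], rng)
    print(z.shape, z_jump.shape, float(gamma_bars[-1]))
```

The closed-form jump is what makes training practical: a noised latent at any step can be drawn in one call instead of iterating through every earlier step.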
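The attention and feature fusion formulas above combine naturally: blend the image keys and values into the music keys and values, then run ordinary scaled dot-product attention. The sketch below illustrates that arithmetic only. It assumes a single head, a scalar \(\alpha_l\), and image and music key/value tensors of identical shape; the names `fuse` and `synapse_attention` are hypothetical and do not come from the paper's code.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    """Scaled dot-product attention: Softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ v


def fuse(music, image, alpha):
    """Blend image features into music features: alpha * image + (1 - alpha) * music."""
    return alpha * image + (1.0 - alpha) * music


def synapse_attention(q_m, k_m, v_m, k_i, v_i, alpha):
    """Attention over fused keys/values (hypothetical helper).

    Assumes k_i and v_i have the same shapes as k_m and v_m, which the
    element-wise blend in the fusion formula requires.
    """
    k = fuse(k_m, k_i, alpha)
    v = fuse(v_m, v_i, alpha)
    return attention(q_m, k, v)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q_m = rng.standard_normal((4, 8))   # music queries
    k_m = rng.standard_normal((6, 8))   # music keys
    v_m = rng.standard_normal((6, 8))   # music values
    k_i = rng.standard_normal((6, 8))   # image keys
    v_i = rng.standard_normal((6, 8))   # image values
    out = synapse_attention(q_m, k_m, v_m, k_i, v_i, alpha=0.3)
    print(out.shape)
```

With `alpha=0.0` the helper reduces to plain attention over the music features, and with `alpha=1.0` it attends over the image features alone, so \(\alpha_l\) acts as a dial for how strongly the image steers the result.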