Abstract: Image-to-music generation aims to generate realistic pure music from a given image. Although many previous works have sought to bridge image and music, they mainly focus on content-based cross-modal matching, for example, matching a Christmas song to an image that contains a Christmas tree. By comparison, image-to-music generation is a more challenging task because of its ambiguity and subjectivity. Specifically, without lyrics or human voice, there is no explicit correlation between image content and musical melody. Moreover, the perception of generated music varies from person to person. Inspired by the phenomenon of synesthesia, we argue that if an image tends to elicit a certain emotion in viewers, the generated music should leave a similar impression. Therefore, in this paper, we propose a continuous emotion-based image-to-music generation framework, which uses emotion as the key for cross-modal generation. Specifically, we establish a new image-music dataset that uses the valence-arousal (VA) space to capture the complex and nuanced nature of emotions. We then propose a plug-and-play model that translates an image into a piece of music with similar emotion; it projects emotions into continuous-valued labels and explores both intra-modal and inter-modal emotional consistency with contrastive learning. To the best of our knowledge, this is the first end-to-end framework for pure music generation from natural images. Extensive experiments show that the generated music achieves satisfactory emotional consistency with the input images, as well as impressive quality.
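To make the idea of contrastive learning with continuous-valued emotion labels concrete, the following is a minimal sketch, not the loss used in the paper. It assumes each sample has an embedding and a 2-D valence-arousal label, and it replaces the usual one-hot contrastive target with a soft target derived from a Gaussian kernel over VA distance. The function names, the Gaussian kernel, and the `temperature` and `sigma` parameters are all illustrative assumptions.

```python
import math

def normalize(v):
    # Scale a vector to unit length (zero vectors are left unchanged).
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def log_softmax(row):
    # Numerically stable log-softmax over one list of scores.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def va_soft_targets(va_a, va_b, sigma=0.5):
    # Assumed target: pairs that are close in VA space get more weight.
    # Each row is a probability distribution over the items in va_b.
    targets = []
    for a in va_a:
        scores = [
            -((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) / (2 * sigma ** 2)
            for b in va_b
        ]
        targets.append([math.exp(s) for s in log_softmax(scores)])
    return targets

def va_contrastive_loss(emb_a, emb_b, va_a, va_b, temperature=0.1, sigma=0.5):
    # Cross-entropy between the embedding-similarity distribution
    # and the VA-derived soft targets, averaged over items in emb_a.
    a = [normalize(v) for v in emb_a]
    b = [normalize(v) for v in emb_b]
    targets = va_soft_targets(va_a, va_b, sigma)
    total = 0.0
    for ai, t_row in zip(a, targets):
        logits = [sum(x * y for x, y in zip(ai, bj)) / temperature for bj in b]
        total -= sum(t * lp for t, lp in zip(t_row, log_softmax(logits)))
    return total / len(a)
```

Under these assumptions, an inter-modal term would be `va_contrastive_loss(image_emb, music_emb, image_va, music_va)`, and an intra-modal term would pass the same modality on both sides, for example `va_contrastive_loss(image_emb, image_emb, image_va, image_va)`.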

Emotion-Aligned Contrastive Learning Between Images and Music

Image–Music Synesthesia-Aware Learning Based on Emotional Similarity Recognition

Emotion-Driven Chinese Folk Music-Image Retrieval Based on De-Svm

Human-centric Music Medical Therapy Exploration System

Semi-Supervised Contrastive Learning for Controllable Video-to-Music Retrieval

Emotion Embedding Spaces for Matching Music to Stories

Learning Affective Correspondence between Music and Image

Emotion-Based End-to-End Matching Between Image and Music in Valence-Arousal Space

Study on Linguistic Computing for Music Emotion

Emotion-Guided Image to Music Generation

Continuous Emotion-Based Image-to-Music Generation

Fine-grained Sentiment Semantic Analysis and Matching of Music and Image

Music recommendation based on affective image content analysis

EMID: An Emotional Aligned Dataset in Audio-Visual Modality

Bridging Paintings and Music -- Exploring Emotion based Music Generation through Paintings

Expressivity-aware Music Performance Retrieval using Mid-level Perceptual Features and Emotion Word Embeddings

Contrastive Learning for Cross-modal Artist Retrieval

Joint Learning of Emotions in Music and Generalized Sounds

Enhancing Affective Representations of Music-Induced EEG through Multimodal Supervision and latent Domain Adaptation

Contrastive Audio-Language Learning for Music

Emotion Based Image Musicalization