Abstract: Generating music whose emotion matches that of an input video is a timely problem. Video content creators and automatic movie directors benefit from keeping their viewers engaged, which can be facilitated by producing novel material that elicits stronger emotions in them. Moreover, there is currently a demand for more empathetic computers to aid humans in applications such as augmenting the perception ability of visually and/or hearing impaired people. Current approaches overlook the video's emotional characteristics in the music generation step, consider only static images instead of videos, are unable to generate novel music, and require a high level of human effort and skill. In this study, we propose a novel hybrid deep neural network that uses an Adaptive Neuro-Fuzzy Inference System to predict a video's emotion from its visual features and a deep Long Short-Term Memory Recurrent Neural Network to generate corresponding audio signals with a similar emotional character. The former can model emotions appropriately because of its fuzzy properties, and the latter can model data with dynamic time properties well because it has access to the previous hidden state information. The novelty of our proposed method lies in extracting visual emotional features and transforming them into audio signals with corresponding emotional aspects for users. Quantitative experiments show low mean absolute errors of 0.217 and 0.255 on the Lindsey and DEAP datasets, respectively, and similar global features in the spectrograms. This indicates that our model is able to appropriately perform domain transformation between visual and audio features. Based on the experimental results, our model can effectively generate audio that matches the scene and elicits a similar emotion from the viewer on both datasets, and music generated by our model is also chosen more often.
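For readers unfamiliar with the metric behind the reported 0.217 and 0.255 figures, the following is a minimal sketch of how a mean absolute error is computed. The function name and the toy values are illustrative assumptions and are not taken from the paper's experiments or code.

```python
def mean_absolute_error(predicted, target):
    """Return the average of |predicted_i - target_i| over paired values."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    if not predicted:
        raise ValueError("inputs must not be empty")
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(predicted)


# Hypothetical toy values, not data from the Lindsey or DEAP datasets.
predicted = [0.2, 0.5, 0.9]
target = [0.0, 0.5, 1.0]
mae = mean_absolute_error(predicted, target)
print(f"MAE = {mae:.3f}")
```

A lower value means the generated features lie closer, on average, to the reference features.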

DEMV-matchmaker: Emotional temporal course representation and deep similarity matching for automatic music video generation

SongDriver2: Real-time Emotion-based Music Arrangement with Soft Transition

An Efficient Multimodal Framework for Large Scale Emotion Recognition by Fusing Music and Electrodermal Activity Signals

Emotion-Driven Chinese Folk Music-Image Retrieval Based on De-Svm

Visual-Texual Emotion Analysis with Deep Coupled Video and Danmu Neural Networks

Automatic Music Video Generation Based on Simultaneous Soundtrack Recommendation and Video Editing

Emotion-Based End-to-End Matching Between Image and Music in Valence-Arousal Space

An Audiovisual Correlation Matching Method Based on Fine-Grained Emotion and Feature Fusion

VMAS: Video-to-Music Generation via Semantic Alignment in Web Music Videos

Video2Music: Suitable Music Generation from Videos using an Affective Multimodal Transformer model

Affective Visualization and Retrieval for Music Video

EMID: An Emotional Aligned Dataset in Audio-Visual Modality

Emotional Video to Audio Transformation Using Deep Recurrent Neural Networks and a Neuro-Fuzzy System

VidMusician: Video-to-Music Generation with Semantic-Rhythmic Alignment via Hierarchical Visual Features

MuVi: Video-to-Music Generation with Semantic Alignment and Rhythmic Synchronization

A Deep Bidirectional Long Short-Term Memory Based Multi-Scale Approach for Music Dynamic Emotion Prediction

Emotional Video Captioning With Vision-Based Emotion Interpretation Network

Music-induced emotion flow modeling by ENMI Network

Video Echoed in Harmony: Learning and Sampling Video-Integrated Chord Progression Sequences for Controllable Video Background Music Generation

ADFF: Attention Based Deep Feature Fusion Approach for Music Emotion Recognition