Abstract: Digital video is widely used to record daily life and share moods, but little research has examined whether short videos and their background music express consistent emotions. To match suitable background music to short-video imagery automatically and efficiently, the paper analyzed the emotional connection between the two from the perspective of audio-visual synesthesia. First, emotional semantics was used as a bridge between video data and music data, and a video-music synesthesia dataset based on semantic words was constructed. Then, an attention mechanism was incorporated to better extract key features from video images. Music features were extracted with an improved LeNet-5 network, whose optimal parameters were determined experimentally. Finally, the two types of features were fused and mutual retrieval between video and music was performed. To compare model performance, several CNN models (VGG16, VGG19, AlexNet, and GoogLeNet) were evaluated for video image processing, and the attention mechanism was added to each network to compare retrieval accuracy. For music data, different CNN algorithms were likewise compared, and networks with different numbers of layers were tested to determine the best-performing configuration. The experimental results show that the emotion-based audio-visual synesthesia retrieval model can effectively measure the emotional similarity between video images and music, and that the proposed method produces a good match between them. The work is an exploration of computer synesthetic intelligence that can inspire designers who create images and music. It enhances the emotional experience of digital products while also improving the efficiency and quality of their development.
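
The mutual retrieval step described in the abstract can be illustrated with a minimal sketch. The abstract does not state which similarity measure the paper uses, so cosine similarity is an assumption here, and the three-dimensional embeddings, track names, and the `retrieve` helper are hypothetical placeholders rather than the paper's features or code.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, candidates, top_k=1):
    """Rank candidate embeddings by similarity to the query and return the top_k."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in candidates.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Hypothetical emotion embeddings; a real system would take these from the
# video and music feature extractors, which are not reproduced here.
video_clip = [0.9, 0.1, 0.0]
music_library = {
    "upbeat_track": [0.8, 0.2, 0.1],
    "somber_track": [0.1, 0.9, 0.2],
    "tense_track": [0.0, 0.2, 0.9],
}

# Video-to-music retrieval: find the track closest to the clip's embedding.
best_match = retrieve(video_clip, music_library, top_k=1)
print(best_match[0][0])
```

Because the ranking function is symmetric in its inputs, the same `retrieve` call could serve music-to-video retrieval by passing a music embedding as the query and a dictionary of video embeddings as the candidates.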

WeaveNet: End-to-End Audiovisual Sentiment Analysis

Audio Sentiment Analysis by Heterogeneous Signal Features Learned from Utterance-Based Parallel Neural Network.

Utterance-Based Audio Sentiment Analysis Learned by a Parallel Combination of CNN and LSTM.

Sentiment Analysis Using Deep Robust Complementary Fusion of Multi-Features and Multi-Modalities.

Learning Robust Heterogeneous Signal Features from Parallel Neural Network for Audio Sentiment Analysis

An End-to-End Visual-Audio Attention Network for Emotion Recognition in User-Generated Videos

A Multimodal Sentiment Analysis Approach Based on a Joint Chained Interactive Attention Mechanism

Visual-Audio Emotion Recognition Based on Multi-Task and Ensemble Learning with Multiple Features

Video Sentiment Analysis with Bimodal Information-augmented Multi-Head Attention

Audio-Visual Fusion Network Based on Conformer for Multimodal Emotion Recognition

Visual-Texual Emotion Analysis with Deep Coupled Video and Danmu Neural Networks

MATF: main-auxiliary transformer fusion for multi-modal sentiment analysis

CTHFNet: contrastive translation and hierarchical fusion network for text–video–audio sentiment analysis

Towards Emotion Analysis in Short-form Videos: A Large-Scale Dataset and Baseline

MultiSentiNet: A Deep Semantic Network for Multimodal Sentiment Analysis

FV2ES: A Fully End2End Multimodal System for Fast Yet Effective Video Emotion Recognition Inference

Research on Emotional Semantic Retrieval of Attention Mechanism Oriented to Audio-visual Synesthesia

SKEAFN: Sentiment Knowledge Enhanced Attention Fusion Network for multimodal sentiment analysis

TEDT: Transformer-Based Encoding–Decoding Translation Network for Multimodal Sentiment Analysis

Audio-video Emotion Recognition in the Wild using Deep Hybrid Networks

Emotional Video Captioning With Vision-Based Emotion Interpretation Network