Abstract: Many popular video quality assessment (VQA) methods build models by simulating human visual perception and adopt a simple regression strategy to predict video quality scores. However, these methods either pay too little attention to the regression stage, which is prone to misprediction, or fail to accurately understand video content containing changing or sudden movements. To remedy these issues, we propose a full-reference (FR) video quality assessment model that integrates multi-task learning regression with the analysis of spatio-temporal features to predict video quality. First, the model divides each frame of the reference and distorted videos into patches and calculates their entropy values to guide the selection of frame patches. A 2D Siamese network is then applied to the selected patches to learn spatial information. To capture temporal distortions more effectively, a multi-frame difference map is computed for each distorted video. The multi-frame difference maps are likewise partitioned into patches, and the half with the highest entropy values is selected as temporal features. Additionally, we incorporate the temporal masking effect to optimize the spatial error and temporal features, and adopt a 3D convolutional neural network (CNN) for spatio-temporal feature fusion. Following recent evidence on quality classification and quality regression, a constrained multi-task learning regression model is designed to aggregate the quality score, using a quality classification subtask to constrain and optimize the main quality regression task. Finally, the video quality score is predicted through the regression branch. We evaluated our algorithm on five public VQA databases. The experimental results show that the proposed algorithm achieves superior performance compared with existing VQA methods.
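
The entropy-guided patch selection described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the 32-pixel patch size, the 256-bin grayscale histogram, and the use of Shannon entropy in bits are all assumptions made for illustration. Only the idea of keeping the half of the patches with the highest entropy comes from the abstract, which states that ratio for the temporal difference maps.

```python
import numpy as np

def patch_entropy(patch, bins=256):
    """Shannon entropy (bits) of an 8-bit grayscale patch (assumed measure)."""
    hist, _ = np.histogram(patch, bins=bins, range=(0, 256))
    p = hist[hist > 0] / patch.size          # probabilities of non-empty bins
    return float((p * np.log2(1.0 / p)).sum())

def split_into_patches(frame, size):
    """Cut a 2D frame into non-overlapping size x size patches (hypothetical helper)."""
    h, w = frame.shape
    return [
        frame[r:r + size, c:c + size]
        for r in range(0, h - size + 1, size)
        for c in range(0, w - size + 1, size)
    ]

def select_high_entropy_patches(frame, size=32, keep_ratio=0.5):
    """Keep the fraction of patches with the highest entropy.

    keep_ratio=0.5 mirrors the 'half of the patches' mentioned in the abstract;
    the patch size is an arbitrary illustrative choice.
    """
    patches = split_into_patches(frame, size)
    scores = np.array([patch_entropy(p) for p in patches])
    n_keep = max(1, int(len(patches) * keep_ratio))
    order = np.argsort(scores)[::-1][:n_keep]  # indices sorted by descending entropy
    return [patches[i] for i in order], scores[order]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame = np.zeros((64, 64), dtype=np.uint8)   # flat left half: zero entropy
    frame[:, 32:] = rng.integers(0, 256, size=(64, 32), dtype=np.uint8)
    selected, scores = select_high_entropy_patches(frame)
    print(len(selected), scores)
```

In this toy frame the two textured patches on the right are kept and the two flat patches on the left are discarded. The same routine could in principle be applied to a frame-difference map instead of a raw frame, although the abstract does not specify how the multi-frame difference map is computed.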

Multimodal Deep Denoise Framework for Affective Video Content Analysis.

Affective Video Content Analysis Via Multimodal Deep Quality Embedding Network

Sentiment Analysis Using Deep Robust Complementary Fusion of Multi-Features and Multi-Modalities.

A Multimodal Deep Regression Bayesian Network For Affective Video Content Analyses

Denoising Bottleneck with Mutual Information Maximization for Video Multimodal Fusion

Multimodal Utterance-level Affect Analysis using Visual, Audio and Text Features

Detail-Enhanced Intra- and Inter-modal Interaction for Audio-Visual Emotion Recognition

Multimodal Local-Global Attention Network for Affective Video Content Analysis

Visual-Texual Emotion Analysis with Deep Coupled Video and Danmu Neural Networks

Adaptive Deep Metric Learning for Affective Image Retrieval and Classification

A Multimodal Sentiment Analysis Approach Based on a Joint Chained Interactive Attention Mechanism

D2SP: Dynamic Dual-Stage Purification Framework for Dual Noise Mitigation in Vision-based Affective Recognition

Fudan-Huawei at MediaEval 2015: Detecting Violent Scenes and Affective Impact in Movies with Deep Learning.

Deep Sentiment Features of Context and Faces for Affective Video Analysis

Temporal Enhancement for Video Affective Content Analysis

A Multi-term and Multi-task Analyzing Framework for Affective Analysis in-the-wild

EALD-MLLM: Emotion Analysis in Long-sequential and De-identity videos with Multi-modal Large Language Model

Data-Driven Affective Filtering for Images and Videos

Deep Video Quality Assessment Using Constrained Multi-Task Regression and Spatio-temporal Feature Fusion.

Multimodal Affective Analysis Using Hierarchical Attention Strategy with Word-Level Alignment

Denoising Diffusion-Augmented Hybrid Video Anomaly Detection Via Reconstructing Noised Frames