Abstract: We introduce VEMOCLAP: Video EMOtion Classifier using Pretrained features, the first readily available and open-source web application that analyzes the emotional content of any user-provided video. We improve our previous work, which exploits open-source pretrained models that work on video frames and audio, and then efficiently fuses the resulting pretrained features using multi-head cross-attention. Our approach increases the state-of-the-art classification accuracy on the Ekman-6 video emotion dataset by 4.3% and offers an online application for users to run our model on their own videos or YouTube videos. We invite the readers to try our application at http://serkansulun.com/app.

What problem does this paper attempt to address?

The problem this paper addresses is video emotion classification. Specifically, the authors have developed a web application named VEMOCLAP that analyzes the emotional content of any video provided by users. The main problems and goals of this research are:

1. **Improve the accuracy of video emotion classification**:
   - The authors improve on their previous work by using open-source pretrained models to extract video-frame and audio features and by fusing these features efficiently with multi-head cross-attention.
   - On the Ekman-6 video emotion dataset, their method increases the state-of-the-art classification accuracy by 4.3%.
2. **Provide an easy-to-use online application**:
   - VEMOCLAP is an open-source, easy-to-use web application. Users can run the model on a video they upload or on a YouTube link to analyze the emotions in the video.
   - The application outputs the predicted emotion labels and also provides additional analyses: automatic speech recognition (ASR), optical character recognition (OCR), face detection and expression classification, audio classification, and image caption generation.
3. **Dataset cleaning**:
   - The authors examined and cleaned the Ekman-6 dataset, removing problematic samples to improve the performance of the trained model.
   - Using the cleaned dataset increases classification accuracy by 2.6%, but for a fair comparison the authors exclude this gain from their reported results.
4. **Multi-modal feature fusion**:
   - Features extracted from different pretrained models differ in dimension and in temporal length. The authors handle this by normalizing the features, projecting them to a common dimension, and then fusing them with a multi-head attention module.
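
The fusion step described above (normalize, project to a common dimension, then fuse with multi-head cross-attention) can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the feature shapes (20 frame vectors of size 512, 7 audio vectors of size 768), the common dimension of 64, the four heads, the random untrained weights, and the choice of frame features as queries with audio features as keys and values are all assumptions made for the example. Only the six output classes follow from the Ekman-6 label set.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each feature vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def project(x, weight):
    """Normalize each time step, then map it to the common dimension."""
    return layer_norm(x) @ weight

def cross_attention(query, context, w_q, w_k, w_v, w_o, num_heads):
    """Multi-head cross-attention: query (Tq, d) attends over context (Tk, d)."""
    t_q, d = query.shape
    t_k = context.shape[0]
    head_dim = d // num_heads
    # Split the model dimension into heads: (num_heads, T, head_dim).
    q = (query @ w_q).reshape(t_q, num_heads, head_dim).transpose(1, 0, 2)
    k = (context @ w_k).reshape(t_k, num_heads, head_dim).transpose(1, 0, 2)
    v = (context @ w_v).reshape(t_k, num_heads, head_dim).transpose(1, 0, 2)
    # Scaled dot-product scores per head: (num_heads, Tq, Tk).
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)
    weights = softmax(scores, axis=-1)
    heads = weights @ v  # (num_heads, Tq, head_dim)
    # Concatenate the heads again and apply the output projection.
    merged = heads.transpose(1, 0, 2).reshape(t_q, d)
    return merged @ w_o, weights

rng = np.random.default_rng(0)
COMMON_DIM, NUM_HEADS, NUM_CLASSES = 64, 4, 6  # six Ekman-6 emotion classes

# Hypothetical pretrained features with different lengths and dimensions.
frame_feats = rng.normal(size=(20, 512))  # assumed: 20 frames, 512-d each
audio_feats = rng.normal(size=(7, 768))   # assumed: 7 audio windows, 768-d each

# Random stand-ins for learned projection matrices.
w_frame = rng.normal(scale=0.02, size=(512, COMMON_DIM))
w_audio = rng.normal(scale=0.02, size=(768, COMMON_DIM))
frames = project(frame_feats, w_frame)  # (20, 64)
audio = project(audio_feats, w_audio)   # (7, 64)

# Random stand-ins for the learned attention weights.
w_q, w_k, w_v, w_o = (
    rng.normal(scale=0.1, size=(COMMON_DIM, COMMON_DIM)) for _ in range(4)
)
fused, attn = cross_attention(frames, audio, w_q, w_k, w_v, w_o, NUM_HEADS)

# Average over time and classify with a random, untrained linear head.
w_cls = rng.normal(scale=0.1, size=(COMMON_DIM, NUM_CLASSES))
probs = softmax(fused.mean(axis=0) @ w_cls)
print(fused.shape, attn.shape, probs.shape)
```

Because every modality is first mapped to the same width, sequences of different lengths can attend to each other without padding or resampling: the attention matrix simply takes the shape (query length, context length) for each head.
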
### Main contributions

- **Improved classification accuracy**: On the Ekman-6 dataset, classification accuracy is increased by 4.3%.
- **Dataset cleaning**: A list of problematic samples in the Ekman-6 dataset is provided to help other researchers improve their models.
- **Open-source web application**: An open-source web application has been launched, enabling users to easily analyze and classify emotions in any video.

Through these efforts, the authors advance the state of the art in video emotion classification and provide a practical tool for researchers and general users.

VEMOCLAP: A video emotion classification web application
