VEMOCLAP: A video emotion classification web application

Serkan Sulun, Paula Viana, Matthew E. P. Davies
2024-10-22
Abstract: We introduce VEMOCLAP: Video EMOtion Classifier using Pretrained features, the first readily available and open-source web application that analyzes the emotional content of any user-provided video. We improve our previous work, which exploits open-source pretrained models that work on video frames and audio, and then efficiently fuses the resulting pretrained features using multi-head cross-attention. Our approach increases the state-of-the-art classification accuracy on the Ekman-6 video emotion dataset by 4.3% and offers an online application for users to run our model on their own videos or YouTube videos. We invite the readers to try our application at http://serkansulun.com/app.
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning, Multimedia, Image and Video Processing
What problem does this paper attempt to address?
The problem this paper addresses is video emotion classification. Specifically, the authors have developed a web application named VEMOCLAP that analyzes the emotional content of any user-provided video. The main problems and goals of the research are:

1. **Improve the accuracy of video emotion classification**:
   - The authors improve on their previous work by using open-source pretrained models to extract video-frame and audio features, and by fusing these features efficiently with a multi-head cross-attention mechanism.
   - On the Ekman-6 video emotion dataset, their method increases the state-of-the-art classification accuracy by 4.3%.
2. **Provide an easy-to-use online application**:
   - VEMOCLAP is an open-source, easy-to-use web application. Users run the model by uploading their own videos or by providing YouTube links.
   - Besides the predicted emotion labels, the application provides additional analysis functions: automatic speech recognition (ASR), optical character recognition (OCR), face detection and expression classification, audio classification, and image caption generation.
3. **Dataset cleaning**:
   - The authors examined and cleaned the Ekman-6 dataset, removing problematic samples to improve the performance of the trained model.
   - Training on the cleaned dataset increases the classification accuracy by 2.6%, but for a fair comparison the authors excluded this result in the report.
4. **Multi-modal feature fusion**:
   - Features extracted from different pretrained models differ in dimensionality and temporal length. The authors handle this by normalizing the features and projecting them to a common dimension, then fusing them with a multi-head attention module.
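The fusion step in item 4 can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation. The feature sizes, the random weights standing in for learned parameters, the choice of video features as queries and audio features as keys and values, and all function names (`layer_norm`, `project`, `cross_attention`) are assumptions made for illustration only.

```python
import math
import random


def layer_norm(vec, eps=1e-5):
    """Normalize one feature vector to zero mean and (near) unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]


def random_matrix(rows, cols, rng):
    """Stand-in for a learned weight matrix (random here, trained in practice)."""
    scale = 1.0 / math.sqrt(rows)
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def matmul(a, b):
    """Multiply a (n x k) by b (k x m), both given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def project(features, weight):
    """Layer-normalize each time step, then map it to the common dimension."""
    return matmul([layer_norm(f) for f in features], weight)


def cross_attention(queries, context, num_heads, rng):
    """Multi-head cross-attention: each query step attends over all context steps.

    queries: (T_q x d_model), context: (T_c x d_model); returns (T_q x d_model).
    """
    d_model = len(queries[0])
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads
    wq, wk, wv, wo = (random_matrix(d_model, d_model, rng) for _ in range(4))
    q, k, v = matmul(queries, wq), matmul(context, wk), matmul(context, wv)
    out = [[] for _ in queries]
    for h in range(num_heads):
        lo, hi = h * d_head, (h + 1) * d_head
        for i, q_row in enumerate(q):
            # Scaled dot-product scores of this query against every context step.
            scores = [
                sum(a * b for a, b in zip(q_row[lo:hi], k_row[lo:hi])) / math.sqrt(d_head)
                for k_row in k
            ]
            weights = softmax(scores)
            # Weighted sum of the value vectors for this head.
            head = [sum(w * v_row[j] for w, v_row in zip(weights, v)) for j in range(lo, hi)]
            out[i].extend(head)
    return matmul(out, wo)


# Hypothetical inputs: two modalities with different widths and lengths.
rng = random.Random(0)
video = [[rng.gauss(0, 1) for _ in range(12)] for _ in range(5)]  # 5 steps, 12-dim
audio = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(9)]   # 9 steps, 8-dim

d_model = 4  # assumed common dimension
video_p = project(video, random_matrix(12, d_model, rng))
audio_p = project(audio, random_matrix(8, d_model, rng))
fused = cross_attention(video_p, audio_p, num_heads=2, rng=rng)
```

After projection both modalities share the same width, so the attention scores are well defined even though the sequences have different lengths. The fused output keeps the length of the query sequence. In a classifier, such a sequence would typically be pooled and passed to a final layer, a step not shown here.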
### Main contributions

- **Improve classification accuracy**: On the Ekman-6 dataset, the classification accuracy is increased by 4.3%.
- **Dataset cleaning**: A list of problematic samples in the Ekman-6 dataset is provided to help other researchers improve their models.
- **Open-source web application**: An open-source web application has been launched, enabling users to analyze and classify emotions in any video.

Through these efforts, the authors have not only advanced video emotion classification but also provided a practical tool for researchers and general users.