A Toolchain for Comprehensive Audio/Video Analysis Using Deep Learning Based Multimodal Approach (A use case of riot or violent context detection)

Lam Pham, Phat Lam, Tin Nguyen, Hieu Tang, Alexander Schindler
2024-05-02
Abstract: In this paper, we present a toolchain for comprehensive audio/video analysis that leverages a deep learning based multimodal approach. To this end, the specific tasks of Speech to Text (S2T), Acoustic Scene Classification (ASC), Acoustic Event Detection (AED), Visual Object Detection (VOD), Image Captioning (IC), and Video Captioning (VC) are conducted and integrated into the toolchain. By combining the individual tasks and analyzing both the audio and visual data extracted from an input video, the toolchain offers several audio/video-based applications: two general applications, audio/video clustering and comprehensive audio/video summary, and one specific application, riot or violent context detection. Furthermore, the toolchain has a flexible and adaptable architecture into which new models can be readily integrated for further audio/video-based applications.
Sound, Artificial Intelligence, Audio and Speech Processing
What problem does this paper attempt to address?
The paper aims to address the following issues:

1. **Development of a Comprehensive Audio-Visual Analysis Toolchain**: The paper proposes a comprehensive audio-visual analysis toolchain built on deep learning multimodal methods. The toolchain integrates Speech-to-Text (S2T), Acoustic Scene Classification (ASC), Acoustic Event Detection (AED), Visual Object Detection (VOD), Image Captioning (IC), and Video Captioning (VC), so that both the audio and the visual data of an input video are analyzed.
2. **Audio-Visual Clustering and Summarization Applications**: On top of this toolchain, the authors build two general applications: audio-visual clustering and comprehensive audio-visual summarization. These applications group large numbers of videos by content and generate detailed textual descriptions.
3. **Riot or Violence Situation Detection**: Building on these general applications, the paper proposes a specific application: riot or violence situation detection. This application defines violence-related keywords and combines audio events, acoustic scenes, and visual information to determine whether a riot or violent situation occurs in the video and to assess its severity.

In summary, the core objective of the paper is to develop a flexible and scalable audio-visual analysis toolchain that supports a range of application scenarios, in particular the detection of riot or violence situations in the public safety domain.
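The keyword-based riot or violence detection described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the keyword list, the `ClipAnalysis` container, the function names, and the severity rule (flagging a clip as "high" when both the audio and the visual side contain a keyword) are hypothetical and are not taken from the paper, which does not spell out these details here. The sketch only shows the general idea of matching predefined keywords against the text outputs of the individual tasks.

```python
import re
from dataclasses import dataclass, field

# Hypothetical keyword list; the paper's actual keywords are not given here.
VIOLENCE_KEYWORDS = {"riot", "fight", "gunshot", "explosion", "fire", "weapon", "scream"}


@dataclass
class ClipAnalysis:
    """Hypothetical container for the text outputs of each task on one clip."""
    transcript: str = ""                               # S2T output
    scene: str = ""                                    # ASC label
    audio_events: list = field(default_factory=list)   # AED labels
    objects: list = field(default_factory=list)        # VOD labels
    captions: list = field(default_factory=list)       # IC / VC sentences


def _words(texts):
    """Lowercase the given strings and return the set of alphabetic words."""
    return set(re.findall(r"[a-z]+", " ".join(texts).lower()))


def keyword_hits(analysis):
    """Return the violence keywords found on the audio and the visual side."""
    audio = _words([analysis.transcript, analysis.scene, *analysis.audio_events])
    visual = _words([*analysis.objects, *analysis.captions])
    return {
        "audio": VIOLENCE_KEYWORDS & audio,
        "visual": VIOLENCE_KEYWORDS & visual,
    }


def detect_violence(analysis):
    """Flag a clip and assign a coarse severity (assumed rule, not the paper's)."""
    hits = keyword_hits(analysis)
    has_audio, has_visual = bool(hits["audio"]), bool(hits["visual"])
    if has_audio and has_visual:
        severity = "high"   # both modalities agree
    elif has_audio or has_visual:
        severity = "low"    # only one modality contains a keyword
    else:
        severity = "none"
    return {"violent": has_audio or has_visual, "severity": severity, "hits": hits}


clip = ClipAnalysis(
    transcript="people are shouting",
    scene="street",
    audio_events=["gunshot", "siren"],
    objects=["person", "car"],
    captions=["a crowd starts a fight near a burning car"],
)
result = detect_violence(clip)
calm = detect_violence(ClipAnalysis(transcript="welcome to the show", scene="indoor"))
```

Exact word matching is deliberately simple: the caption word "burning" does not match the keyword "fire", so a practical system would likely need stemming, synonyms, or learned classifiers rather than a fixed word list.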