Bi-attention Modal Separation Network for Multimodal Video Fusion

Pengfei Du, Yali Gao, Xiaoyong Li
DOI: https://doi.org/10.1007/978-3-030-98358-1_46
2022-01-01
Abstract: With the growing popularity of video-sharing websites such as YouTube and Facebook, multimodal video understanding has received increasing attention from the scientific community. Video usually comprises several modal signals, such as images, text, and audio. The main approach to this task is to develop powerful multimodal fusion techniques. Multimodal data fusion transforms data from multiple single-modality representations into one compact multimodal representation. An effective multimodal fusion method should capture two key characteristics: the consistency across modalities and the differences between them. Previous studies mainly focused on applying different interaction methods to modal fusion, such as late fusion, early fusion, and attention fusion, but ignored modal independence in the fusion process. In this paper, we introduce a fusion approach called the bi-attention modal separation fusion network (BAMS), which extracts and integrates key information from the various modalities and performs both fusion and separation on the modality representations. We conduct thorough ablation studies, and our experiments on the MOSI and MOSEI datasets demonstrate significant gains over state-of-the-art models.