Multimodal Analysis for Deep Video Understanding with Video Language Transformer

Yaqun Fang, Beibei Zhang, Gangshan Wu, Tongwei Ren
DOI: https://doi.org/10.1145/3503161.3551600
2022-10-10
Abstract: The Deep Video Understanding Challenge (DVUC) aims to use information from multiple modalities to build a high-level understanding of video, involving tasks such as relationship recognition and interaction detection. In this paper, we use a joint learning framework to predict multiple tasks simultaneously using visual, text, audio, and pose features. In addition, to answer the DVUC queries, we design multiple answering strategies and use a video language transformer, which learns cross-modal information, to match videos with text choices. The final DVUC results show that our method ranks first in group one of the movie-level queries and third in both group one and group two of the scene-level queries.
Computer Science