CFMMC-Align: Coarse-Fine Multi-Modal Contrastive Alignment Network for Traffic Event Video Question Answering

Kan Guo, Daxin Tian, Yongli Hu, Chunmian Lin, Yanfeng Sun, Jianshan Zhou, Xuting Duan, Junbin Gao, Baocai Yin
DOI: https://doi.org/10.1109/tcsvt.2024.3409453
IF: 5.859
2024-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: Traffic video question answering (TrafficVQA) is a specialized VideoQA task that targets both basic comprehension of and intricate reasoning about videos of traffic events. Recent VideoQA models employ pretrained visual and textual encoders to bridge the feature-space gap between visual and textual data. The TrafficVQA task, however, raises three pivotal issues: (i) Dimension Gap: pretrained image models (appearance features) and video models (motion features) differ conspicuously in the dimensions of the static and dynamic visual data they represent; (ii) Scene Gap: common real-world datasets and traffic event datasets differ in visual scene content; (iii) Modality Gap: a pronounced feature distribution discrepancy exists between traffic video and text data. To alleviate these challenges, we introduce the coarse-fine multimodal contrastive alignment network (CFMMC-Align). The model leverages sequence-level and token-level multimodal features, using an unsupervised visual multimodal contrastive loss to mitigate the dimension and scene gaps and a supervised visual-textual contrastive loss to alleviate the modality discrepancy. Finally, the model is validated on the challenging public TrafficVQA dataset SUTD-TrafficQA, where it outperforms the state-of-the-art method by a substantial margin (50.2% compared to 46.0%). The code is available at https://github.com/guokan987/CFMMC-Align.
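To illustrate the idea of combining sequence-level (coarse) and token-level (fine) contrastive terms, the following is a minimal NumPy sketch under stated assumptions. It is not the authors' implementation, which is in the repository linked above. The function names (`info_nce`, `coarse_fine_loss`), the temperature value, the mean pooling used for the coarse term, and the one-to-one pairing of tokens in the fine term are all hypothetical choices made for illustration; a generic symmetric InfoNCE loss stands in for the paper's unsupervised and supervised contrastive losses.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Scale vectors to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def log_softmax(logits, axis=-1):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def info_nce(a, b, temperature=0.07):
    # Symmetric InfoNCE: row i of `a` and row i of `b` form the positive pair;
    # every other row in the batch serves as a negative.
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature
    idx = np.arange(len(a))
    loss_ab = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_ba = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_ab + loss_ba)

def coarse_fine_loss(tokens_a, tokens_b, temperature=0.07):
    # tokens_a, tokens_b: arrays of shape (batch, tokens, dim), assumed to be
    # already projected to a shared dimension (hypothetical setup).
    # Coarse term: contrast mean-pooled, sequence-level features across the batch.
    coarse = info_nce(tokens_a.mean(axis=1), tokens_b.mean(axis=1), temperature)
    # Fine term: within each sample, contrast token t of one stream against
    # token t of the other (assumes the two streams are token-aligned).
    fine = np.mean([info_nce(ta, tb, temperature)
                    for ta, tb in zip(tokens_a, tokens_b)])
    return coarse + fine

# Example with random stand-ins for appearance and motion features.
rng = np.random.default_rng(0)
appearance = rng.normal(size=(4, 5, 16))
motion = appearance + 0.01 * rng.normal(size=(4, 5, 16))
loss = coarse_fine_loss(appearance, motion)
```

In this sketch the loss is small when the two streams carry matching content for the same sample and grows when the pairing is broken, which is the behavior a contrastive alignment objective is meant to encourage.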