Anticipation Video Captioning of Aerial Refueling Based on Combined Attention Masking Mechanism

Shuai Wu, Wei Tong, Ya Duan, Weidong Yang, Guangyu Zhu, Edmond Q. Wu
DOI: https://doi.org/10.1109/tiv.2024.3353172
IF: 8.2
2024-01-01
IEEE Transactions on Intelligent Vehicles
Abstract: Incorporating visual cues to anticipate and describe future events holds significant promise for enhancing user-friendly early warning systems in emergency response scenarios. However, existing video captioning techniques focus predominantly on describing ongoing events within observed videos. The challenging task of predicting captions for unobserved videos based on observed visual cues remains largely unaddressed. To fill this gap, we introduce a novel neural network architecture termed the Anticipation Video Captioning Transformer, which is built upon the transformer architecture and comprises three modules. The first module is a video feature extractor that uses a video transformer to extract spatiotemporal features from the observed video data. The second module is a multimodal masked language model that learns the correlations between video content and the corresponding captions. The third module is a decoder that generates captions for both the observed video and the anticipated, unobserved video. To assess the efficacy of the proposed method and its potential applicability in emergency scenarios, we developed a specialized dataset dedicated to aerial refueling anticipation video captioning. Qualitative and quantitative experiments consistently demonstrate the effectiveness of our approach in producing user-friendly anticipation captions. Overall, our work extends video captioning beyond describing the present state of affairs to anticipating future events, which can potentially enhance early warning systems and improve emergency response procedures.
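The abstract does not spell out how the "combined attention masking" named in the title is constructed. As an illustration only, the sketch below shows one common way to combine two masking rules in a single transformer over a joint video-plus-caption sequence: video tokens attend to each other bidirectionally, while caption tokens attend to all video tokens and only to earlier caption tokens. This layout is an assumption borrowed from general sequence-to-sequence masking practice, not the paper's confirmed design, and the function name `build_combined_mask` and its parameters are hypothetical.

```python
from typing import List


def build_combined_mask(num_video: int, num_caption: int) -> List[List[bool]]:
    """Return a (T x T) mask where mask[q][k] is True if query q may attend to key k.

    Assumed token layout (hypothetical): [video_0 .. video_{V-1}, cap_0 .. cap_{C-1}].
    """
    total = num_video + num_caption
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            if k < num_video:
                # Every token (video or caption) may look at all video tokens.
                mask[q][k] = True
            elif q >= num_video:
                # Caption queries see caption keys causally (no future words).
                mask[q][k] = k <= q
            # Otherwise: video queries never see caption keys (stays False).
    return mask


if __name__ == "__main__":
    # Render a small mask: '1' = attention allowed, '0' = blocked.
    for row in build_combined_mask(num_video=2, num_caption=3):
        print("".join("1" if allowed else "0" for allowed in row))
```

In a framework such as PyTorch, a boolean matrix like this would typically be converted to the mask format the attention layer expects; note that some APIs treat True as "blocked" rather than "allowed", so the polarity may need to be inverted.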