Abstract: Video captioning interprets video content and gives decision-makers user-friendly natural-language narration, narrowing the gap between humans and machines and promoting human-machine interaction. It therefore has good application prospects in emergency response scenarios such as aerial refueling and assisted driving. However, current video captioning methods have two problems: 1) they mainly target general domains, and few studies address industrial applications; 2) they model the semantic interaction between video and text from a single view only (tokens or sentences). To address these problems, this paper proposes a multi-view end-to-end video captioning (MVVC) method for human-machine fusion. Compared with previous video captioning methods, 1) the MVVC model is end-to-end: it takes video frames directly as input, without running object detection on each frame; 2) it performs cross-modal interaction between video and text from both local and global views, so the model can understand video content and generate text at two granularities (tokens and sentences) simultaneously. To verify the performance of the new model, we conducted a series of comparative and ablation experiments with MVVC on two datasets, one for aerial refueling and one for autonomous driving. The experiments show that the new method has a stronger video understanding ability and generates more accurate video descriptions. They also show that the video captioning task can promote human-machine fusion and assist decision-making in emergency scenarios.

Note to Practitioners: The motivation of this paper is to convert video into natural language so that an autonomous system can automatically understand the observed scene, describe it to relevant stakeholders, and promote human-machine fusion. Traditional methods, however, must process the video offline and understand its content insufficiently. This paper therefore proposes an end-to-end video captioning method based on multi-view semantic alignment, which understands the video content directly from the raw video pixels in real time and improves captioning accuracy. It can meet the application requirements of the industrial field and has practical application value.
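The distinction between the global (sentence-level) and local (token-level) views can be illustrated with a small sketch. This is not the MVVC implementation, which the abstract does not specify. It assumes that frames and tokens are already embedded as vectors, that the global view compares mean-pooled features, and that the local view matches each token to its most similar frame. All function names and feature values are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pool(vectors):
    """Average a list of vectors into a single vector."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def global_similarity(frame_feats, token_feats):
    """Global view (assumed): one pooled video vector vs. one pooled sentence vector."""
    return cosine(mean_pool(frame_feats), mean_pool(token_feats))


def local_similarity(frame_feats, token_feats):
    """Local view (assumed): match each token to its best frame, then average."""
    best = [max(cosine(t, f) for f in frame_feats) for t in token_feats]
    return sum(best) / len(best)


# Toy, hand-made features (hypothetical): two frames and two tokens in 2-D.
frames = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [1.0, 0.0]]

local_score = local_similarity(frames, tokens)
global_score = global_similarity(frames, tokens)
print(f"local={local_score:.3f} global={global_score:.3f}")
```

In this toy case every token finds a perfectly matching frame, so the local score is high, while pooling mixes in the unrelated second frame and lowers the global score. A single-view model sees only one of these two signals.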

Research on Video Captioning Based on Multifeature Fusion

Integrating both Visual and Audio Cues for Enhanced Video Caption

Fusion of Multi-Modal Features to Enhance Dense Video Caption

Research on Feature Extraction and Multimodal Fusion of Video Caption Based on Deep Learning

Measuring apoptosis in neural stem cells

Multi-scale features with temporal information guidance for video captioning

MFVC: Urban Traffic Scene Video Caption Based on Multimodal Fusion

Attention-based Visual-Audio Fusion for Video Caption Generation

QAVidCap: Enhancing Video Captioning Through Question Answering Techniques

Multimodal feature fusion based on object relation for video captioning

Multimodality-guided Visual-Caption Semantic Enhancement

Event-centric multi-modal fusion method for dense video captioning

Video Captioning with Aggregated Features Based on Dual Graphs and Gated Fusion

Chinese image captioning with fusion encoder and visual keyword search

CapsFusion: Rethinking Image-Text Data at Scale

Learning Video-Text Aligned Representations for Video Captioning

Self Attention Re-encoding and Linguistic Ability Preserving for Context-Aware Video Captioning

XlanV Model with Adaptively Multi-Modality Feature Fusing for Video Captioning

End-to-End Video Captioning Based on Multiview Semantic Alignment for Human–Machine Fusion

Exploring the Role of Audio in Video Captioning

Augmented Partial Mutual Learning with Frame Masking for Video Captioning