Abstract: Video captioning interprets video content and gives decision-makers a user-friendly natural language narration, which narrows the gap between human and machine and promotes human-machine interaction. It therefore has good application prospects in emergency response scenarios such as aerial refueling and assisted driving. However, current video captioning methods have two problems: 1) they mainly target general domains, and few studies address industrial applications; 2) they model the semantic interaction between video and text from a single view only (tokens or sentences). To address these problems, this paper proposes a multi-view end-to-end video captioning (MVVC) method for human-machine fusion. Compared with previous video captioning methods, 1) the MVVC model is end-to-end and takes video frames directly as input, without running object detection on each frame; 2) it performs cross-modal interaction between video and text from both local and global views, so the model can understand video content and generate text at two granularities (from tokens to sentences) simultaneously. To verify the performance of the new model, we conducted a series of comparative and ablation experiments with MVVC on two datasets, one for aerial refueling and one for automatic driving. The experiments show that the new method has a stronger video understanding ability and generates more accurate video descriptions. They also verify that the video captioning task can promote human-machine fusion and assist decision-making in emergency scenarios.

Note to Practitioners: The motivation of this paper is to convert video into natural language so that an autonomous system can automatically understand the observed scene, describe it to relevant stakeholders, and promote human-machine fusion. However, traditional methods need to process the video offline and do not understand the video content sufficiently. Therefore, this paper proposes an end-to-end video captioning method based on multi-view semantic alignment, which can understand video content directly from the raw video pixels in real time and improve captioning accuracy. It can meet the application requirements of the industrial field and has practical application value.
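
The abstract distinguishes a local view (tokens) from a global view (sentences) of video-text interaction but does not give the scoring functions. The sketch below is only an illustration of that distinction under stated assumptions, not the MVVC model itself: it assumes frame and token features are already available as plain vectors, uses mean pooling plus cosine similarity for the global view, and a max-over-frames match per token for the local view. All function names (`cosine`, `mean_pool`, `global_similarity`, `local_similarity`) and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mean_pool(vectors):
    """Average a list of equal-length vectors into one vector."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def global_similarity(frames, tokens):
    """Global (sentence-level) view: compare the pooled video with the pooled sentence."""
    return cosine(mean_pool(frames), mean_pool(tokens))


def local_similarity(frames, tokens):
    """Local (token-level) view: match each token to its best frame, then average."""
    best_matches = [max(cosine(token, frame) for frame in frames) for token in tokens]
    return sum(best_matches) / len(best_matches)


# Toy features (assumed): two frames and two tokens in a 2-D embedding space.
frames = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 2.0]]

local_score = local_similarity(frames, tokens)
global_score = global_similarity(frames, tokens)
```

In this toy case every token points in the same direction as some frame, so the local score is at its maximum, while the global score is lower because pooling blurs the per-token matches. That difference is one reason a model might combine both views rather than rely on either alone.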

Multirate Multimodal Video Captioning

Multi-Task Video Captioning with a Stepwise Multimodal Encoder

Describing Videos Using Multi-modal Fusion

Generating Natural Video Descriptions Via Multimodal Processing

Enhanced Video Caption Generation Based on Multimodal Features

Multimodal Memory Modelling for Video Captioning

Multi-Modal interpretable automatic video captioning

Bidirectional Long-Short Term Memory for Video Description

Video Captioning with Guidance of Multimodal Latent Topics

From Deterministic to Generative: Multimodal Stochastic RNNs for Video Captioning

Dual-Stream Recurrent Neural Network for Video Captioning

Multimodal feature fusion based on object relation for video captioning

Joint Multi-Scale Information and Long-Range Dependence for Video Captioning

Learning Multimodal Attention LSTM Networks for Video Captioning

Richer Semantic Visual and Language Representation for Video Captioning

MSR Video to Language Challenge

Describing Video with Attention-Based Bidirectional LSTM

End-to-End Video Captioning Based on Multiview Semantic Alignment for Human–Machine Fusion

MIVCN: Multimodal interaction video captioning network based on semantic association graph

Multimodality-guided Visual-Caption Semantic Enhancement

Multimodal-enhanced hierarchical attention network for video captioning