Abstract: Video captioning interprets video content and provides decision-makers with user-friendly natural-language narration, narrowing the gap between humans and machines and promoting human-machine interaction. It therefore has good application prospects in emergency response scenarios such as aerial refueling and assisted driving. However, current video captioning methods have two problems: 1) they are mainly oriented to general domains, and few studies address industrial applications; 2) they model the semantic interaction between video and text from a single view only (tokens or sentences). To address these problems, this paper proposes a multi-view end-to-end video captioning (MVVC) method for human-machine fusion. Compared with previous video captioning methods, 1) the MVVC model is end-to-end: it takes video frames directly as input, without running object detection on each frame; 2) it performs cross-modal interaction between video and text from both local and global views, so the model can understand video content and generate text at two granularities simultaneously (from tokens to sentences). To verify the performance of the new model, we conducted a series of comparative and ablation experiments with MVVC on two datasets, aerial refueling and automatic driving. The experiments show that the new method has a stronger video understanding ability and generates more accurate video descriptions. They also indicate that the video captioning task can promote human-machine fusion and assist decision-making in emergency scenarios.

Note to Practitioners: The motivation of this paper is to convert video into natural language so that an autonomous system can automatically understand the observed scene, describe it to relevant stakeholders, and promote human-machine fusion. However, traditional methods need to process the video offline and have an insufficient understanding of the video content. Therefore, this paper proposes an end-to-end video captioning method based on multi-view semantic alignment, which can understand video content directly from the raw video pixels in real time and improve captioning accuracy. It can meet the application requirements of the industrial field and has practical application value.
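
To make the idea of interacting from both a local and a global view concrete, the following is a minimal illustrative sketch, not the MVVC architecture itself. It assumes that frame and token features already live in a shared embedding space, uses cosine similarity as the alignment measure, mean pooling for the sentence-level and video-level vectors, and a fixed mixing weight. All of these choices, and every function and parameter name (`local_alignment`, `global_alignment`, `multi_view_score`, `local_weight`), are assumptions made for illustration and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors into one vector."""
    count = len(vectors)
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / count for i in range(dim)]


def local_alignment(frame_feats, token_feats):
    """Local (token-level) view: match each token to its most similar frame,
    then average those best-match similarities over all tokens."""
    best = [max(cosine(tok, frame) for frame in frame_feats) for tok in token_feats]
    return sum(best) / len(best)


def global_alignment(frame_feats, token_feats):
    """Global (sentence-level) view: compare the pooled video vector
    with the pooled sentence vector."""
    return cosine(mean_pool(frame_feats), mean_pool(token_feats))


def multi_view_score(frame_feats, token_feats, local_weight=0.5):
    """Hypothetical combination of the two views with a fixed mixing weight."""
    local = local_alignment(frame_feats, token_feats)
    global_ = global_alignment(frame_feats, token_feats)
    return local_weight * local + (1.0 - local_weight) * global_


# Toy usage with made-up 2-D features: two frames and two caption tokens.
frames = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0]]
score = multi_view_score(frames, tokens)
```

In this toy setting the local view rewards captions whose individual tokens each find a matching frame, while the global view rewards agreement between the overall video and the overall sentence. A trained model would learn these interactions rather than use fixed pooling and a fixed weight.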

Multi-level video captioning method based on semantic space

VMSG: a video caption network based on multimodal semantic grouping and semantic attention

MIVCN: Multimodal interaction video captioning network based on semantic association graph

Semantic-Driven Saliency-Context Separation for Video Captioning

A Video Captioning Method by Semantic Topic-Guided Generation

Video Captioning with Transferred Semantic Attributes.

Discriminative Latent Semantic Graph for Video Captioning

Multi-Modal interpretable automatic video captioning

Multimodal Semantic Attention Network for Video Captioning

Center-enhanced video captioning model with multimodal semantic alignment

End-to-End Video Captioning Based on Multiview Semantic Alignment for Human–Machine Fusion

Human-like Controllable Image Captioning with Verb-specific Semantic Roles

Video Captioning With Attention-Based LSTM and Semantic Consistency

Multimodality-guided Visual-Caption Semantic Enhancement

Bidirectional Long-Short Term Memory for Video Description

Video Captioning with Guidance of Multimodal Latent Topics

Attentive Semantic Video Generation Using Captions

Multi-scale features with temporal information guidance for video captioning

A Semantics-Assisted Video Captioning Model Trained with Scheduled Sampling

Structured Encoding Based on Semantic Disambiguation for Video Captioning

Concept Parser with Multimodal Graph Learning for Video Captioning