Abstract: Video captioning interprets video content and provides decision-makers with user-friendly natural-language narration, narrowing the gap between humans and machines and promoting human-machine interaction. It therefore has good application prospects in emergency response scenarios such as aerial refueling and assisted driving. However, current video captioning methods have two problems: 1) they are mainly oriented to general domains, and there are few studies on industrial applications; 2) they model the semantic interaction between video and text from a single view only (tokens or sentences). To address these problems, this paper proposes a multi-view end-to-end video captioning (MVVC) method for human-machine fusion. Compared with previous video captioning methods, 1) the MVVC model is end-to-end and takes video frames directly as input, without object detection on each frame; 2) it performs cross-modal interaction between video and text from both local and global views, so the model can simultaneously understand video content and generate text at two granularities (from tokens to sentences). To verify the performance of the new model, we conducted a series of comparative and ablation experiments with MVVC on two datasets, one for aerial refueling and one for automatic driving. The experiments show that the new method has a stronger video understanding ability and generates more accurate video descriptions. They also indicate that the video captioning task can promote human-machine fusion and assist decision-making in emergency scenarios.

Note to Practitioners: The motivation of this paper is to convert video into natural language so that an autonomous system can automatically understand the observed scene, describe it to relevant stakeholders, and promote human-machine fusion. However, traditional methods need to process the video offline and have an insufficient understanding of the video content. Therefore, this paper proposes an end-to-end video captioning method based on multi-view semantic alignment, which can understand the video content directly from the original video pixels in real time and improve captioning accuracy. It can meet the application requirements of the industrial field and has practical application value.
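The idea of aligning video and text from a local (token) view and a global (sentence) view can be illustrated with a small sketch. This is not the MVVC model: the abstract does not specify the architecture, so the function names, the best-frame matching rule, the mean pooling, and the mixing weight `alpha` below are all assumptions chosen only to show what scoring at two granularities might look like.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_pool(vectors):
    """Average a list of vectors into one vector (a simple global summary)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def local_similarity(frame_feats, token_feats):
    """Local view (assumed): match each token to its best frame, then average."""
    best = [max(cosine(t, f) for f in frame_feats) for t in token_feats]
    return sum(best) / len(best)


def global_similarity(frame_feats, token_feats):
    """Global view (assumed): compare the pooled video and pooled sentence."""
    return cosine(mean_pool(frame_feats), mean_pool(token_feats))


def multi_view_score(frame_feats, token_feats, alpha=0.5):
    """Blend both views; alpha is a hypothetical mixing weight in [0, 1]."""
    local = local_similarity(frame_feats, token_feats)
    overall = global_similarity(frame_feats, token_feats)
    return alpha * local + (1.0 - alpha) * overall


if __name__ == "__main__":
    # Toy 2-D features standing in for learned frame and token embeddings.
    frames = [[1.0, 0.0], [0.0, 1.0]]
    tokens = [[1.0, 0.0], [0.0, 1.0]]
    print(multi_view_score(frames, tokens))
```

In a trained system the toy vectors would be replaced by learned frame and token embeddings, and a score of this kind would typically feed an alignment loss rather than be used directly.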

End-to-End Video Captioning Based on Multiview Semantic Alignment for Human–Machine Fusion

Measuring apoptosis in neural stem cells.

MFVC: Urban Traffic Scene Video Caption Based on Multimodal Fusion

QAVidCap: Enhancing Video Captioning Through Question Answering Techniques

Multi-level video captioning method based on semantic space

A Video Captioning Method by Semantic Topic-Guided Generation

Video Captioning with Transferred Semantic Attributes.

Anticipation Video Captioning of Aerial Refueling Based on Combined Attention Masking Mechanism

MIVCN: Multimodal interaction video captioning network based on semantic association graph

Integrating both Visual and Audio Cues for Enhanced Video Caption

Learning Video-Text Aligned Representations for Video Captioning

Video Captioning With Attention-Based LSTM and Semantic Consistency

Aligning Source Visual and Target Language Domains for Unpaired Video Captioning

Video Interactive Captioning with Human Prompts.

Edit As You Wish: Video Caption Editing with Multi-grained User Control

Semantic-Driven Saliency-Context Separation for Video Captioning

Non-Autoregressive Coarse-to-Fine Video Captioning

Subject-Oriented Video Captioning

Attention-based Visual-Audio Fusion for Video Caption Generation.

The nature of respiratory changes associated with sleep onset.

Multimodality-guided Visual-Caption Semantic Enhancement