Abstract: Video captioning interprets video content and provides decision-makers with user-friendly natural language narration, which narrows the gap between humans and machines and promotes human-machine interaction. It therefore has good application prospects in emergency response scenarios such as aerial refueling and assisted driving. However, current video captioning methods have two problems: 1) they are mainly oriented to general domains, and there are few studies on industrial applications; 2) they model the semantic interaction between video and text from only a single view (tokens or sentences). To address these problems, this paper proposes a multi-view end-to-end video captioning (MVVC) method for human-machine fusion. Compared with previous video captioning methods, 1) the MVVC model is end-to-end and takes video frames directly as input, without object detection on each frame; 2) it performs cross-modal interaction between video and text from both local and global views, so the model can simultaneously understand video content and generate text at two granularities (from tokens to sentences). To verify the performance of the new model, we conducted a series of comparative and ablation experiments with MVVC on two data sets, aerial refueling and automatic driving. The experiments show that our method has a stronger video understanding ability and generates more accurate video descriptions. They also verify that the video captioning task can promote human-machine fusion and assist decision-making in emergency scenarios.

Note to Practitioners: The motivation of this paper is to convert video into natural language so that an autonomous system can automatically understand the observed scene, describe it to relevant stakeholders, and promote human-machine fusion. However, traditional methods need to process the video offline and have an insufficient understanding of the video content. Therefore, this paper proposes an end-to-end video captioning method based on multi-view semantic alignment, which understands the video content directly from the original video pixels in real time and improves captioning accuracy. It meets the application requirements of the industrial field and has practical application value.

End-to-End Video Captioning Based on Multiview Semantic Alignment for Human–Machine Fusion
