Abstract: Automatically generating natural language descriptions for in-the-wild videos is a challenging task. Most recent progress in this field has come from combining Convolutional Neural Networks (CNNs) with Encoder-Decoder Recurrent Neural Networks (RNNs). However, the existing Encoder-Decoder RNN framework has difficulty capturing a large number of long-range dependencies as the number of LSTM units increases. This causes substantial information loss and leads to poor performance on the task. To address this problem, we propose a novel framework, namely Cross and Conditional Long Short-Term Memory (CC-LSTM). It is composed of a novel Cross Long Short-Term Memory (Cr-LSTM) for the encoding module and a Conditional Long Short-Term Memory (Co-LSTM) for the decoding module. In the encoding module, the Cr-LSTM encodes the visual input into a richly informative representation by a cross-input method. In the decoding module, the Co-LSTM feeds visual features, which are based on the generated sentence and contain the global information of the visual content, into the LSTM unit as an extra visual feature. For the task of video captioning, extensive experiments are conducted on two public datasets, i.e., MSVD and MSR-VTT. Along with visualizations of the results and of how our model works, these experiments quantitatively demonstrate the effectiveness of the proposed CC-LSTM in translating videos to sentences with rich semantics.

Describing Videos Using Multi-modal Fusion.

Generating Natural Video Descriptions Via Multimodal Processing

Enhanced Video Caption Generation Based on Multimodal Features.

Integrating both Visual and Audio Cues for Enhanced Video Caption

Fusion of Multi-Modal Features to Enhance Dense Video Caption

Attention-based Visual-Audio Fusion for Video Caption Generation.

Attention-Based Multimodal Fusion for Video Description

Bidirectional Long-Short Term Memory for Video Description

Multimodal feature fusion based on object relation for video captioning

Multi-Modal interpretable automatic video captioning

Describing Video with Attention-Based Bidirectional LSTM

Dual-Stream Recurrent Neural Network for Video Captioning

Video Captioning with Transferred Semantic Attributes.

CC-LSTM: Cross and Conditional Long-Short Time Memory for Video Captioning

Multi-View Feature Fusion and Visual Prompt for Remote Sensing Image Captioning

Research on Feature Extraction and Multimodal Fusion of Video Caption Based on Deep Learning

Multimodal Memory Modelling for Video Captioning

Visual Oriented Encoder: Integrating Multimodal and Multi-Scale Contexts for Video Captioning

Everything is a Video: Unifying Modalities through Next-Frame Prediction

A Dataset with Multi-Modal Information and Multi-Granularity Descriptions for Video Captioning

Learning Multimodal Attention LSTM Networks for Video Captioning.