Abstract:Automatically describing video content with natural language is a fundamental challenge of computer vision. Recurrent Neural Networks (RNNs), which models sequence dynamics, has attracted increasing attention on visual interpretation. However, most existing approaches generate a word locally with the given previous words and the visual content, while the relationship between sentence semantics and visual content is not holistically exploited. As a result, the generated sentences may be contextually correct but the semantics (e.g., subjects, verbs or objects) are not true.This paper presents a novel unified framework, named Long Short-Term Memory with visual-semantic Embedding (LSTM-E), which can simultaneously explore the learning of LSTM and visual-semantic embedding. The former aims to locally maximize the probability of generating the next word given previous words and visual content, while the latter is to create a visual-semantic embedding space for enforcing the relationship between the semantics of the entire sentence and visual content. The experiments on YouTube2Text dataset show that our proposed LSTM-E achieves to-date the best published performance in generating natural sentences: 45.3% and 31.0% in terms of BLEU@4 and METEOR, respectively. Superior performances are also reported on two movie description datasets (M-VAD and MPII-MD). In addition, we demonstrate that LSTM-E outperforms several state-of-the-art techniques in predicting Subject-Verb-Object (SVO) triplets.

Generating video description with Long-Short Term Memory

Generating image descriptions with multidirectional 2D long short-term memory

Video Description with Integrated Visual and Textual Information

Bidirectional Long-Short Term Memory for Video Description

Jointly Modeling Embedding and Translation to Bridge Video and Language

Deep Hierarchical Attention Network for Video Description

Video Description Using Learning Multiple Features

CC-LSTM: Cross and Conditional Long-Short Time Memory for Video Captioning

First-Feed LSTM Model for Video Description

The Long-Short Story of Movie Description

Rich Visual and Language Representation with Complementary Semantics for Video Captioning

Describing Video with Multiple Descriptions

Video Captioning with Transferred Semantic Attributes.

Richer Semantic Visual and Language Representation for Video Captioning

LSTM-in-LSTM for Generating Long Descriptions of Images.

Describing Video with Attention-Based Bidirectional LSTM

Generating Natural Video Descriptions Via Multimodal Processing

Video Description Generation Using Audio and Visual Cues

Video Captioning with Semantic Information from the Knowledge Base

A Video Description Model with Improved Attention Mechanism

Attention-Based Convolutional LSTM for Describing Video