Abstract: Video captioning aims to automatically describe a video clip with informative sentences. At present, deep learning-based models have become the mainstream approach to this task and achieve competitive results on public datasets. Usually, these methods leverage different types of features to generate sentences, e.g., semantic information and 2D or 3D features. However, some methods treat semantic information only as a complement to visual representations and cannot fully exploit it, and some ignore the relationship between different types of features. In addition, most of them select multiple frames of a video with an equally spaced sampling scheme, resulting in much redundant information. To address these issues, we present a novel video-captioning framework, Semantic Enhanced video captioning with Multi-feature Fusion (SEMF for short). It optimizes the use of different types of features in three respects. First, a semantic encoder is designed to enhance meaningful semantic features through a semantic dictionary to boost performance. Second, a discrete selection module pays attention to important features and obtains different contexts at different steps to reduce feature redundancy. Finally, a multi-feature fusion module uses a novel relation-aware attention mechanism to separate the common and complementary components of different features, providing more effective visual features for the next step. Moreover, the entire framework can be trained in an end-to-end manner. Extensive experiments are conducted on the Microsoft Research Video Description Corpus (MSVD) and MSR-Video to Text (MSR-VTT) datasets. The results demonstrate that SEMF achieves state-of-the-art results.
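To make the redundancy argument concrete, the following is a minimal sketch of the equally spaced sampling scheme that the abstract describes as the common baseline. It is not part of SEMF; the function name `uniform_frame_indices` and its parameters are hypothetical and chosen only for illustration.

```python
from typing import List


def uniform_frame_indices(num_frames: int, num_samples: int) -> List[int]:
    """Return frame indices spread evenly over a clip.

    Hypothetical helper illustrating the equally spaced baseline,
    not the discrete selection module proposed in the paper.
    """
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")

    # Never request more frames than the clip contains.
    num_samples = min(num_samples, num_frames)

    # Split the clip into equal-width segments and take each segment's centre.
    step = num_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


if __name__ == "__main__":
    print(uniform_frame_indices(100, 4))
```

The chosen indices depend only on the clip length, never on its content, so a clip with long static stretches yields several nearly identical frames. This is the kind of redundancy that a content-aware selection step is meant to reduce.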

A Novel Framework for Semantic-Based Video Retrieval

Video diver: generic video indexing with diverse features.

Semantic-based surveillance video retrieval

Visual and textual fusion for semantically supervised region-based retrieval

A Novel Video Searching Model Based on Ontology Inference and Multimodal Information Fusion.

A novel multi-feature fusion and sparse coding-based framework for image retrieval

A Feature Selection Framework for Video Semantic Recognition Via Integrated Cross-Media Analysis and Embedded Learning.

Video retrieval with multi-modal features.

An Efficient Approach Based on Image Pixel and Semantic Features Towards Video Retrieval

Design and Implementation of Semantic Concept Based Video Retrieval System

Multi Semantic Feature Fusion Framework for Video Segmentation and Description

A Novel Video Content Understanding Scheme Based on Feature Combination Strategy.

Research On The Video Semantic Analysis Framework Based On Multiple Feature Fusion And Deep Learning Structure

Semantic Enhanced Video Captioning with Multi-feature Fusion

An Improved System for Concept-Based Video Retrieval

Exploiting Visual Semantic Reasoning for Video-Text Retrieval

Using high-level semantic features in video retrieval

Semantics-Biased Rapid Retrieval for Video Databases

Video Concept Detection Based on Multiple Features and Classifiers Fusion

Applying Semantic Association To Support Content-Based Video Retrieval

A novel method of image retrieval based on combination of semantic and visual features