Abstract: In this paper, we address the problem of multi-modality video representation and semantic concept detection. The interaction and integration of multiple media types in video, such as visual, audio and textual data, are essential to video semantic analysis. Traditionally, videos are represented as vectors in Euclidean space, and learning algorithms are then applied to these high-dimensional vectors for dimension reduction, classification, clustering and so on. However, the modalities in video not only have their own properties but are also correlated with one another; a simple vector representation weakens the power of these relatively independent modalities and, to some extent, ignores the relations among them. We introduce a higher-order tensor framework for video analysis, in which the image, audio and text modalities of each video shot are represented as a data point in the form of a 3rd-order tensor called a tensorshot. We propose a novel dimension reduction method that explicitly considers the manifold structure of the tensor space arising from the temporally associated co-occurrence of multimodal media data, and we then detect video semantic concepts with classifiers that take tensors as input. Our algorithm preserves the intrinsic structure of the submanifold from which tensorshots are sampled and can also map out-of-sample data points directly. Moreover, we apply an active-learning-based contextual and temporal post-refining strategy to improve detection accuracy. Experimental results show that our method improves the performance of video semantic concept detection.
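To make the idea of a 3rd-order tensor per shot concrete, the following is a minimal sketch and not the paper's construction. The abstract does not say how a tensorshot is assembled, so the outer product of three per-shot feature vectors (visual, audio, textual) is an assumption made purely for illustration. The names `make_tensorshot` and `unfold`, and the feature sizes, are hypothetical.

```python
import numpy as np


def make_tensorshot(visual_feat, audio_feat, text_feat):
    """Combine three modality feature vectors into one 3rd-order tensor.

    ASSUMPTION: the outer product is used only as an illustration; the
    abstract does not specify how a tensorshot is actually built.
    """
    return np.einsum("i,j,k->ijk", visual_feat, audio_feat, text_feat)


def unfold(tensor, mode):
    """Mode-n unfolding: bring axis `mode` to the front, flatten the rest."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)


# Hypothetical feature sizes for a single shot.
rng = np.random.default_rng(0)
visual_feat = rng.random(4)
audio_feat = rng.random(3)
text_feat = rng.random(5)

shot = make_tensorshot(visual_feat, audio_feat, text_feat)
print(shot.shape)             # (4, 3, 5)
print(unfold(shot, 1).shape)  # (3, 20)
```

Unlike a single concatenated vector, the tensor keeps one axis per modality. Tensor dimension reduction methods typically work on these axes separately, through unfoldings such as the one above.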

Multi-Layer Multi-Instance Learning for Video Concept Detection

MILC²: A Multi-Layer Multi-Instance Learning Approach to Video Concept Detection

Marginalized multi-layer multi-instance kernel for video concept detection

Multi-instance Kernel Learning with Concept Weights of Instance Space

Exploiting Generalized Discriminative Multiple Instance Learning for Multimedia Semantic Concept Detection

Video Concept Detection Based on Multiple Features and Classifiers Fusion

Semi-supervised multi-instance multi-label learning for video annotation task

Research on Multi-concept Learning Based on Inter-concept Relation

Ensemble Multi-Instance Multi-Label Learning Approach for Video Annotation Task

Multi-Concept Multi-Modality Active Learning for Interactive Video Annotation

Tensor-based transductive learning for multimodality video semantic concept detection

Multi-level Feature Representations for Video Semantic Concept Detection

Multiple Hypergraph Ranking for Video Concept Detection

Interactive Video Annotation By Multi-Concept Multi-Modality Active Learning

Robust Semantic Concept Detection in Large Video Collections

Deep Multi-View Concept Learning

Video Caption Detection Algorithm Based on Multiple Instance Learning

Video Semantic Concept Detection Based on Multi-modality Fusion

Active post-refined multimodality video semantic concept detection with tensor representation

TRECVid 2013 Semantic Video Concept Detection by NTT-MD-DUT

High-Level Video Semantic Concept Detection Based on Multi-level Feature Representations