Abstract:Videos are inherently multimodal. This paper studies the problem of how to fully exploit the abundant multimodal clues for improved video categorization. We introduce a hybrid deep learning framework that integrates useful clues from multiple modalities, including static spatial appearance information, motion patterns within a short time window, audio information as well as long-range temporal dynamics. More specifically, we utilize three Convolutional Neural Networks (CNNs) operating on appearance, motion and audio signals to extract their corresponding features. We then employ a feature fusion network to derive a unified representation with an aim to capture the relationships among features. Furthermore, to exploit the long-range temporal dynamics in videos, we apply two Long Short Term Memory networks with extracted appearance and motion features as inputs. Finally, we also propose to refine the prediction scores by leveraging contextual relationships among video semantics. The hybrid deep learning framework is able to exploit a comprehensive set of multimodal features for video classification. Through an extensive set of experiments, we demonstrate that (1) LSTM networks which model sequences in an explicitly recurrent manner are highly complementary with CNN models; (2) the feature fusion network which produces a fused representation through modeling feature relationships outperforms alternative fusion strategies; (3) the semantic context of video classes can help further refine the predictions for improved performance. Experimental results on two challenging benchmarks, the UCF-101 and the Columbia Consumer Videos (CCV), provide strong quantitative evidence that our framework achieves promising results: $93.1\%$ on the UCF-101 and $84.5\%$ on the CCV, outperforming competing methods with clear margins.

Multimodal Salient Objects: General Building Blocks Of Semantic Video Concepts

An Unsupervised Video Summarization Method Based on Multimodal Representation.

Non-rigid Video Object Segmentation Based on Semantic Multi-level Framework.

Video Semantic Concept Detection Using Multi-Modality Subspace Correlation Propagation

A Statistics-Based Method For Video Semantic Analysis

Salient Objects: Semantic Building Blocks For Image Concept Interpretation

An HMM-based framework for video semantic analysis

Learning Spatial-Semantic Features for Robust Video Object Segmentation

Modality Mixture Projections for Semantic Video Event Detection

Robust Semantic Concept Detection in Large Video Collections

Semantic Multimodal Violence Detection Based on Local-to-global Embedding

Modeling Multimodal Clues in a Hybrid Deep Learning Framework for Video Classification

News Video Retrieval By Learning Multimodal Semantic Information

Automatic image annotation by using concept-sensitive salient objects for image content representation.

Statistical Modeling and Conceptualization of Natural Images

Automatic Moving Object Extraction toward Content-Based Video Representation and Indexing

Semantic Video Classification And Feature Subset Selection Under Context And Concept Uncertainty

Active post-refined multimodality video semantic concept detection with tensor representation.

SAM: Modeling Scene, Object and Action with Semantics Attention Modules for Video Recognition

Submodular video object proposal selection for semantic object segmentation

Video retrieval with multi-modal features.