Abstract: Videos are inherently multimodal. This paper studies how to fully exploit their abundant multimodal clues for improved video categorization. We introduce a hybrid deep learning framework that integrates clues from multiple modalities: static spatial appearance, motion patterns within a short time window, audio, and long-range temporal dynamics. Specifically, we use three Convolutional Neural Networks (CNNs) operating on appearance, motion, and audio signals to extract the corresponding features. We then employ a feature fusion network to derive a unified representation that captures the relationships among features. To exploit long-range temporal dynamics in videos, we apply two Long Short-Term Memory (LSTM) networks that take the extracted appearance and motion features as inputs. Finally, we propose refining the prediction scores by leveraging contextual relationships among video semantics. The resulting framework exploits a comprehensive set of multimodal features for video classification. Through an extensive set of experiments, we demonstrate that (1) LSTM networks, which model sequences in an explicitly recurrent manner, are highly complementary to CNN models; (2) the feature fusion network, which produces a fused representation by modeling feature relationships, outperforms alternative fusion strategies; and (3) the semantic context of video classes helps further refine the predictions. Experimental results on two challenging benchmarks, UCF-101 and Columbia Consumer Videos (CCV), provide strong quantitative evidence for the framework: it achieves $93.1\%$ on UCF-101 and $84.5\%$ on CCV, outperforming competing methods by clear margins.
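To make the final stages of the pipeline concrete, the following is a minimal sketch of how per-branch class scores might be combined and then refined with class context. It is not the authors' implementation: the abstract does not specify the combination rule or the form of the context model, so the weighted average in `late_fusion`, the affinity matrix `context`, the blending factor `alpha`, and all function names are assumptions made for illustration only.

```python
import math


def softmax(scores):
    """Turn raw class scores into probabilities (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(stream_probs, weights):
    """Weighted average of per-branch class probability vectors.

    Assumption: a simple weighted average stands in for whatever
    combination rule the paper actually uses.
    """
    if len(stream_probs) != len(weights):
        raise ValueError("one weight per branch is required")
    total_w = sum(weights)
    num_classes = len(stream_probs[0])
    return [
        sum(w * p[c] for p, w in zip(stream_probs, weights)) / total_w
        for c in range(num_classes)
    ]


def refine_with_context(probs, context, alpha=0.5):
    """Blend each class score with the scores of related classes.

    context[i][j] is a hypothetical affinity of class j to class i.
    alpha = 0 keeps the original scores; alpha = 1 uses only context.
    The result is renormalized so it sums to 1.
    """
    n = len(probs)
    propagated = [
        sum(context[i][j] * probs[j] for j in range(n)) for i in range(n)
    ]
    mixed = [(1 - alpha) * p + alpha * q for p, q in zip(probs, propagated)]
    total = sum(mixed)
    return [m / total for m in mixed]


# Toy scores for three classes from three branches (made-up numbers):
# the fusion network and the two LSTMs (appearance and motion).
fusion_net = softmax([2.0, 1.0, 0.1])
lstm_appearance = softmax([1.5, 1.6, 0.2])
lstm_motion = softmax([2.2, 0.3, 0.5])

combined = late_fusion(
    [fusion_net, lstm_appearance, lstm_motion], weights=[1.0, 1.0, 1.0]
)

# Hypothetical class-affinity matrix: classes 0 and 1 are related.
context = [
    [0.8, 0.2, 0.0],
    [0.2, 0.8, 0.0],
    [0.0, 0.0, 1.0],
]
refined = refine_with_context(combined, context, alpha=0.5)
predicted_class = max(range(len(refined)), key=lambda c: refined[c])
```

In this sketch the three branches are weighted equally. In practice the weights and `alpha` would need to be tuned on held-out data, and the affinity matrix would need to be estimated from the training labels or from classifier outputs rather than set by hand.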

Hybrid Improvements in Multimodal Analysis for Deep Video Understanding

Joint Learning for Relationship and Interaction Analysis in Video with Multimodal Feature Fusion

Multimodal Analysis for Deep Video Understanding with Video Language Transformer

Deep Relationship Analysis in Video with Multimodal Feature Fusion

Deep Vision Multimodal Learning: Methodology, Benchmark, and Trend

Multilevel Spatial-Temporal Feature Aggregation for Video Object Detection

Query-aware Long Video Localization and Relation Discrimination for Deep Video Understanding

Modeling Multimodal Clues in a Hybrid Deep Learning Framework for Video Classification

Deep Video Understanding with Video-Language Model

Exploring Inter-feature and Inter-class Relationships with Deep Neural Networks for Video Classification

LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

Beyond Training: Dynamic Token Merging for Zero-Shot Video Understanding

Towards Long Video Understanding via Fine-detailed Video Story Generation

ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System

Bridging the Gap: A Unified Video Comprehension Framework for Moment Retrieval and Highlight Detection

Modeling Spatial-Temporal Clues in a Hybrid Deep Learning Framework for Video Classification

Multimodal Fusion and Coherence Modeling for Video Topic Segmentation

Making Every Frame Matter: Continuous Video Understanding for Large Models via Adaptive State Modeling

Multimodal Deep Representation Learning for Video Classification

Video Content Categorization Using the Double Decomposition

Rethinking the constraints of multimodal fusion: case study in Weakly-Supervised Audio-Visual Video Parsing