Abstract: Information fusion is frequently employed to integrate diverse inputs, such as sensory data, features, or decisions, in order to exploit the advantageous relationships among features and classifiers. This paper presents a novel approach to video classification using deep learning architectures, including ConvLSTM and vision transformer based fusion architectures, that combines spatial and temporal features and applies decision fusion at multiple levels. The proposed vision transformer based method uses a 3D CNN to extract spatio-temporal information and several attention mechanisms to focus on the features essential for action recognition, and thus learns spatio-temporal dependencies effectively. The proposed methods are validated through empirical evaluations on two well-known video classification datasets, UCF-101 and KTH. The experimental findings indicate that using both spatial and temporal features is essential, with the best performance obtained when temporal features serve as the primary source of features in conjunction with two distinct types of spatial features. The proposed multi-level decision fusion approach produces results comparable to those of feature fusion methods while requiring less memory and computation. The fusion of RGB, HOG, and optical flow representations achieved the best performance among the fusion methods examined in this study. The vision transformer based approaches also significantly outperformed the ConvLSTM based approaches. Furthermore, an ablation study was conducted to compare the video classification performance of the vision transformer based feature fusion approaches.
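
To make the distinction between feature fusion and multi-level decision fusion concrete, the following is a minimal sketch only, not the method evaluated in this paper. It assumes three streams (RGB, HOG, optical flow) that each produce raw class scores, and it assumes a two-level scheme in which the two spatial streams are fused first and the result is then fused with the temporal stream. The function names, the equal weighting, and the example scores are all hypothetical.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_features(*feature_vectors):
    """Feature-level fusion: concatenate the per-stream feature vectors."""
    fused = []
    for vec in feature_vectors:
        fused.extend(vec)
    return fused


def fuse_decisions(prob_lists, weights=None):
    """Decision-level fusion: weighted average of per-stream class probabilities."""
    if weights is None:
        weights = [1.0] * len(prob_lists)  # assumption: equal weights
    total_w = sum(weights)
    n_classes = len(prob_lists[0])
    return [
        sum(w * p[c] for w, p in zip(weights, prob_lists)) / total_w
        for c in range(n_classes)
    ]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Hypothetical raw scores for one video and three classes (illustrative only).
rgb_probs = softmax([2.0, 1.0, 0.1])
hog_probs = softmax([1.5, 1.2, 0.3])
flow_probs = softmax([0.5, 2.5, 0.2])

# Level 1 (assumed): fuse the two spatial streams.
spatial_probs = fuse_decisions([rgb_probs, hog_probs])
# Level 2 (assumed): fuse the spatial result with the temporal stream.
final_probs = fuse_decisions([spatial_probs, flow_probs])
predicted_class = argmax(final_probs)
```

In this sketch, decision fusion only ever combines vectors whose length equals the number of classes, whereas `fuse_features` produces a vector whose length is the sum of all stream feature sizes. This is one way to see why decision fusion can need less memory and computation.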
