Abstract: Convolutional neural networks (CNNs) and Transformer architectures have traditionally been the preferred models for computer vision tasks. Recently, however, networks based on multi-layer perceptron (MLP) structures that rely on neither convolution nor attention have grown in popularity. These MLP architectures have performed well on image classification, with lower time complexity at comparably high accuracy. Video classification, by contrast, involves larger amounts of data and requires more intricate feature extraction, which increases time and resource consumption. To improve computational efficiency and reduce resource use, we propose a convolution-free and Transformer-free architecture for video classification called Video-MLP. Video-MLP uses a simple MLP structure to learn video features. Specifically, it comprises two types of layers, Spatial-Mixer and Temporal-Mixer, which capture spatial and temporal information respectively. The Spatial-Mixer extracts spatial information from each frame along the height and width dimensions, while the Temporal-Mixer models temporal information for the same spatial positions across frames. To improve the efficiency of spatial-temporal modeling in our network, we use a spatial-temporal information fusion approach that integrates information at different scales. Additionally, we group the input data along the time dimension and design three different grouping schemes for extracting temporal information. The experimental results indicate that Video-MLP achieves accuracy rates of 87.2% on the Kinetics-400 dataset and 75.3% on the SomethingV2 dataset, outperforming models with equivalent computational complexity. Notably, Video-MLP achieves these results without using convolution or attention mechanisms, and without pre-training on large-scale image or video datasets.
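The division of labor between the two layer types can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the layer internals, so the function names (`mix_along_axis`, `spatial_mixer`, `temporal_mixer`), the `(T, H, W, C)` tensor layout, the single linear map per axis with a residual connection (no normalization or non-linearity), and the contiguous temporal grouping are all assumptions made for illustration. In particular, the contiguous grouping is only one hypothetical scheme and is not claimed to be one of the three schemes used in the paper.

```python
import numpy as np


def mix_along_axis(x, weight, axis):
    """Apply a linear map to one axis of x; all other axes are left untouched."""
    moved = np.moveaxis(x, axis, -1)      # bring the target axis to the end
    mixed = moved @ weight                # weight has shape (n_in, n_out)
    return np.moveaxis(mixed, -1, axis)   # restore the original axis order


def spatial_mixer(x, w_h, w_w):
    """Mix each frame along height and width (assumed layout: T, H, W, C)."""
    # Frames are never combined here, so the layer is purely spatial.
    return x + mix_along_axis(x, w_h, axis=1) + mix_along_axis(x, w_w, axis=2)


def temporal_mixer(x, w_t, group_size):
    """Mix the same spatial position across frames, within contiguous groups."""
    t = x.shape[0]
    if t % group_size != 0:
        raise ValueError("number of frames must be divisible by group_size")
    # Hypothetical grouping: frames [0..g-1], [g..2g-1], ... form separate groups.
    grouped = x.reshape(t // group_size, group_size, *x.shape[1:])
    mixed = mix_along_axis(grouped, w_t, axis=1)
    return x + mixed.reshape(x.shape)


# Example: 8 frames of 4x6 "patches" with 3 channels, random illustrative weights.
rng = np.random.default_rng(0)
video = rng.normal(size=(8, 4, 6, 3))
out = spatial_mixer(video, rng.normal(size=(4, 4)), rng.normal(size=(6, 6)))
out = temporal_mixer(out, rng.normal(size=(4, 4)), group_size=4)
print(out.shape)
```

Because each mixer acts on a single axis, the weight matrices scale with the axis length (H, W, or the group size) rather than with the full number of spatio-temporal positions, which is one plausible reading of where the efficiency gain described above comes from.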

Multi-Layer Transformer for Video Classification.

Multi-semantic Representation with Transformer Network for Video Classification.

TransVOS: Video Object Segmentation with Transformers

A Multi-scale Multi-modal Multi-dimension Joint Transformer for Two-Stream Action Classification.

Transformer Video Classification algorithm based on video token-to-token.

Video-MLP: Convolution-Free, Attention-Free Architecture for Video Classification

Token Shift Transformer for Video Classification

Multi-Stream Multi-Class Fusion of Deep Networks for Video Classification.

Multi-Modal Fusion Transformer for Multivariate Time Series Classification

Fusing Multi-Stream Deep Networks for Video Classification

Multi-Scale Temporal Difference Transformer for Video-Text Retrieval

Space or time for video classification transformers

MM-ViT: Multi-Modal Video Transformer for Compressed Video Action Recognition

HaViT: Hybrid-Attention Based Vision Transformer for Video Classification

MultiScale spectral–spatial convolutional transformer for hyperspectral image classification

A Multi-Modal Transformer Approach for Football Event Classification

MgMViT: Multi-Granularity and Multi-Scale Vision Transformer for Efficient Action Recognition

Convolutional transformer network for fine-grained action recognition

Multi-entity Video Transformers for Fine-Grained Video Representation Learning

Efficient Multiscale Multimodal Bottleneck Transformer for Audio-Video Classification