Abstract: This study introduces an optimal topology of vision transformers for real-time video action recognition in a cloud-based solution. Although model performance is a key criterion for real-time video analysis use cases, inference latency plays a more crucial role in adopting such technology in real-world scenarios. Our objective is to reduce the inference latency of the solution while admissibly maintaining the vision transformer’s performance. Thus, we employed the optimal cloud components as the foundation of our machine learning pipeline and optimized the topology of vision transformers. We utilized UCF101, including more than one million action recognition video clips. The modeling pipeline consists of a preprocessing module to extract frames from video clips, training two-dimensional (2D) vision transformer models, and deep learning baselines. The pipeline also includes a postprocessing step to aggregate the frame-level predictions to generate the video-level predictions at inference. The results demonstrate that our optimal vision transformer model with an input dimension of 56 × 56 × 3 with eight attention heads produces an F1 score of 91.497% for the testing set. The optimized vision transformer reduces the inference latency by 40.70%, measured through a batch-processing approach, with a 55.63% faster training time than the baseline. Lastly, we developed an enhanced skip-frame approach to improve the inference latency by finding an optimal ratio of frames for prediction at inference, where we could further reduce the inference latency by 57.15%. This study reveals that the vision transformer model is highly optimizable for inference latency while maintaining the model performance.
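The postprocessing step mentioned in the abstract, which turns frame-level predictions into one video-level prediction, can be illustrated with a minimal sketch. The abstract does not state which aggregation rule the authors use, so averaging per-frame class probabilities is an assumption here, and the function name `aggregate_video_predictions` is hypothetical.

```python
def aggregate_video_predictions(frame_probs):
    """Average per-frame class probabilities into one video-level prediction.

    frame_probs: list of per-frame probability lists, one entry per class.
    Returns (predicted_class_index, mean_probabilities).
    Assumption: mean pooling; the paper's actual rule is not given here.
    """
    if not frame_probs:
        raise ValueError("need at least one frame prediction")

    num_classes = len(frame_probs[0])
    sums = [0.0] * num_classes
    for probs in frame_probs:
        if len(probs) != num_classes:
            raise ValueError("all frames must have the same number of classes")
        for i, p in enumerate(probs):
            sums[i] += p

    mean_probs = [s / len(frame_probs) for s in sums]
    # The video-level label is the class with the highest mean probability.
    predicted = max(range(num_classes), key=mean_probs.__getitem__)
    return predicted, mean_probs


if __name__ == "__main__":
    # Three frames, three classes (illustrative numbers only).
    frames = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1], [0.1, 0.8, 0.1]]
    label, probs = aggregate_video_predictions(frames)
    print(label, probs)
```

Mean pooling lets confident frames outweigh uncertain ones. A majority vote over per-frame labels would be an equally plausible choice under the same description.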

What problem does this paper attempt to address?

The problem that this paper attempts to solve is: in real-time video action recognition, how to optimize the topology of the Vision Transformer to achieve low inference latency in the cloud environment while maintaining model performance. Specifically, the paper focuses on the following aspects:

1. **Inference Latency in Real-Time Applications**:
   - In real-time video analysis, although model performance is a key criterion, inference latency (i.e., the time from input to output prediction) is more important for practical deployment. Therefore, the goal of the research is to reduce inference latency within an acceptable performance range.
2. **Optimization of Vision Transformer**:
   - The author reduces inference latency by optimizing the topology of the Vision Transformer. This includes selecting the optimal cloud computing components and adjusting the parameters of the Vision Transformer, such as the input frame size and the number of multi-head self-attention modules.
3. **Enhanced Frame-Skipping Mechanism**:
   - A new enhanced frame-skipping method is proposed, which further reduces inference latency by finding the optimal frame prediction ratio during inference.

### Research Background and Motivation

With the development of deep learning, computer vision technology has made significant progress in video analysis, action recognition and other fields. However, many existing solutions mainly focus on improving model performance and ignore the key factor of inference latency. Especially in real-time applications deployed in the cloud, low latency is the key to ensuring system response speed and user experience. Therefore, this paper aims to explore how to reduce inference latency by optimizing the topology of the Vision Transformer while maintaining model performance.

### Main Contributions

1. **Application of Vision Transformer**:
   - Use the Vision Transformer as the state-of-the-art technology for real-time video action recognition.
2. **Topology Optimization**:
   - Simplify the architecture of the Vision Transformer by reducing the input frame size and the number of multi-head self-attention modules, thereby reducing inference latency and maintaining high accuracy.
3. **Optimal Cloud Instance Selection**:
   - Select the optimal cloud instance to support the efficient training and inference processes.
4. **Enhanced Frame-Skipping Mechanism**:
   - Propose a new enhanced frame-skipping mechanism, which further reduces inference latency.

### Experimental Results

- The experimental results of the optimized Vision Transformer model on the UCF101 dataset show that the model with an input size of 56×56×3 and 8 attention heads achieves an F1 score of 91.497%.
- The optimized model reduces inference latency by 40.70% and the training time is 55.63% faster than the baseline model.
- Through the enhanced frame-skipping mechanism, the inference latency is further reduced by 57.15%.

In summary, this research shows the high optimizability of the Vision Transformer model in terms of inference latency while maintaining model performance, which is suitable for cloud solutions for real-time video action recognition.
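The frame-skipping idea summarized above (predict on only a fraction of the frames, and look for the fraction that keeps performance acceptable) can be sketched as follows. The summary does not describe how the authors' enhanced mechanism works, so this is a plain uniform-stride version under stated assumptions. `select_frames`, `pick_keep_ratio`, the `evaluate` callback, and the `max_f1_drop` tolerance are all hypothetical names introduced for illustration.

```python
def select_frames(num_frames, keep_ratio):
    """Return evenly spaced frame indices, keeping about keep_ratio of frames."""
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    if num_frames <= 0:
        return []
    keep = max(1, round(num_frames * keep_ratio))
    # Integer arithmetic avoids floating-point drift in the index positions.
    return [(i * num_frames) // keep for i in range(keep)]


def pick_keep_ratio(candidates, evaluate, max_f1_drop):
    """Pick the smallest ratio whose score stays within max_f1_drop of full-frame.

    evaluate: hypothetical callback mapping a keep ratio to a validation F1
    score (for example, by running the model on the frames that
    select_frames returns and aggregating to video level).
    """
    baseline = evaluate(1.0)
    for ratio in sorted(candidates):
        if baseline - evaluate(ratio) <= max_f1_drop:
            return ratio
    return 1.0  # No candidate was acceptable, so fall back to all frames.


if __name__ == "__main__":
    print(select_frames(10, 0.5))  # every second frame of a 10-frame clip
```

Fewer selected frames means fewer forward passes per clip, which is where the latency saving comes from. The tolerance makes the accuracy trade-off explicit instead of fixing a stride by hand.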

Optimal Topology of Vision Transformer for Real-Time Video Action Recognition in an End-To-End Cloud Solution
