Optimal Topology of Vision Transformer for Real-Time Video Action Recognition in an End-To-End Cloud Solution

Saman Sarraf, Milton Kabia
DOI: https://doi.org/10.3390/make5040067
2023-09-29
Machine Learning and Knowledge Extraction
Abstract: This study introduces an optimal topology of vision transformers for real-time video action recognition in a cloud-based solution. Although model performance is a key criterion for real-time video analysis use cases, inference latency plays a more crucial role in adopting such technology in real-world scenarios. Our objective is to reduce the inference latency of the solution while admissibly maintaining the vision transformer’s performance. Thus, we employed the optimal cloud components as the foundation of our machine learning pipeline and optimized the topology of vision transformers. We utilized UCF101, including more than one million action recognition video clips. The modeling pipeline consists of a preprocessing module to extract frames from video clips, training two-dimensional (2D) vision transformer models, and deep learning baselines. The pipeline also includes a postprocessing step to aggregate the frame-level predictions to generate the video-level predictions at inference. The results demonstrate that our optimal vision transformer model with an input dimension of 56 × 56 × 3 with eight attention heads produces an F1 score of 91.497% for the testing set. The optimized vision transformer reduces the inference latency by 40.70%, measured through a batch-processing approach, with a 55.63% faster training time than the baseline. Lastly, we developed an enhanced skip-frame approach to improve the inference latency by finding an optimal ratio of frames for prediction at inference, where we could further reduce the inference latency by 57.15%. This study reveals that the vision transformer model is highly optimizable for inference latency while maintaining the model performance.
What problem does this paper attempt to address?
The problem this paper addresses is how to optimize the topology of the Vision Transformer for real-time video action recognition so that it achieves low inference latency in a cloud environment while maintaining model performance. Specifically, the paper focuses on the following aspects:

1. **Inference Latency in Real-Time Applications**:
   - In real-time video analysis, model performance is a key criterion, but inference latency (the time from input to output prediction) matters more for practical deployment. The goal of the research is therefore to reduce inference latency while keeping performance within an acceptable range.
2. **Optimization of Vision Transformer**:
   - The authors reduce inference latency by optimizing the topology of the Vision Transformer. This includes selecting the optimal cloud computing components and adjusting the parameters of the Vision Transformer, such as the input frame size and the number of multi-head self-attention modules.
3. **Enhanced Frame-Skipping Mechanism**:
   - A new enhanced frame-skipping method is proposed, which further reduces inference latency by finding the optimal ratio of frames used for prediction during inference.

### Research Background and Motivation

With the development of deep learning, computer vision technology has made significant progress in video analysis, action recognition, and other fields. However, many existing solutions focus mainly on improving model performance and ignore inference latency. In real-time applications deployed in the cloud, low latency is essential to system response speed and user experience. This paper therefore explores how to reduce inference latency by optimizing the topology of the Vision Transformer while maintaining model performance.

### Main Contributions

1. **Application of Vision Transformer**:
   - Use the Vision Transformer as the state-of-the-art technology for real-time video action recognition.
2. **Topology Optimization**:
   - Simplify the architecture of the Vision Transformer by reducing the input frame size and the number of multi-head self-attention modules, thereby reducing inference latency while maintaining high accuracy.
3. **Optimal Cloud Instance Selection**:
   - Select the optimal cloud instance to support efficient training and inference.
4. **Enhanced Frame-Skipping Mechanism**:
   - Propose a new enhanced frame-skipping mechanism, which further reduces inference latency.

### Experimental Results

- On the UCF101 dataset, the optimized Vision Transformer model with an input size of 56×56×3 and 8 attention heads achieves an F1 score of 91.497% on the testing set.
- The optimized model reduces inference latency by 40.70%, and its training time is 55.63% faster than that of the baseline model.
- With the enhanced frame-skipping mechanism, the inference latency is further reduced by 57.15%.

In summary, this research shows that the Vision Transformer model is highly optimizable for inference latency while maintaining model performance, which makes it suitable for cloud solutions for real-time video action recognition.
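
The frame-skipping idea and the frame-to-video postprocessing step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the evenly spaced sampling, the majority-vote aggregation, and the names `select_frames`, `predict_video`, `keep_ratio`, and `classify_frame` are all assumptions made for illustration, since the text above does not specify how frames are chosen or how frame-level predictions are combined.

```python
from collections import Counter
from typing import Callable, List, Sequence


def select_frames(frames: Sequence, keep_ratio: float) -> List:
    """Keep roughly `keep_ratio` of the frames, evenly spaced (assumed strategy)."""
    if not frames:
        raise ValueError("video has no frames")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    # Always keep at least one frame so a prediction can be made.
    n_keep = max(1, round(len(frames) * keep_ratio))
    step = len(frames) / n_keep
    return [frames[int(i * step)] for i in range(n_keep)]


def predict_video(
    frames: Sequence,
    classify_frame: Callable[[object], str],
    keep_ratio: float = 1.0,
) -> str:
    """Classify only the selected frames, then aggregate the frame-level
    labels into one video-level label by majority vote (assumed rule)."""
    votes = Counter(classify_frame(f) for f in select_frames(frames, keep_ratio))
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Stand-in "video": ten frame indices, and a toy frame classifier.
    video = list(range(10))
    toy_classifier = lambda f: "walk" if f < 7 else "jump"
    print(predict_video(video, toy_classifier, keep_ratio=0.5))
```

Under these assumptions, lowering `keep_ratio` reduces the number of model calls per video roughly in proportion, which is the source of the latency saving. Finding the ratio that still preserves video-level accuracy would require evaluating several ratios on held-out data.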