Abstract:
Introduction: This paper presents an Intelligent Robot Sports Competition Tactical Analysis Model that uses multimodal perception to analyze opponent tactics in sports competitions. Competition analysis requires a comprehensive understanding of opponent strategies, but traditional methods are often limited to a single data source or modality, which restricts their ability to capture the details of opponent tactics.
Methods: Our system integrates the Swin Transformer and CLIP models and uses cross-modal transfer learning to observe and analyze opponent tactics holistically. The Swin Transformer learns opponent action postures and behavioral patterns in basketball or football games, while the CLIP model improves the system's comprehension of opponent tactical information by establishing semantic associations between images and text. To address potential imbalances and biases between these models, we introduce a cross-modal transfer learning technique that mitigates modal bias and thereby improves the model's generalization on multimodal data.
Results: Through cross-modal transfer learning, tactical information learned from images by the Swin Transformer is transferred to the CLIP model, providing coaches and athletes with comprehensive tactical insights. The method is tested and validated on the Sport UV, Sports-1M, HMDB51, and NPU RGB+D datasets. Experimental results show strong performance in prediction accuracy, stability, training time, inference time, number of parameters, and computational complexity. Notably, the system outperforms other models, with an 8.47% lower prediction error (MAE) on the Kinetics dataset and a 72.86-second reduction in training time.
Discussion: The presented system is well suited to real-time sports competition assistance and analysis, and offers a new approach to an Intelligent Robot Sports Competition Tactical Analysis Model built on multimodal perception. By combining the Swin Transformer and CLIP models, it addresses the single-modality limitation of traditional methods. The model supports comprehensive tactical analysis in sports, benefiting coaches, athletes, and sports enthusiasts alike.
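The image-text association attributed to CLIP above can be illustrated with a minimal sketch. It is not the authors' implementation: the embeddings are hypothetical toy vectors standing in for encoder outputs, the tactic prompts and the `match_tactic` helper are invented for illustration, and the temperature of 0.07 is an assumed value. The sketch only shows the general idea of scoring one frame embedding against several text embeddings by cosine similarity and normalizing the scores with a softmax.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def match_tactic(image_embedding, text_embeddings, temperature=0.07):
    """Return a probability per text label for one image embedding.

    `temperature` is an assumed scaling value, not taken from the paper.
    """
    labels = list(text_embeddings)
    logits = [
        cosine(image_embedding, text_embeddings[label]) / temperature
        for label in labels
    ]
    return dict(zip(labels, softmax(logits)))


# Hypothetical toy embeddings; real ones would come from trained encoders.
frame = [0.9, 0.1, 0.2]
prompts = {
    "fast break": [1.0, 0.0, 0.1],
    "zone defense": [0.0, 1.0, 0.0],
    "pick and roll": [0.1, 0.2, 1.0],
}

probs = match_tactic(frame, prompts)
best = max(probs, key=probs.get)
print(best, round(probs[best], 3))
```

In a real system the toy vectors would be replaced by the outputs of the visual and text encoders, and how tactical knowledge is transferred between the two models would follow the paper's own method, which the abstract does not detail.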

Sports competition tactical analysis model of cross-modal transfer learning intelligent robot based on Swin Transformer and CLIP
