Abstract: As a globally celebrated sport, soccer has attracted widespread interest from fans all over the world. This paper aims to develop a comprehensive multi-modal framework for soccer video understanding. Specifically, we make the following contributions: (i) we introduce SoccerReplay-1988, the largest multi-modal soccer dataset to date, featuring videos and detailed annotations from 1,988 complete matches, with an automated annotation pipeline; (ii) we present the first visual-language foundation model in the soccer domain, MatchVision, which leverages spatiotemporal information across soccer videos and excels in various downstream tasks; (iii) we conduct extensive experiments and ablation studies on action classification, commentary generation, and multi-view foul recognition, and demonstrate state-of-the-art performance on all of them, substantially outperforming existing models, which demonstrates the superiority of our proposed data and model. We believe that this work will offer a standard paradigm for sports understanding research. The code and model will be publicly available for reproduction.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

The paper aims to develop a comprehensive multimodal framework for understanding soccer videos. Specifically, it addresses the following problems:

1. **Limitations of the dataset**:
   - Existing soccer video datasets (such as the SoccerNet series) cover 500 full-match videos, and work built on them mainly designs dedicated models for specific tasks, resulting in poor compatibility between models. This limits the ability to comprehensively understand and analyze soccer videos.
2. **Generality and adaptability of the model**:
   - Existing research mainly focuses on specific tasks, such as event classification and commentary generation, and lacks a general-purpose model that can handle multiple tasks uniformly. As a result, the model must be retrained or adjusted when switching between tasks, which increases complexity and cost.
3. **Data quality and diversity**:
   - The quality and diversity of existing datasets are limited and cannot fully support complex soccer video understanding tasks. For example, detailed annotations, multi-view videos, and the integration of modern soccer rules (such as VAR) are lacking.

### Main contributions of the paper

To solve the above problems, the paper makes the following contributions:

1. **Constructing a large-scale multimodal dataset**:
   - SoccerReplay-1988 is introduced, the largest multimodal soccer dataset to date, containing videos and detailed annotations of 1,988 full matches. This dataset provides a solid foundation for developing more powerful soccer understanding models.
2. **Proposing a vision-language foundation model**:
   - MatchVision is developed, the first vision-language foundation model specifically for the soccer domain. MatchVision effectively uses the spatiotemporal information in soccer videos and performs well in various downstream tasks, such as event classification and commentary generation.
3. **Establishing a more comprehensive benchmark**:
   - Based on the SoccerReplay-1988 dataset, a more comprehensive and challenging benchmark is established to evaluate the performance of soccer understanding models. The benchmark is not only larger in scale but also contains finer-grained event labels and rich text commentary.
4. **Extensive experiments and ablation studies**:
   - Extensive experiments and ablation studies verify the performance of the proposed dataset and model in various downstream tasks, reaching state-of-the-art results on existing benchmarks.

### Conclusion

By constructing a large-scale multimodal dataset and developing a general-purpose vision-language foundation model, the paper addresses the data limitations and model generality problems in existing research, providing a new paradigm for soccer video understanding research.

Towards Universal Soccer Video Understanding
