A Multi-Modal Transformer Approach for Football Event Classification

Yixiao Zhang, Baihua Li, Hui Fang, Qinggang Meng
DOI: https://doi.org/10.1109/icip49359.2023.10223172
2023-01-01
Abstract: Video understanding has been enhanced by the use of multi-modal networks. However, recent multi-modal video analysis models have limited applicability to sports videos because of the specialised nature of such footage. This paper proposes a novel attention-based multi-modal neural network for sports event classification, trained with a multi-stage fusion strategy. The network integrates three modalities to improve sports video classification performance: an image sequence modality, an audio modality, and a newly proposed sports formation modality. Empirical results show that the proposed model outperforms the state-of-the-art transformer-based video method by 4.43% in top-1 accuracy on the SoccerNet-v2 dataset.
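To make the idea of attention-based fusion over three modalities concrete, the following is a minimal sketch and not the authors' architecture. It assumes each modality has already been encoded into token embeddings of a shared width, concatenates them, applies one single-head self-attention layer, mean-pools, and classifies. The function names, token counts, embedding width, class count, and random weights are all hypothetical, and the multi-stage fusion training strategy is not modelled.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_and_classify(image_tokens, audio_tokens, formation_tokens, params):
    # Assumption: all modalities share embedding width d; stack them as one sequence.
    tokens = np.concatenate([image_tokens, audio_tokens, formation_tokens], axis=0)
    q, k, v = tokens @ params["Wq"], tokens @ params["Wk"], tokens @ params["Wv"]
    # Scaled dot-product self-attention lets every token attend across modalities.
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    fused = attn @ v
    # Assumption: mean pooling followed by a linear classifier head.
    logits = fused.mean(axis=0) @ params["Wc"]
    return softmax(logits)

def top1_accuracy(probs, labels):
    # Fraction of samples whose highest-probability class equals the label.
    return float(np.mean(np.argmax(probs, axis=1) == labels))

rng = np.random.default_rng(0)
d, num_classes = 16, 5  # hypothetical sizes
params = {
    "Wq": rng.normal(size=(d, d)),
    "Wk": rng.normal(size=(d, d)),
    "Wv": rng.normal(size=(d, d)),
    "Wc": rng.normal(size=(d, num_classes)),
}
probs = fuse_and_classify(
    rng.normal(size=(8, d)),  # image sequence tokens (hypothetical)
    rng.normal(size=(4, d)),  # audio tokens (hypothetical)
    rng.normal(size=(2, d)),  # formation tokens (hypothetical)
    params,
)
```

Here `probs` is a probability vector over the hypothetical event classes for one clip, and `top1_accuracy` shows how the top-1 metric quoted in the abstract is computed from a batch of such vectors.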