Multi-camera Temporal Grouping for Play/Break Event Detection in Soccer Games

Chunbo Song, Christopher Rasmussen
DOI: https://doi.org/10.1007/978-3-030-33720-9_18
2019-01-01
Abstract: Many current deep learning approaches to action recognition focus on recognizing concrete (e.g., single-actor) actions in trimmed videos from datasets such as UCF-101 and HMDB-51. However, high-level semantic analysis of sports videos often requires recognizing more abstract events or situations that involve multiple players and longer time-scale context. This paper builds upon inflated 3D (I3D) ConvNets for video action recognition to detect and differentiate six abstract categories of events in untrimmed videos of soccer games from multiple fixed cameras: normal play, plus breaks in play due to kick-offs, free kicks, throw-ins, goal kicks, and corner kicks. Raw video unit classifications by variants of the basic I3D network are post-processed by two novel and efficient grouping methods for localizing the boundaries of events. Our experiments show that the proposed methods achieve 84.2% weighted precision for event categories at the level of video units, and boost event temporal localization mean average precision at 0.5 tIoU (mAP@0.5) to 62.0%.
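The mAP@0.5 figure rests on temporal intersection-over-union (tIoU) between a predicted event interval and a ground-truth one. The following is a minimal sketch of the standard tIoU definition, not the authors' evaluation code; the function names `temporal_iou` and `is_match`, and the representation of events as (start, end) pairs in seconds, are assumptions made for illustration.

```python
def temporal_iou(pred, truth):
    """Temporal IoU of two (start, end) intervals (hypothetical helper)."""
    p_start, p_end = pred
    t_start, t_end = truth
    # Overlap length is zero when the intervals are disjoint.
    intersection = max(0.0, min(p_end, t_end) - max(p_start, t_start))
    union = (p_end - p_start) + (t_end - t_start) - intersection
    return intersection / union if union > 0 else 0.0


def is_match(pred, truth, threshold=0.5):
    """A prediction counts as correct at mAP@0.5 when tIoU >= 0.5."""
    return temporal_iou(pred, truth) >= threshold


if __name__ == "__main__":
    # A predicted throw-in from 12 s to 20 s against ground truth 14 s to 22 s.
    print(temporal_iou((12.0, 20.0), (14.0, 22.0)))
    print(is_match((12.0, 20.0), (14.0, 22.0)))
```

In a full evaluation, a prediction would also need to carry the same event category as the ground-truth interval, and each ground-truth event would be matched at most once; those steps are omitted here.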