Long Video Scoring Method Fusing High-Precision Pose and Spatio-Temporal Attention Modules

Lina Chen, Junbo Zhang, Weijie Wu, Chaoyu Han, Hong Gao
DOI: https://doi.org/10.1007/978-981-97-7232-2_31
2024-01-01
Abstract: Applying artificial intelligence to action quality assessment of sports events has become a prominent research topic in recent years. To overcome the incomplete feature information and the lack of critical feature information in existing video scoring methods, we propose a long video scoring model that fuses high-precision posture and spatio-temporal attention modules. The main contributions of this paper are as follows. (1) Adopt HR-Net. This module extracts high-precision posture position information from specific frames, lets the static stream efficiently supplement the dynamic features, and improves the accuracy of athlete position information. (2) Improve the ACTION-NET network. The spatio-temporal attention module is used as the backbone of the attention mechanism; it provides the model with the importance of each video fragment and enhances the reliability of recognition and score prediction. (3) Conduct ablation experiments to compare the different models. On the MIT-Skate dataset, the experimental results show that the proposed PoseACTION-NET method reaches a predicted scoring correlation coefficient of up to 0.665, a 5% improvement over the ACTION-NET network. On the Rhythmic Gymnastics dataset, the predicted scoring correlation coefficients reach 0.546, 0.668, 0.761 and 0.615 for Ball, Clubs, Hoop and Ribbon, respectively, improvements of 1.8%, 1.1%, 5.3% and 3.7% over the ACTION-NET network. This indicates that the fused posture and spatio-temporal attention modules not only supply additional position information but also considerably improve the model's score prediction capability.
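The abstract reports "scoring correlation coefficients" without naming the coefficient. As an illustration only, the sketch below assumes Spearman's rank correlation between predicted and judge-assigned scores, a common choice in action quality assessment; that choice, the function names, and the sample scores are all hypothetical and not taken from the paper.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(predicted, ground_truth):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(predicted), average_ranks(ground_truth))


# Hypothetical scores for four routines (not data from the paper).
predicted_scores = [5.0, 3.0, 4.0, 1.0]
judge_scores = [4.5, 3.5, 3.0, 1.0]
print(spearman(predicted_scores, judge_scores))  # 0.8
```

A value of 1.0 means the model orders the routines exactly as the judges do, which is why a rank-based coefficient suits scoring tasks where the ordering matters more than the absolute score.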