An enhanced real-time human pose estimation method based on modified YOLOv8 framework

Chengang Dong, Guodong Du
DOI: https://doi.org/10.1038/s41598-024-58146-z
IF: 4.6
2024-04-07
Scientific Reports
Abstract: Deep-learning-based human pose estimation (HPE) aims to accurately estimate and predict human body posture in images or videos using deep neural networks. However, the accuracy of real-time HPE still needs improvement because of factors such as partial occlusion of body parts and the limited receptive field of the model. To alleviate the accuracy loss caused by these issues, this paper proposes a real-time HPE model called CCAM-Person, based on the YOLOv8 framework. First, we improve the backbone and neck of the YOLOv8x-pose real-time HPE model to alleviate feature loss and receptive field constraints. Second, we introduce the context coordinate attention module (CCAM) to augment the model's focus on salient features, reduce background noise interference, alleviate keypoint regression failure caused by limb occlusion, and improve the accuracy of pose estimation. Our approach attains competitive results on multiple metrics of two open-source datasets, MS COCO 2017 and CrowdPose. Compared with the baseline model YOLOv8x-pose, CCAM-Person improves the average precision by 2.8% and 3.5% on the two datasets, respectively.
multidisciplinary sciences
What problem does this paper attempt to address?
The paper aims to address two key issues in real-time Human Pose Estimation (HPE):

1. **Inaccurate keypoint localization due to limited receptive field**: Existing real-time HPE methods localize keypoints inaccurately because of a limited receptive field or the loss of original features.
2. **Pose estimation failure due to occlusion**: When human body parts are occluded, existing methods struggle to estimate the pose accurately.

To solve these problems, the paper proposes CCAM-Person, an improved model based on the YOLOv8 framework. The model is optimized in three ways:

1. **Multi-Scale Receptive Field Module (MRF)**: Introduced in the backbone to aggregate more low-level features, improving the accuracy of human pose estimation at different scales.
2. **Multi-Path Feature Pyramid Network (MFPN)**: Replaces the original PANet structure to achieve more efficient cross-layer feature fusion, reducing information loss.
3. **Context Coordinate Attention Module (CCAM)**: Enhances the focus on salient features, reduces background noise interference, and alleviates keypoint regression failure caused by limb occlusion, thereby improving pose estimation accuracy.

With these improvements, CCAM-Person outperforms the baseline model YOLOv8x-pose on two open-source datasets, MS COCO 2017 and CrowdPose, improving average precision by 2.8% and 3.5%, respectively.
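
To make the "limited receptive field" issue concrete, the sketch below computes the theoretical receptive field of a stack of convolution layers using the standard recurrence (each layer adds `(kernel - 1) * dilation * jump` pixels, where `jump` is the product of the preceding strides). The layer configurations are hypothetical and chosen only for illustration; they are not the MRF design, whose details are not given in this summary. The function name `receptive_field` is likewise an assumption.

```python
def receptive_field(layers):
    """Theoretical receptive field (in input pixels) of stacked conv layers.

    layers: list of (kernel_size, stride, dilation) tuples, input to output.
    """
    rf, jump = 1, 1  # one pixel seen; adjacent outputs are 1 input pixel apart
    for kernel, stride, dilation in layers:
        rf += (kernel - 1) * dilation * jump  # growth contributed by this layer
        jump *= stride                        # spacing of outputs in input pixels
    return rf


# Hypothetical stacks (NOT taken from the paper): a strided 3x3 stem
# followed by two 3x3 convolutions, without and with dilation.
plain = [(3, 2, 1), (3, 1, 1), (3, 1, 1)]
dilated = [(3, 2, 1), (3, 1, 2), (3, 1, 3)]

# Parallel branches with different dilation rates see different extents
# of the input, which is the general idea behind multi-scale designs.
stem = [(3, 2, 1)]
branches = {d: receptive_field(stem + [(3, 1, d)]) for d in (1, 2, 3)}

print(receptive_field(plain))    # 11
print(receptive_field(dilated))  # 23
print(branches)                  # {1: 7, 2: 11, 3: 15}
```

With the same number of layers, the dilated stack covers roughly twice the input extent, and the parallel branches cover three different extents at once. This is one common way to widen or diversify the receptive field; whether MRF uses dilation, larger kernels, or another mechanism should be checked against the paper itself.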