YOLO-Rlepose: Improved YOLO Based on Swin Transformer and Rle-Oks Loss for Multi-Person Pose Estimation

Yi Jiang, Kexin Yang, Jinlin Zhu, Li Qin
DOI: https://doi.org/10.3390/electronics13030563
IF: 2.9
2024-01-31
Electronics
Abstract: In recent years, there has been significant progress in human pose estimation, fueled by the widespread adoption of deep convolutional neural networks. Despite these advancements, multi-person 2D pose estimation remains highly challenging due to factors such as occlusion, noise, and non-rigid body movements. Currently, most multi-person pose estimation approaches handle joint localization and association separately. This study proposes a direct regression-based method, named YOLO-Rlepose, to estimate the 2D human pose from a single image. Compared to traditional methods, YOLO-Rlepose leverages Transformer models to better capture global dependencies between image feature blocks and preserves sufficient spatial information for keypoint detection through a multi-head self-attention mechanism. To further improve the accuracy of the YOLO-Rlepose model, this paper proposes two enhancements. First, this study introduces the C3 Module with Swin Transformer (C3STR). This module builds upon the C3 module in You Only Look Once (YOLO) by incorporating a Swin Transformer branch, enhancing the model's ability to capture global information and rich contextual information. Second, a novel loss function named Rle-Oks loss is proposed. The loss function facilitates training by learning the distributional changes through Residual Log-likelihood Estimation (RLE). To reflect the differing importance of keypoints in the human body, this study introduces a weight coefficient into the loss function. Experiments confirm the efficiency of the proposed YOLO-Rlepose model. On the COCO dataset, the model outperforms the previous state-of-the-art (SOTA) method by 2.11% in AP.
engineering, electrical & electronic; computer science, information systems; physics, applied
What problem does this paper attempt to address?
This paper addresses the challenges of multi-person pose estimation, particularly 2D pose estimation in complex scenes. The research focuses on the following aspects:

1. **Occlusion, Noise and Non-Rigid Body Movements**:
   - In multi-person pose estimation, human joints may be occluded by other objects, by other people's body parts, or by the person's own body, making it difficult for pose estimation algorithms to accurately detect the position and angle of joints.
   - These factors pose great challenges for existing pose estimation algorithms in complex real-world scenes.
2. **Limitations of Existing Methods**:
   - **Top-down methods**: These methods rely on a human detector to locate the bounding box of each human instance and then crop each instance for keypoint detection. If the human detector performs poorly, keypoint detection accuracy suffers as well. The computational cost is also high: running time grows with the number of human instances in the image.
   - **Heatmap-based methods**: Although these methods perform well, they require a large amount of computing and storage resources and are difficult to apply to single-stage models.
3. **Improvement of Regression Methods**:
   - Although regression methods offer fast inference, in practice their performance degrades under occlusion, motion blur and truncation.
   - Existing regression methods usually treat all keypoints equally and do not account for differences in the importance of different keypoints.
To solve these problems, the authors propose an improved YOLO model, named **YOLO-Rlepose**, and introduce the following innovations:

- **Swin Transformer and C3STR Module**: Integrating the Swin Transformer into the C3 module (the result is named C3STR) enhances the model's ability to capture global information and rich contextual information.
- **Rle-Oks Loss Function**: A new loss function is proposed. It uses Residual Log-likelihood Estimation (RLE) to model changes in the output distribution, and it introduces weight coefficients for different keypoints to reflect the differences in their importance.

With these improvements, the AP (Average Precision) of YOLO-Rlepose on the COCO dataset reaches 65.01%, which is 2.11% higher than the previous state-of-the-art method (YOLO-Pose).
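
The idea of weighting keypoints inside a likelihood-based regression loss can be illustrated with a minimal sketch. This is not the paper's Rle-Oks loss, whose exact form is not given in this summary: the sketch assumes a plain Laplace density instead of RLE's learned residual density, and the function name `weighted_laplace_nll`, its arguments, and the example weights are all hypothetical.

```python
import numpy as np


def weighted_laplace_nll(pred_xy, pred_sigma, target_xy, visible, weights):
    """Per-keypoint weighted Laplace negative log-likelihood.

    Illustrative only: a simplified stand-in for a likelihood-based
    keypoint regression loss, NOT the Rle-Oks loss from the paper.

    pred_xy, target_xy : (K, 2) predicted / ground-truth coordinates
    pred_sigma         : (K, 2) positive predicted scales (uncertainty)
    visible            : (K,)   1 for labeled keypoints, 0 otherwise
    weights            : (K,)   hypothetical per-keypoint importance weights
    """
    pred_xy = np.asarray(pred_xy, dtype=float)
    pred_sigma = np.asarray(pred_sigma, dtype=float)
    target_xy = np.asarray(target_xy, dtype=float)
    visible = np.asarray(visible, dtype=float)
    weights = np.asarray(weights, dtype=float)

    # Laplace NLL per coordinate: log(2 * sigma) + |x - mu| / sigma
    nll = np.log(2.0 * pred_sigma) + np.abs(target_xy - pred_xy) / pred_sigma
    per_keypoint = nll.sum(axis=1)  # sum over the x and y coordinates

    # Unlabeled keypoints get zero weight; the rest keep their importance.
    w = weights * visible
    total = w.sum()
    if total == 0.0:
        return 0.0  # no labeled keypoints: nothing to penalize
    return float((w * per_keypoint).sum() / total)


if __name__ == "__main__":
    # Two keypoints; only the first one is mispredicted (by 1 pixel in x).
    pred = [[1.0, 0.0], [5.0, 5.0]]
    target = [[0.0, 0.0], [5.0, 5.0]]
    sigma = [[0.5, 0.5], [0.5, 0.5]]
    vis = [1, 1]
    print(weighted_laplace_nll(pred, sigma, target, vis, [3.0, 1.0]))
    print(weighted_laplace_nll(pred, sigma, target, vis, [1.0, 3.0]))
```

In this sketch, the same 1-pixel error costs more when the mispredicted keypoint carries the larger weight, which is the effect a per-keypoint weight coefficient is meant to produce. How the paper derives its weights and combines them with the RLE term would need to be taken from the paper itself.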