Abstract: In recent years, there has been significant progress in human pose estimation, fueled by the widespread adoption of deep convolutional neural networks. However, despite these advancements, multi-person 2D pose estimation still remains highly challenging due to factors such as occlusion, noise, and non-rigid body movements. Currently, most multi-person pose estimation approaches handle joint localization and association separately. This study proposes a direct regression-based method to estimate the 2D human pose from a single image. The authors name this network YOLO-Rlepose. Compared to traditional methods, YOLO-Rlepose leverages Transformer models to better capture global dependencies between image feature blocks and preserves sufficient spatial information for keypoint detection through a multi-head self-attention mechanism. To further improve the accuracy of the YOLO-Rlepose model, this paper proposes the following enhancements. Firstly, this study introduces the C3 Module with Swin Transformer (C3STR). This module builds upon the C3 module in You Only Look Once (YOLO) by incorporating a Swin Transformer branch, enhancing the YOLO-Rlepose model's ability to capture global information and rich contextual information. Next, a novel loss function named Rle-Oks loss is proposed. The loss function facilitates the training process by learning the distributional changes through Residual Log-likelihood Estimation. To assign different weights based on the importance of different keypoints in the human body, this study introduces a weight coefficient into the loss function. The experiments proved the efficiency of the proposed YOLO-Rlepose model. On the COCO dataset, the model outperforms the previous SOTA method by 2.11% in AP.

### What problems does this paper attempt to solve?

This paper addresses the challenges of multi-person pose estimation, in particular 2D pose estimation in complex scenes. The research focuses on the following aspects:

1. **Occlusion, Noise and Non-Rigid Body Movements**:
   - In multi-person pose estimation, human joints may be occluded by other objects, by other body parts, or by the person's own body, which makes it difficult for pose estimation algorithms to accurately detect the position and angle of joints.
   - These factors pose major challenges for existing pose estimation algorithms in complex real-world scenes.
2. **Limitations of Existing Methods**:
   - **Top-down methods**: These methods rely on a human detector to locate the bounding box of each human instance and then crop each instance for keypoint detection. If the human detector performs poorly, keypoint detection accuracy suffers as well. These methods also have a high computational cost: running time grows with the number of human instances in the image.
   - **Heatmap-based methods**: Although these methods perform well, they require large amounts of computing and storage resources and are difficult to apply to single-stage models.
3. **Improvement of Regression Methods**:
   - Although regression methods offer fast inference, in practice their performance degrades under occlusion, motion blur and truncation.
   - Existing regression methods usually treat all keypoints equally and do not account for differences in the importance of individual keypoints.
To solve these problems, the authors propose an improved YOLO model, named **YOLO-Rlepose**, with the following innovations:

- **Swin Transformer and C3STR Module**: Integrating the Swin Transformer into the C3 module (the resulting module is named C3STR) enhances the model's ability to capture global information and rich contextual information.
- **Rle-Oks Loss Function**: A new loss function is proposed. It uses Residual Log-likelihood Estimation (RLE) to model changes in the output distribution, and it introduces weight coefficients for different keypoints to reflect differences in their importance.

With these improvements, YOLO-Rlepose reaches an AP (Average Precision) of 65.01% on the COCO dataset, which is 2.11% higher than the previous state-of-the-art method (YOLO-Pose).
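The two ingredients behind the Rle-Oks loss, keypoint similarity and a keypoint-weighted likelihood term, can be illustrated with a minimal sketch. This is not the paper's formulation, which this summary does not give. The `oks` function follows the standard COCO Object Keypoint Similarity definition with the published COCO per-keypoint sigmas. The `weighted_laplace_nll` function is a hypothetical stand-in: it uses a plain Laplace negative log-likelihood with a predicted scale per coordinate, omits the normalizing flow that RLE uses, and takes per-keypoint `weights` whose values are assumptions chosen only for illustration.

```python
import math

# Per-keypoint sigmas for the 17 COCO keypoints (nose, eyes, ears, shoulders,
# elbows, wrists, hips, knees, ankles), as used by the COCO keypoint evaluation.
COCO_SIGMAS = [s / 10.0 for s in (0.26, 0.25, 0.25, 0.35, 0.35, 0.79, 0.79,
                                  0.72, 0.72, 0.62, 0.62, 1.07, 1.07,
                                  0.87, 0.87, 0.89, 0.89)]


def oks(pred, gt, visible, area, sigmas=COCO_SIGMAS, eps=1e-9):
    """Object Keypoint Similarity between one predicted and one ground-truth pose.

    pred, gt: lists of (x, y) pairs; visible: per-keypoint visibility flags;
    area: object segment area in squared pixels.
    """
    total, count = 0.0, 0
    for (px, py), (gx, gy), v, sigma in zip(pred, gt, visible, sigmas):
        if v <= 0:
            continue  # unlabeled keypoints do not contribute
        d2 = (px - gx) ** 2 + (py - gy) ** 2
        variance = (2.0 * sigma) ** 2
        total += math.exp(-d2 / (2.0 * variance * (area + eps)))
        count += 1
    return total / count if count else 0.0


def weighted_laplace_nll(pred, scale, gt, visible, weights):
    """Hypothetical keypoint-weighted Laplace negative log-likelihood.

    scale: predicted (sx, sy) Laplace scale per keypoint (must be positive).
    weights: assumed per-keypoint importance weights, not taken from the paper.
    """
    loss, count = 0.0, 0
    for (px, py), (sx, sy), (gx, gy), v, w in zip(pred, scale, gt, visible, weights):
        if v <= 0:
            continue
        # Laplace NLL per coordinate: log(2b) + |x - mu| / b
        nll = (math.log(2.0 * sx) + abs(gx - px) / sx
               + math.log(2.0 * sy) + abs(gy - py) / sy)
        loss += w * nll
        count += 1
    return loss / count if count else 0.0


# Usage: a synthetic 17-keypoint pose and a prediction shifted by 2 pixels in x.
gt_pose = [(10.0 * i, 5.0 * i) for i in range(17)]
pred_pose = [(x + 2.0, y) for x, y in gt_pose]
visible = [1] * 17
unit_scale = [(1.0, 1.0)] * 17
uniform_weights = [1.0] * 17

similarity = oks(pred_pose, gt_pose, visible, area=1000.0)
loss_value = weighted_laplace_nll(pred_pose, unit_scale, gt_pose, visible, uniform_weights)
```

Raising the weight of a keypoint in `weights` makes its localization error count more in the loss, which is the general idea behind treating keypoints unequally. How the paper derives its weight coefficients is not stated in this summary.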

YOLO-Rlepose: Improved YOLO Based on Swin Transformer and Rle-Oks Loss for Multi-Person Pose Estimation

KSL-POSE: A Real-Time 2D Human Pose Estimation Method Based on Modified YOLOv8-Pose Framework

YOLOPose V2: Understanding and Improving Transformer-based 6D Pose Estimation

MDA-YOLO Person: a 2D human pose estimation model based on YOLO detection framework

Research on Human Posture Estimation Algorithm Based on YOLO-Pose

Object Pose Estimation Based on Improved YOLOX Algorithm

An enhanced real-time human pose estimation method based on modified YOLOv8 framework

Unified End-to-End YOLOv5-HR-TCM Framework for Automatic 2D/3D Human Pose Estimation for Real-Time Applications

RNNPose: 6-DoF Object Pose Estimation Via Recurrent Correspondence Field Estimation and Pose Optimization

RTMO: Towards High-Performance One-Stage Real-Time Multi-Person Pose Estimation

A Compact and Powerful Single-Stage Network for Multi-Person Pose Estimation

Human Pose Estimation Based on Lightweight Multi-Scale Coordinate Attention

Pose Estimation for Swimmers in Video Surveillance

Learning Delicate Local Representations for Multi-person Pose Estimation

Classroom Student Posture Recognition Based on an Improved High-Resolution Network.

RFA-YOLO-POSE: A Fusion Algorithm for Pose Detection and Object Identification Amidst Complex Crowds

Shift Pose: A Lightweight Transformer-like Neural Network for Human Pose Estimation

Poseur: Direct Human Pose Regression with Transformers.

A study of human pose estimation in low-light environments using YOLOv8 model

A Deconvolutional Bottom-up Deep Network for Multi-Person Pose Estimation.

FCPose: Fully Convolutional Multi-Person Pose Estimation with Dynamic Instance-Aware Convolutions