Abstract: Estimating the 3D structure of the human body from natural scenes is a fundamental aspect of visual perception. 3D human pose estimation is a vital step in advancing fields like AIGC and human-robot interaction, serving as a crucial technique for understanding and interacting with human actions in real-world settings. However, current datasets, often collected under single laboratory conditions using complex motion capture equipment and unvarying backgrounds, are insufficient. The absence of datasets captured under variable conditions is stalling progress on this crucial task. To facilitate the development of 3D pose estimation, we present FreeMan, the first large-scale, multi-view dataset collected under real-world conditions. FreeMan was captured by synchronizing 8 smartphones across diverse scenarios. It comprises 11M frames from 8000 sequences, viewed from different perspectives. These sequences cover 40 subjects across 10 different scenarios, each with varying lighting conditions. We have also established a semi-automated pipeline with error detection to reduce the workload of manual checking and ensure precise annotation. We provide comprehensive evaluation baselines for a range of tasks, underlining the significant challenges posed by FreeMan. Further evaluations on standard indoor/outdoor human sensing datasets reveal that FreeMan offers robust representation transferability in real and complex scenes. Code and data are available at

What problem does this paper attempt to address?

The paper primarily addresses the application challenges of 3D Human Pose Estimation (3D HPE) in real-world scenarios by proposing a new large-scale multi-view dataset called FreeMan. The research aims to solve the following key issues:

1. **Improving the generalization ability of models under real-world conditions**: Existing datasets are usually collected under laboratory conditions using complex motion capture equipment and have uniform backgrounds, leading to poor performance of trained models in real-world environments.
2. **Increasing scene diversity**: Most existing 3D HPE datasets are collected in controlled environments, resulting in limited variations in lighting conditions and backgrounds, which is a limitation for models that need to handle complex scenes.
3. **Expanding the range of actions and human scales**: The range of human actions in existing datasets is limited, and due to the use of fixed cameras, the size of humans in different videos is relatively fixed, lacking diversity.
4. **Enhancing the scalability of datasets**: Current datasets rely heavily on expensive manual processing for annotations, limiting the expansion of dataset scale. Especially with variable camera positions, how to effectively align and annotate data from different cameras remains an unresolved issue.

To address the above problems, the researchers proposed the FreeMan dataset, a large-scale multi-view 3D HPE dataset collected under real-world conditions. This dataset includes 11 million frames of images synchronously captured from 8 smartphone cameras, covering performances by 40 participants in 10 different types of scenes. The features of the FreeMan dataset include:

- Diverse backgrounds and lighting conditions, enhancing the model's generalization ability in real-world scenarios.
- Significant variations in the distance between humans and cameras, leading to changes in human sizes, increasing the dataset's diversity.
- A semi-automated annotation pipeline combined with an error detection mechanism, reducing manual workload and improving the scalability and annotation accuracy of the dataset.
- The dataset is suitable for various tasks, including monocular 3D pose estimation, 2D to 3D lifting, multi-view 3D pose estimation, and human neural rendering.

Experimental results show that models trained using the FreeMan dataset significantly outperform those trained with other existing datasets on the 3DPW test set, demonstrating the effectiveness of the FreeMan dataset in improving the real-world generalization ability of models.
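The summary above does not say how the error detection in the annotation pipeline works. As a rough illustration only, the sketch below shows one common way such a check can be built for multi-view data: triangulate a joint from all views, then flag views whose 2D keypoint lies far from the reprojected 3D point. This is not the authors' method. It assumes calibrated cameras with known 3x4 projection matrices, and the function names and the 10-pixel threshold are hypothetical.

```python
import numpy as np


def triangulate_point(projections, points_2d):
    """Linear (DLT) triangulation of one joint from N >= 2 calibrated views.

    projections: sequence of 3x4 camera projection matrices.
    points_2d:   sequence of (u, v) pixel coordinates, one per view.
    """
    rows = []
    for P, (u, v) in zip(projections, points_2d):
        # Each view contributes two linear constraints on the homogeneous point.
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    # The least-squares solution is the right singular vector belonging to
    # the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]


def reprojection_errors(projections, points_2d, point_3d):
    """Pixel distance between each observed keypoint and the reprojected point."""
    X_h = np.append(point_3d, 1.0)
    errors = []
    for P, uv in zip(projections, points_2d):
        x = P @ X_h
        errors.append(np.linalg.norm(x[:2] / x[2] - np.asarray(uv, dtype=float)))
    return np.array(errors)


def flag_suspect_views(projections, points_2d, threshold_px=10.0):
    """Return the triangulated point, per-view errors, and a suspect-view mask.

    threshold_px is an illustrative value, not one taken from the paper.
    """
    point_3d = triangulate_point(projections, points_2d)
    errors = reprojection_errors(projections, points_2d, point_3d)
    return point_3d, errors, errors > threshold_px
```

In a pipeline of this kind, frames with flagged views would be routed to a human annotator while the rest pass through automatically, which is one way a semi-automated process can reduce manual checking.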

FreeMan: Towards Benchmarking 3D Human Pose Estimation under Real-World Conditions
