Abstract: Reconstructing multi-human body meshes from a single monocular image is an important but challenging computer vision problem. In addition to the individual body mesh models, we need to estimate relative 3D positions among subjects to generate a coherent representation. In this work, through a single graph neural network, named MUG (Multi-hUman Graph network), we construct coherent multi-human meshes using only multi-human 2D pose as input. Existing methods adopt a detection-style pipeline (i.e., extracting image features, then locating human instances and recovering body meshes from them) and suffer from the significant domain gap between lab-collected training datasets and in-the-wild testing datasets. In contrast, our method benefits from the 2D pose, which has a relatively consistent geometric property across datasets. Our method works as follows. First, to model the multi-human environment, it processes multi-human 2D poses and builds a novel heterogeneous graph, where nodes from different people and within one person are connected to capture inter-human interactions and draw the body geometry (i.e., skeleton and mesh structure). Second, it employs a dual-branch graph neural network structure -- one branch for predicting inter-human depth relation and the other for predicting root-joint-relative mesh coordinates. Finally, the entire multi-human 3D meshes are constructed by combining the output from both branches. Extensive experiments demonstrate that MUG outperforms previous multi-human mesh estimation methods on standard 3D human benchmarks -- Panoptic, MuPoTS-3D and 3DPW.

What problem does this paper attempt to address?

This paper addresses the reconstruction of 3D mesh models of multiple humans from a single monocular image. The authors point out that existing multi-human mesh reconstruction methods usually adopt a detection-style pipeline: they extract features from the entire image, then locate human instances and recover body meshes. These methods suffer from a significant domain gap between training data sets collected in the laboratory and test data sets captured in the wild, which limits their performance. To overcome this, the paper proposes a method based on graph neural networks, MUG (Multi-hUman Graph network), which uses only the 2D poses of multiple humans as input to construct a coherent 3D mesh model of all of them.

### Main contributions of the paper

1. **New multi-human reconstruction process**: Use 2D poses as input and simultaneously output multi-human meshes and 3D positions through a single graph network. The geometric properties of 2D poses are relatively consistent across data sets, which makes the method more robust to domain gaps.
2. **Novel graph neural network MUG**: Propose a heterogeneous graph convolutional network that uses joint nodes, vertex nodes and different types of edges to represent the relationships within and between humans.
3. **Outperforms existing methods**: On standard multi-person 3D human benchmarks (Panoptic, MuPoTS-3D and 3DPW), the method outperforms previous multi-human mesh estimation methods.

### Method overview

1. **Graph construction**:
   - Each human body corresponds to a sub-graph containing two types of nodes: joint nodes and mesh nodes.
   - Joint nodes are connected according to the human skeletal structure, and mesh nodes are connected according to the human mesh topology.
   - Each mesh node is connected to its two nearest joint nodes.
   - To represent the relationships between people, joint nodes are also connected across human bodies.
2. **Construction of node features**:
   - Joint-node features are built mainly from the normalized 2D pose input.
   - Mesh-node features are initialized from the 2D positions of the nearest joint nodes.
3. **Network structure**:
   - The network has two branches: one processes joint nodes and the other processes mesh nodes.
   - The two branches are connected through a GCN block and output, respectively, the depth of each person's root joint and the root-joint-relative mesh coordinates.
4. **Depth estimation**:
   - The depth from each root joint to the camera is inferred from the 2D poses, and a relative depth loss is used to improve the accuracy of depth estimation.
5. **Multi-human reconstruction**:
   - The root depth and the root-joint-relative mesh coordinates are combined to compute the absolute 3D mesh coordinates of each individual.

### Experimental results

- **Human3.6M**: Although this is a single-human data set, the reported results on it also favor MUG.
- **MuPoTS-3D, Panoptic and 3DPW**: On these standard multi-person 3D human data sets, MUG outperforms the previous state-of-the-art methods.

### Summary

The paper tackles 3D mesh reconstruction of multiple humans from monocular images with the MUG method, and its main strength is robustness to domain gaps. Because 2D poses are relatively consistent across data sets, using them as the only input lets MUG reconstruct multi-human 3D meshes more accurately on in-the-wild data.
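
Two steps from the method overview above lend themselves to a short illustration: linking each mesh node to its two nearest joint nodes, and combining a predicted root depth with root-relative mesh coordinates. The following is a minimal sketch, not the authors' implementation. It assumes that "nearest" is measured by Euclidean distance on template positions, and that the root joint is lifted to 3D with a pinhole camera whose intrinsics (`fx`, `fy`, `cx`, `cy`) are known; the summary specifies neither. The function names `mesh_to_joint_edges` and `to_absolute_mesh` are hypothetical.

```python
import numpy as np


def mesh_to_joint_edges(mesh_pos, joint_pos, k=2):
    """Link every mesh node to its k nearest joint nodes (k=2 in the summary).

    mesh_pos:  (V, D) template positions of mesh nodes (assumed input).
    joint_pos: (J, D) template positions of joint nodes (assumed input).
    Returns a (V * k, 2) integer array of (mesh_index, joint_index) pairs.
    """
    # Pairwise Euclidean distances between mesh nodes and joints: shape (V, J).
    diff = mesh_pos[:, None, :] - joint_pos[None, :, :]
    dists = np.linalg.norm(diff, axis=-1)
    # Indices of the k closest joints for each mesh node: shape (V, k).
    nearest = np.argsort(dists, axis=1)[:, :k]
    mesh_idx = np.repeat(np.arange(mesh_pos.shape[0]), k)
    return np.stack([mesh_idx, nearest.reshape(-1)], axis=1)


def to_absolute_mesh(rel_vertices, root_uv, root_depth, fx, fy, cx, cy):
    """Place a root-relative mesh in camera space.

    rel_vertices: (V, 3) mesh coordinates relative to the root joint.
    root_uv:      (2,) pixel coordinates of the root joint in the image.
    root_depth:   scalar depth of the root joint along the camera z-axis.
    fx, fy, cx, cy: pinhole intrinsics (an assumption of this sketch).
    """
    u, v = root_uv
    # Back-project the root joint from pixels to camera coordinates.
    root_xyz = np.array([
        (u - cx) * root_depth / fx,
        (v - cy) * root_depth / fy,
        root_depth,
    ])
    # Translating every vertex by the root position gives absolute coordinates.
    return np.asarray(rel_vertices) + root_xyz
```

Applying `to_absolute_mesh` once per person, each with its own root depth, puts all meshes in one shared camera frame, which is what makes the multi-human output coherent. Units only need to be consistent between `rel_vertices` and `root_depth`.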

MUG: Multi-human Graph Network for 3D Mesh Reconstruction from 2D Pose

Exploring Severe Occlusion: Multi-Person 3D Pose Estimation with Gated Convolution.

Image-Guided Human Reconstruction via Multi-Scale Graph Transformation Networks

3D Human Reconstruction from A Single Depth Image

Graph U-Shaped Network with Mapping-Aware Local Enhancement for Single-Frame 3D Human Pose Estimation

Hierarchical Graph Networks for 3D Human Pose Estimation

Unsupervised Universal Hierarchical Multi-Person 3D Pose Estimation for Natural Scenes

Monocular Expressive 3D Human Reconstruction of Multiple People

Dual networks based 3D Multi-Person Pose Estimation from Monocular Video

Graph-Based 3D Multi-Person Pose Estimation Using Multi-View Images

MMDA: Multi-person marginal distribution awareness for monocular 3D pose estimation

MH‐HMR: Human mesh recovery from monocular images via multi‐hypothesis learning

Marker-Less 3d Human Motion Capture With Monocular Image Sequence And Height-Maps

Pose2Mesh: Graph Convolutional Network for 3D Human Pose and Mesh Recovery from a 2D Human Pose

MUC: Mixture of Uncalibrated Cameras for Robust 3D Human Body Reconstruction

Human Mesh Recovery from Arbitrary Multi-view Images

3D-UGCN: A Unified Graph Convolutional Network for Robust 3D Human Pose Estimation from Monocular RGB Images

Multi-view Shape Generation for a 3D Human-like Body

Monocular 3D Multi-Person Pose Estimation by Integrating Top-Down and Bottom-Up Networks

Multi-view Human Body Mesh Translator

Interweaved Graph and Attention Network for 3D Human Pose Estimation