Abstract: Reconstructing multi-human body mesh from a single monocular image is an important but challenging computer vision problem. In addition to the individual body mesh models, we need to estimate relative 3D positions among subjects to generate a coherent representation. In this work, through a single graph neural network, named MUG (Multi-hUman Graph network), we construct coherent multi-human meshes using only multi-human 2D pose as input. Compared with existing methods, which adopt a detection-style pipeline (i.e., extracting image features and then locating human instances and recovering body meshes from that) and suffer from the significant domain gap between lab-collected training datasets and in-the-wild testing datasets, our method benefits from the 2D pose, which has a relatively consistent geometric property across datasets. Our method works as follows: First, to model the multi-human environment, it processes multi-human 2D poses and builds a novel heterogeneous graph, where nodes from different people and within one person are connected to capture inter-human interactions and draw the body geometry (i.e., skeleton and mesh structure). Second, it employs a dual-branch graph neural network structure -- one for predicting inter-human depth relation and the other one for predicting root-joint-relative mesh coordinates. Finally, the entire multi-human 3D meshes are constructed by combining the output from both branches. Extensive experiments demonstrate that MUG outperforms previous multi-human mesh estimation methods on standard 3D human benchmarks -- Panoptic, MuPoTS-3D and 3DPW.
What problem does this paper attempt to address?
This paper addresses the reconstruction of 3D body meshes for multiple people from a single monocular image. The authors point out that existing multi-human methods usually adopt a detection-style pipeline: they extract features from the whole image, locate human instances, and then recover body meshes. Such pipelines suffer from a significant domain gap between lab-collected training datasets and in-the-wild test datasets, which limits their performance. To overcome this, the paper proposes MUG (Multi-hUman Graph network), a graph-neural-network method that takes only multi-human 2D poses as input and constructs coherent 3D meshes for all people in the scene.
### Main contributions of the paper
1. **New multi-human reconstruction pipeline**: The method takes 2D poses as input and outputs both the multi-human meshes and their 3D positions through a single graph network. Because the geometric properties of 2D poses are relatively consistent across datasets, the method is more robust to domain gaps.
2. **Novel graph neural network, MUG**: The paper proposes a heterogeneous graph convolutional network that uses joint nodes, mesh vertex nodes and several edge types to represent relationships within and between people.
3. **Improved results over existing methods**: On standard multi-person 3D human datasets (Panoptic, MuPoTS-3D and 3DPW), MUG is reported to outperform previous state-of-the-art methods.
### Method overview
1. **Graph construction**:
- Each person corresponds to a subgraph that contains two types of nodes: joint nodes and mesh nodes.
- Joint nodes are connected according to the human skeletal structure, and mesh nodes are connected according to the human mesh topology.
- Each mesh node is connected to its two nearest joint nodes.
- To represent relationships between people, joint nodes belonging to different people are also connected.
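The four edge types listed above can be sketched on a toy example. This is only an illustration: the 3-joint skeleton, the 4-vertex mesh, the function name `build_graph`, the use of Euclidean distance on a template to pick the "two nearest" joints, and the choice to link same-index joints across people are all assumptions, not details given in the summary.

```python
from itertools import combinations
from math import dist


def build_graph(num_people, skeleton_edges, mesh_edges, joint_pos, vertex_pos):
    """Return typed edges over nodes keyed as (person, kind, index)."""
    edges = []
    num_joints = len(joint_pos)
    for p in range(num_people):
        # Intra-person edges between joint nodes, following the skeleton.
        for a, b in skeleton_edges:
            edges.append(((p, "joint", a), (p, "joint", b), "skeleton"))
        # Intra-person edges between mesh nodes, following the mesh topology.
        for a, b in mesh_edges:
            edges.append(((p, "mesh", a), (p, "mesh", b), "mesh"))
        # Each mesh node attaches to its two nearest joints
        # (assumption: Euclidean distance on a template layout).
        for v, vp in enumerate(vertex_pos):
            nearest = sorted(range(num_joints),
                             key=lambda j: dist(vp, joint_pos[j]))[:2]
            for j in nearest:
                edges.append(((p, "mesh", v), (p, "joint", j), "mesh-joint"))
    # Inter-person edges (assumption: same-index joints are linked).
    for p, q in combinations(range(num_people), 2):
        for j in range(num_joints):
            edges.append(((p, "joint", j), (q, "joint", j), "inter-person"))
    return edges


# Hypothetical toy template: 3 joints in a chain, 4 mesh vertices nearby.
joint_pos = [(0.0, 0.0), (0.0, 1.0), (0.0, 2.0)]
vertex_pos = [(0.1, 0.0), (0.1, 0.9), (0.1, 1.6), (0.1, 2.1)]
edges = build_graph(
    num_people=2,
    skeleton_edges=[(0, 1), (1, 2)],
    mesh_edges=[(0, 1), (1, 2), (2, 3)],
    joint_pos=joint_pos,
    vertex_pos=vertex_pos,
)
```

Keeping the edge type as an explicit label is one simple way to let a heterogeneous network treat skeleton, mesh, mesh-to-joint and inter-person connections differently.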
2. **Construction of node features**:
- Joint-node features are built mainly from the input 2D pose, which is normalized first.
- Mesh-node features are initialized from the 2D positions of the nearest joint nodes.
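A minimal sketch of this feature construction follows. The summary does not say which normalization is used or how the nearest joints' positions are combined, so both choices here (scaling pixels to [-1, 1] by image size, and averaging the nearest joints' features) are assumptions, as are the function names.

```python
def normalize_pose(pose_px, width, height):
    """Map pixel coordinates to [-1, 1] using the image size (assumed scheme)."""
    return [(2 * x / width - 1, 2 * y / height - 1) for x, y in pose_px]


def init_mesh_features(joint_feats, nearest_joints):
    """Initialise each mesh node as the mean of its nearest joints' 2D features
    (assumed combination rule)."""
    feats = []
    for joints in nearest_joints:
        xs = [joint_feats[j][0] for j in joints]
        ys = [joint_feats[j][1] for j in joints]
        feats.append((sum(xs) / len(xs), sum(ys) / len(ys)))
    return feats


# Hypothetical 3-joint pose in a 640x480 image, and two mesh nodes
# whose nearest joints are (0, 1) and (1, 2) respectively.
pose = [(320, 240), (320, 120), (480, 360)]
joint_feats = normalize_pose(pose, width=640, height=480)
mesh_feats = init_mesh_features(joint_feats, nearest_joints=[(0, 1), (1, 2)])
```

Normalizing by the whole image rather than per person keeps the relative 2D placement of different people, which the network would need to reason about inter-human depth.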
3. **Network structure**:
- The network has two branches: one processes joint nodes and the other processes mesh nodes.
- The two branches are connected through a GCN block; together they output the root-joint depth and the root-relative mesh coordinates.
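To make the building block concrete, here is a sketch of a single graph-convolution step of the kind each branch could stack. It is not the paper's layer: it uses plain mean aggregation over a node and its neighbours instead of any particular normalization, and the name `gcn_layer`, the weights and the 3-node graph are illustrative assumptions.

```python
def gcn_layer(features, neighbors, weight):
    """One simplified graph-convolution step.

    For each node: average the features of the node and its neighbours,
    apply a linear map, then ReLU.
    """
    out = []
    dim_out = len(weight[0])
    for i, nbrs in enumerate(neighbors):
        group = [i] + list(nbrs)  # include a self-loop
        dim_in = len(features[i])
        agg = [sum(features[n][d] for n in group) / len(group)
               for d in range(dim_in)]
        row = [max(0.0, sum(agg[d] * weight[d][k] for d in range(dim_in)))
               for k in range(dim_out)]
        out.append(row)
    return out


# Hypothetical 3-node chain graph with 2-D features and a 2x2 weight matrix.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
neighbors = [[1], [0, 2], [1]]
weight = [[1.0, -1.0], [0.0, 1.0]]
hidden = gcn_layer(features, neighbors, weight)
```

In a two-branch design, layers like this would run separately over the joint nodes and the mesh nodes, with a further block passing information between the two sets.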
4. **Depth estimation**:
- The depth from each person's root joint to the camera is inferred from the 2D poses, and a relative depth loss is used to improve the accuracy of the depth estimates.
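The summary does not give the form of the relative depth loss. One plausible form, shown purely as an assumption, penalizes errors in the pairwise depth differences between people, so that the ordering and spacing of people along the camera axis matter more than a shared offset. The function name and the numbers are illustrative.

```python
from itertools import combinations


def relative_depth_loss(pred, gt):
    """Mean absolute error on pairwise root-depth differences (assumed form)."""
    pairs = list(combinations(range(len(pred)), 2))
    if not pairs:
        return 0.0  # a single person has no relative depth to supervise
    total = sum(abs((pred[i] - pred[j]) - (gt[i] - gt[j])) for i, j in pairs)
    return total / len(pairs)


# Hypothetical root depths in metres for three people.
gt_depth = [3.0, 4.0, 4.0]
loss_wrong_gap = relative_depth_loss([3.0, 5.0, 4.0], gt_depth)
loss_shifted = relative_depth_loss([4.0, 5.0, 5.0], gt_depth)
```

A loss of this shape is zero when every prediction is off by the same constant, which is why it would normally be paired with a term on absolute depth.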
5. **Multi-human reconstruction**:
- The predicted root depth is combined with the root-relative mesh coordinates to compute the absolute 3D mesh coordinates of each person.
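The combination step can be sketched as follows, assuming a pinhole camera with known focal length and principal point; the summary does not state the camera model, so that part, along with the function names and sample values, is an assumption. The root joint's 2D location and predicted depth are back-projected to a 3D position, which is then added to every root-relative vertex.

```python
def backproject_root(u, v, depth, focal, cx, cy):
    """Back-project the root joint's pixel (u, v) at a given depth
    using a pinhole camera model (assumed)."""
    return ((u - cx) * depth / focal, (v - cy) * depth / focal, depth)


def absolute_mesh(rel_vertices, root_xyz):
    """Shift root-relative vertices by the root's absolute 3D position."""
    rx, ry, rz = root_xyz
    return [(x + rx, y + ry, z + rz) for x, y, z in rel_vertices]


# Hypothetical values: root at pixel (420, 240), 4 m from the camera.
root = backproject_root(u=420, v=240, depth=4.0, focal=1000.0, cx=320.0, cy=240.0)
rel_vertices = [(0.0, 0.0, 0.0), (0.1, -0.5, 0.05)]
mesh_abs = absolute_mesh(rel_vertices, root)
```

Applying this per person places all meshes in one camera coordinate frame, which is what makes the multi-human result coherent.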
### Experimental results
- **Human3.6M**: Although this is a single-person dataset, MUG is reported to perform strongly on it as well.
- **MuPoTS-3D, Panoptic and 3DPW**: On these standard multi-person 3D human datasets, MUG is reported to outperform existing state-of-the-art methods.
### Summary
By introducing MUG, the paper tackles 3D mesh reconstruction of multiple people from a monocular image, with particular attention to the domain gap between lab-collected and in-the-wild data. Because 2D poses have relatively consistent geometric properties across datasets, using them as input lets MUG reconstruct multi-human 3D meshes more accurately than the detection-style methods it is compared against.