Abstract: The model-based estimation of 3D animal pose and shape from images enables computational modeling of animal behavior. Training models for this purpose requires large amounts of labeled image data with precise pose and shape annotations. However, capturing such data requires multi-view or marker-based motion-capture systems, which are impractical to adapt to wild animals in situ and impossible to scale across a comprehensive set of animal species. Some have attempted to address the challenge of procuring training data by pseudo-labeling individual real-world images through manual 2D annotation, followed by optimization of 3D parameters to fit those labels. While this approach may produce silhouette-aligned samples, the obtained pose and shape parameters are often implausible due to the ill-posed nature of the monocular fitting problem. Sidestepping real-world ambiguity, others have designed complex synthetic-data-generation pipelines leveraging video-game engines and collections of artist-designed 3D assets. Such engines yield perfect ground-truth annotations but often lack visual realism and require considerable manual effort to adapt to new species or environments. Motivated by these shortcomings, we propose an alternative approach to synthetic-data generation: rendering with a conditional image-generation model. We introduce a pipeline that samples a diverse set of poses and shapes for a variety of mammalian quadrupeds and generates realistic images with corresponding ground-truth pose and shape parameters. To demonstrate the scalability of our approach, we introduce GenZoo, a synthetic dataset containing one million images of distinct subjects. We train a 3D pose and shape regressor on GenZoo, which achieves state-of-the-art performance on a real-world animal pose and shape estimation benchmark, despite being trained solely on synthetic data.
https://genzoo.is.tue.mpg.de
