Abstract: Depth uncertainty is a core challenge in 3D human pose estimation, especially when the camera parameters are unknown. Previous methods try to reduce its impact through multi-view and/or multi-frame feature fusion, which exploits more spatial and temporal information. However, these methods generally yield only marginal improvements, and their performance still cannot match that of camera-parameter-required methods. The reason is that their handcrafted fusion schemes cannot fuse the features flexibly; e.g., the multi-view and multi-frame features are fused separately. Moreover, the diverse and complicated fusion schemes obscure the principles for developing effective fusion schemes and raise the open question of whether simpler and more elegant fusion schemes exist. To address these issues, this paper proposes an extremely concise unified feature fusion transformer (FusionFormer) with minimal handcrafted design for 3D pose estimation. FusionFormer fuses the multi-view and multi-frame features in a single unified scheme in which all features are accessible to each other and can thus be fused flexibly. Experimental results on several mainstream datasets demonstrate that FusionFormer achieves state-of-the-art performance. To the best of our knowledge, this is the first camera-parameter-free method to outperform existing camera-parameter-required methods, revealing the tremendous potential of camera-parameter-free models. These experimental results, together with our concise feature fusion scheme, answer the above open question. Another appealing property of FusionFormer is that, thanks to its effective fusion scheme, it achieves impressive performance with a smaller model size and fewer FLOPs.
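
The idea of a unified fusion scheme, in which every multi-view and multi-frame feature can attend to every other, can be illustrated with a minimal sketch. This is not the FusionFormer implementation: it assumes one feature vector per (view, frame) pair, uses a single attention head with identity query/key/value projections, and all function names and toy values are hypothetical. It only shows the contrast with separate fusion: the view and frame axes are flattened into one token sequence before attention, so no fusion order is handcrafted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections.

    Assumption for illustration only: a real transformer layer would
    use learned projections, multiple heads, and an MLP block.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for q in tokens:
        # Scaled dot-product scores of this token against every token.
        weights = softmax(
            [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        )
        # Weighted sum of all tokens (values are the tokens themselves).
        fused.append(
            [sum(w * v[d] for w, v in zip(weights, tokens)) for d in range(dim)]
        )
    return fused


def unified_fusion(features):
    """Fuse features[v][t] (view v, frame t) in one attention pass.

    The view and frame axes are flattened into a single sequence, so
    each token can attend across views and across frames at once.
    """
    num_views = len(features)
    num_frames = len(features[0])
    tokens = [features[v][t] for v in range(num_views) for t in range(num_frames)]
    fused = self_attention(tokens)
    # Restore the (view, frame) layout.
    return [
        [fused[v * num_frames + t] for t in range(num_frames)]
        for v in range(num_views)
    ]


# Toy input: 2 views x 3 frames, 4-dimensional features (hypothetical values).
features = [
    [[float(v + t + d) for d in range(4)] for t in range(3)] for v in range(2)
]
fused = unified_fusion(features)
```

A separate-fusion baseline would instead call `self_attention` once per frame across views and once per view across frames; the flattening step is what removes that handcrafted ordering in this sketch.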

MetaFuse: A Pre-trained Fusion Model for Human Pose Estimation

MoreFusion: Multi-object Reasoning for 6D Pose Estimation from Volumetric Fusion

Adaptively Fusing Complete Multi-resolution Features for Human Pose Estimation

FusePose: IMU-Vision Sensor Fusion in Kinematic Space for Parametric Human Pose Estimation

Recurrent Volume-based 3D Feature Fusion for Real-time Multi-view Object Pose Estimation

Recurrent Volume-Based 3-D Feature Fusion for Real-Time Multiview Object Pose Estimation.

TransFusion: Cross-view Fusion with Transformer for 3D Human Pose Estimation

FusionFormer: A Concise Unified Feature Fusion Transformer for 3D Pose Estimation

AdaptiveFusion: Adaptive Multi-Modal Multi-View Fusion for 3D Human Body Reconstruction

Rotation-Constrained Cross-View Feature Fusion for Multi-View Appearance-based Gaze Estimation

Efficient Hierarchical Multi-view Fusion Transformer for 3D Human Pose Estimation

Zero-Shot 3D Pose Estimation of Unseen Object by Two-Step RGB-D Fusion

Multi-View Human Mesh Reconstruction via Direction-Aware Feature Fusion

Mitigating imbalances in heterogeneous feature fusion for multi-class 6D pose estimation

Unbiased Feature Position Alignment for Human Pose Estimation

Fusing Monocular Images and Sparse IMU Signals for Real-time Human Motion Capture

Gated Region-Refine Pose Transformer for Human Pose Estimation

MetaFusion: Infrared and Visible Image Fusion Via Meta-Feature Embedding from Object Detection

MixedFusion: 6D Object Pose Estimation from Decoupled RGB-Depth Features

A Transformer-based multi-modal fusion network for 6D pose estimation

Multi-view Pose Fusion for Occlusion-Aware 3D Human Pose Estimation