Abstract: Estimating rigid objects' poses is one of the fundamental problems in computer vision, with a range of applications across automation and augmented reality. Most existing approaches adopt a one-network-per-object-class strategy, depend heavily on objects' 3D models and depth data, and employ time-consuming iterative refinement, which could be impractical for some applications. This paper presents a novel approach, CVAM-Pose, for multi-object monocular pose estimation that addresses these limitations. The CVAM-Pose method employs a label-embedded conditional variational autoencoder network to implicitly abstract regularised representations of multiple objects in a single low-dimensional latent space. This autoencoding process uses only images captured by a projective camera and is robust to object occlusion and scene clutter. The classes of objects are one-hot encoded and embedded throughout the network. The proposed label-embedded pose regression strategy interprets the learnt latent space representations utilising continuous pose representations. Ablation tests and systematic evaluations demonstrate the scalability and efficiency of the CVAM-Pose method for multi-object scenarios. The proposed CVAM-Pose outperforms competing latent space approaches. For example, it is 25% and 20% better than the AAE and Multi-Path methods, respectively, when evaluated using the $\mathrm{AR_{VSD}}$ metric on the Linemod-Occluded dataset. It also achieves results somewhat comparable to methods reliant on 3D models reported in BOP challenges. Code available: https://github.com/JZhao12/CVAM-Pose
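To make the label-embedded conditional variational autoencoder idea more concrete, here is a minimal NumPy sketch of a single conditioning step: a one-hot class label is concatenated with image features, a Gaussian latent code is sampled with the reparameterisation trick, and a KL term regularises the latent space towards a standard normal. This is not the authors' implementation. The dimensions, the linear "encoder" weights `w_mu` and `w_logvar`, and all function names are assumptions chosen for illustration only.

```python
import numpy as np

def one_hot(class_id, num_classes):
    """Return a one-hot vector for the given object class."""
    v = np.zeros(num_classes, dtype=np.float64)
    v[class_id] = 1.0
    return v

def encode(features, label, w_mu, w_logvar):
    """Hypothetical linear encoder conditioned on the label by concatenation."""
    x = np.concatenate([features, label])
    return w_mu @ x, w_logvar @ x

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL divergence between N(mu, diag(sigma^2)) and N(0, I)."""
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))

# Illustrative sizes only; none of these values come from the paper.
rng = np.random.default_rng(0)
num_classes, feat_dim, latent_dim = 8, 16, 4

features = rng.standard_normal(feat_dim)   # stand-in for image features
label = one_hot(3, num_classes)            # class of the object in the crop
w_mu = 0.1 * rng.standard_normal((latent_dim, feat_dim + num_classes))
w_logvar = 0.1 * rng.standard_normal((latent_dim, feat_dim + num_classes))

mu, log_var = encode(features, label, w_mu, w_logvar)
z = reparameterize(mu, log_var, rng)       # latent code shared by all classes
kl = kl_to_standard_normal(mu, log_var)    # regularisation term
print(z.shape, kl >= 0.0)
```

In a full network, the same one-hot label would be injected at several layers rather than only at the input, which is what "embedded throughout the network" suggests; the single concatenation here is a simplification.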

What problem does this paper attempt to address?

The problem this paper addresses is estimating the six-degree-of-freedom (6-DoF) poses of multiple rigid objects in real time from monocular images. Specifically, the paper proposes solutions to several major limitations of existing methods:

1. **Single-network multi-object processing**: Most existing methods adopt the "one-network-per-object-class" strategy, which consumes a large amount of resources and scales poorly in multi-object scenarios. The proposed method (CVAM-Pose) uses a single network to handle pose estimation for multiple objects, improving efficiency and scalability.
2. **No dependence on 3D models and depth data**: Many existing methods require 3D models or depth data and usually need time-consuming iterative refinement during inference. CVAM-Pose uses only monocular images and requires no 3D models, depth data, or post-processing refinement, so it is applicable to a wider range of scenarios.
3. **Robustness**: Existing methods perform poorly in difficult conditions such as occlusion, lack of texture, truncation, and cluttered scenes. CVAM-Pose improves robustness in these scenarios through the Conditional Variational Autoencoder (CVAE) and a continuous pose regression strategy.

### Main contributions of the paper

1. **Conditional generative model**: For the first time, a conditional generative model is used to efficiently represent the poses of multiple objects, and the learning of high-level features is improved through the adaptive label embedding technique.
2. **Label embedding technique**: An inter-layer one-hot encoding technique embeds class labels into each layer of the network, enhancing the learning of multi-object representations.
3. **Regularized and constrained latent space**: A regularized and constrained latent space is constructed to represent multiple objects, so that a single latent space can be extended to multi-object representation without affecting pose accuracy.
4. **Continuous pose regression algorithm**: A Multi-Layer Perceptron (MLP) regresses a continuous pose representation from the latent space representation, which avoids the discretization problem and achieves fast and accurate multi-object pose estimation.

### Experimental results

- **Ablation experiments**: A series of ablation experiments verifies the effect of the adaptive label embedding technique, latent space regularization, and latent space dimension.
- **Benchmark tests**: Benchmark tests are carried out on the Linemod-Occluded dataset, and the results show that CVAM-Pose significantly outperforms other latent-representation-based methods such as AAE, AAE-ICP, and Multi-Path on multiple metrics.
- **Comparison with 3D-model-dependent methods**: Although it does not depend on 3D models, CVAM-Pose still achieves results comparable to those of 3D-model-dependent methods (such as EPOS, CDPN, and PVNet) on some metrics.

### Conclusion

The paper proposes a new multi-object monocular pose estimation method, CVAM-Pose. Through the conditional generative model, label embedding technique, and continuous pose regression strategy, it addresses the deficiencies of existing methods in terms of resource consumption, robustness, and real-time processing. The method performs well in complex scenes (such as occlusion and lack of texture), providing a new solution for 6-DoF pose estimation.
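The continuous pose regression described above can be illustrated with a short sketch. One widely used continuous rotation representation is a 6D vector mapped to a rotation matrix by Gram-Schmidt orthogonalisation. Whether CVAM-Pose uses exactly this representation is an assumption here, and the linear head `regress_pose`, its weights, and all dimensions are hypothetical stand-ins for the MLP mentioned in the summary.

```python
import numpy as np

def rotation_from_6d(v):
    """Map a 6D vector to a 3x3 rotation matrix via Gram-Schmidt."""
    a1, a2 = v[:3], v[3:]
    b1 = a1 / np.linalg.norm(a1)
    a2_orth = a2 - np.dot(b1, a2) * b1      # remove the component along b1
    b2 = a2_orth / np.linalg.norm(a2_orth)
    b3 = np.cross(b1, b2)                   # completes a right-handed basis
    return np.stack([b1, b2, b3], axis=1)   # basis vectors as columns

def regress_pose(z, label, w_rot, w_trans):
    """Hypothetical linear head: latent code + one-hot label -> (R, t)."""
    x = np.concatenate([z, label])
    return rotation_from_6d(w_rot @ x), w_trans @ x

# Illustrative sizes and random weights; none of these come from the paper.
rng = np.random.default_rng(0)
latent_dim, num_classes = 4, 8

z = rng.standard_normal(latent_dim)         # latent code from the encoder
label = np.zeros(num_classes)
label[3] = 1.0                              # one-hot object class

w_rot = rng.standard_normal((6, latent_dim + num_classes))
w_trans = rng.standard_normal((3, latent_dim + num_classes))

R, t = regress_pose(z, label, w_rot, w_trans)
print(np.allclose(R.T @ R, np.eye(3)), t.shape)
```

Because the output is orthogonalised rather than looked up from a discrete set of viewpoints, any 6D vector with two non-collinear halves yields a valid rotation, which is the property that makes direct regression practical.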

CVAM-Pose: Conditional Variational Autoencoder for Multi-Object Monocular Pose Estimation

Temporal Consistent Object Pose Estimation from Monocular Videos

Unseen Object Pose Estimation via Registration

Recurrent Volume-based 3D Feature Fusion for Real-time Multi-view Object Pose Estimation

Recurrent Volume-Based 3-D Feature Fusion for Real-Time Multiview Object Pose Estimation.

Learning a Robust Part-Aware Monocular 3D Human Pose Estimator via Neural Architecture Search

Estimation of 6D Pose of Objects Based on a Variant Adversarial Autoencoder

MORE: Simultaneous Multi-View 3D Object Recognition and Pose Estimation

Attention Guided 6D Object Pose Estimation with Multi-constraints Voting Network

Occlusion-Aware Self-Supervised Monocular 6D Object Pose Estimation.

Dual networks based 3D Multi-Person Pose Estimation from Monocular Video

Joint Multi-Person Pose Estimation and Semantic Part Segmentation

Robust and Efficient Estimation of Absolute Camera Pose for Monocular Visual Odometry.

Category-level Pose Estimation and Iterative Refinement for Monocular RGB-D Image

Self-learning Canonical Space for Multi-view 3D Human Pose Estimation

MMDA: Multi-person marginal distribution awareness for monocular 3D pose estimation

Nothing But Geometric Constraints: A Model-Free Method for Articulated Object Pose Estimation

Robust self-supervised monocular visual odometry based on prediction-update pose estimation network.

Multi-view object pose estimation from correspondence distributions and epipolar geometry

Leveraging Positional Encoding for Robust Multi-Reference-Based Object 6D Pose Estimation

High-resolution open-vocabulary object 6D pose estimation