CVAM-Pose: Conditional Variational Autoencoder for Multi-Object Monocular Pose Estimation

Jianyu Zhao, Wei Quan, Bogdan J. Matuszewski
2024-10-12
Abstract: Estimating rigid objects' poses is one of the fundamental problems in computer vision, with a range of applications across automation and augmented reality. Most existing approaches adopt a one-network-per-object-class strategy, depend heavily on objects' 3D models and depth data, and employ time-consuming iterative refinement, which could be impractical for some applications. This paper presents a novel approach, CVAM-Pose, for multi-object monocular pose estimation that addresses these limitations. The CVAM-Pose method employs a label-embedded conditional variational autoencoder network to implicitly abstract regularised representations of multiple objects in a single low-dimensional latent space. This autoencoding process uses only images captured by a projective camera and is robust to objects' occlusion and scene clutter. The classes of objects are one-hot encoded and embedded throughout the network. The proposed label-embedded pose regression strategy interprets the learnt latent space representations utilising continuous pose representations. Ablation tests and systematic evaluations demonstrate the scalability and efficiency of the CVAM-Pose method for multi-object scenarios. The proposed CVAM-Pose outperforms competing latent space approaches. For example, it is respectively 25% and 20% better than AAE and Multi-Path methods, when evaluated using the $\mathrm{AR_{VSD}}$ metric on the Linemod-Occluded dataset. It also achieves results somewhat comparable to methods reliant on 3D models reported in BOP challenges. Code available: https://github.com/JZhao12/CVAM-Pose
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The problem this paper addresses is estimating the six-degree-of-freedom (6-DoF) poses of multiple rigid objects in real time from monocular images. Specifically, it targets several major limitations of existing methods:

1. **Single-network multi-object processing**: Most existing methods adopt a "one-network-per-object-class" strategy, which consumes substantial resources and scales poorly to multi-object scenarios. The proposed method, CVAM-Pose, uses a single network to estimate the poses of multiple objects, improving efficiency and scalability.
2. **No dependence on 3D models or depth data**: Many existing methods require 3D models or depth data and usually need time-consuming iterative optimization at inference. CVAM-Pose uses only monocular images and requires no 3D models, depth data or post-processing optimization, so it applies to a wider range of scenarios.
3. **Robustness**: Existing methods perform poorly under occlusion, lack of texture, truncation and scene clutter. CVAM-Pose improves robustness in these challenging conditions through a Conditional Variational Autoencoder (CVAE) and a continuous pose regression strategy.

### Main contributions of the paper

1. **Conditional generative model**: For the first time, a conditional generative model is used to efficiently represent the poses of multiple objects, with an adaptive label embedding technique improving the learning of high-level features.
2. **Label embedding technique**: An inter-layer one-hot encoding technique embeds class labels into each layer of the network, strengthening the learning of multi-object representations.
3. **Regularized and constrained latent space**: A regularized and constrained latent space is constructed so that a single latent space can represent multiple objects without degrading pose accuracy.
4. **Continuous pose regression algorithm**: A Multi-Layer Perceptron (MLP) regresses a continuous pose representation from the latent representation, avoiding the discretization problem and achieving fast and accurate multi-object pose estimation.

### Experimental results

- **Ablation experiments**: A series of ablation experiments verifies the effect of the adaptive label embedding technique, latent space regularization and latent space dimension.
- **Benchmark tests**: On the Linemod-Occluded dataset, CVAM-Pose significantly outperforms other latent-representation-based methods such as AAE, AAE-ICP and Multi-Path on multiple metrics.
- **Comparison with 3D-model-dependent methods**: Although it does not depend on 3D models, CVAM-Pose achieves results comparable to those of 3D-model-dependent methods (such as EPOS, CDPN and PVNet) on some metrics.

### Conclusion

The paper proposes a new multi-object monocular pose estimation method, CVAM-Pose. Through a conditional generative model, a label embedding technique and a continuous pose regression strategy, it addresses the shortcomings of existing methods in resource consumption, robustness and real-time processing. The method performs well in complex scenes (such as occlusion and lack of texture), providing a new solution for 6-DoF pose estimation.
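
To make the ingredients named in the contributions more concrete, here is a minimal, illustrative sketch of three of them: one-hot label embedding, the standard VAE regulariser that pulls the latent space towards a unit Gaussian, and a continuous rotation representation. It is not the authors' implementation. All function names are hypothetical, and the label is appended only to a flat vector rather than to every network layer. The 6D Gram-Schmidt rotation parameterisation is an assumption chosen as one well-known continuous representation; the text above does not state which representation CVAM-Pose uses.

```python
import math


def one_hot(class_id, num_classes):
    """Return a one-hot list encoding class_id among num_classes."""
    if not 0 <= class_id < num_classes:
        raise ValueError("class_id out of range")
    return [1.0 if i == class_id else 0.0 for i in range(num_classes)]


def embed_label(features, class_id, num_classes):
    """Append the one-hot class label to a flat feature (or latent) vector.

    Assumption: simple concatenation stands in for the paper's
    label embedding, which is applied throughout the network.
    """
    return list(features) + one_hot(class_id, num_classes)


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), the usual VAE regulariser."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def _normalise(v):
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in v]


def rotation_from_6d(v):
    """Map a 6-vector to a 3x3 rotation matrix via Gram-Schmidt.

    Assumption: this 6D parameterisation is only an example of a
    continuous rotation representation.
    """
    a1, a2 = v[:3], v[3:6]
    b1 = _normalise(a1)
    dot = sum(p * q for p, q in zip(b1, a2))
    # Remove the component of a2 along b1, then normalise.
    b2 = _normalise([q - dot * p for p, q in zip(b1, a2)])
    # The third axis is the cross product b1 x b2.
    b3 = [b1[1] * b2[2] - b1[2] * b2[1],
          b1[2] * b2[0] - b1[0] * b2[2],
          b1[0] * b2[1] - b1[1] * b2[0]]
    # Rows of the matrix whose columns are b1, b2, b3.
    return [[b1[i], b2[i], b3[i]] for i in range(3)]


if __name__ == "__main__":
    latent = [0.2, -0.1, 0.4]                 # toy latent code
    conditioned = embed_label(latent, 1, 3)   # latent code plus class label
    print(conditioned)
    print(kl_to_standard_normal([0.2, -0.1, 0.4], [0.0, 0.0, 0.0]))
    print(rotation_from_6d([1.0, 2.0, 3.0, -1.0, 0.5, 2.0]))
```

In a sketch like this, a regressor would take the label-conditioned latent vector as input and output the six rotation parameters, which `rotation_from_6d` then turns into a valid rotation matrix without any discretisation of the pose space.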