Jianyu Zhao, Wei Quan, Bogdan J. Matuszewski
Abstract: Estimating rigid objects' poses is one of the fundamental problems in computer vision, with a range of applications across automation and augmented reality. Most existing approaches adopt one network per object class strategy, depend heavily on objects' 3D models, depth data, and employ a time-consuming iterative refinement, which could be impractical for some applications. This paper presents a novel approach, CVAM-Pose, for multi-object monocular pose estimation that addresses these limitations. The CVAM-Pose method employs a label-embedded conditional variational autoencoder network, to implicitly abstract regularised representations of multiple objects in a single low-dimensional latent space. This autoencoding process uses only images captured by a projective camera and is robust to objects' occlusion and scene clutter. The classes of objects are one-hot encoded and embedded throughout the network. The proposed label-embedded pose regression strategy interprets the learnt latent space representations utilising continuous pose representations. Ablation tests and systematic evaluations demonstrate the scalability and efficiency of the CVAM-Pose method for multi-object scenarios. The proposed CVAM-Pose outperforms competing latent space approaches. For example, it is respectively 25% and 20% better than AAE and Multi-Path methods, when evaluated using the $\mathrm{AR_{VSD}}$ metric on the Linemod-Occluded dataset. It also achieves results somewhat comparable to methods reliant on 3D models reported in BOP challenges. Code available: https://github.com/JZhao12/CVAM-Pose
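The abstract states that object classes are one-hot encoded and embedded throughout the network. The following is a minimal sketch of that idea, not the authors' implementation: the helper names `one_hot`, `embed_label`, and `forward` are hypothetical, features are plain lists rather than convolutional feature maps, and concatenation is assumed as the embedding mechanism.

```python
def one_hot(class_id, num_classes):
    """Return a one-hot list, e.g. [0.0, 1.0, 0.0] for class_id=1 of 3."""
    if not 0 <= class_id < num_classes:
        raise ValueError("class_id out of range")
    return [1.0 if i == class_id else 0.0 for i in range(num_classes)]


def embed_label(features, class_id, num_classes):
    """Append the one-hot class label to one layer's feature vector.

    Assumption: the label is attached by concatenation.
    """
    return list(features) + one_hot(class_id, num_classes)


def forward(features, class_id, num_classes, layers):
    """Run hypothetical layers, re-attaching the label before each one.

    `layers` is any sequence of callables mapping a list to a list.
    """
    for layer in layers:
        features = layer(embed_label(features, class_id, num_classes))
    return features


# Each layer sees the same class label next to its own input features.
print(embed_label([0.5, 0.2], class_id=1, num_classes=3))
# [0.5, 0.2, 0.0, 1.0, 0.0]
```

Because the label is supplied again at every layer, one set of weights can in principle specialise its behaviour per object class, which is what allows a single network to serve several objects.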
What problem does this paper attempt to address?
This paper addresses real-time estimation of the six-degree-of-freedom (6-DoF) poses of multiple rigid objects from monocular images. Specifically, it targets several major limitations of existing methods:
1. **Single-network multi-object processing**: Most existing methods adopt a "one-network-per-object-class" strategy, which is resource-intensive and scales poorly to multi-object scenarios. The proposed method, CVAM-Pose, uses a single network to estimate the poses of multiple objects, improving efficiency and scalability.
2. **No dependence on 3D models or depth data**: Many existing methods require 3D models or depth data and usually rely on time-consuming iterative refinement at inference. CVAM-Pose uses only monocular images and requires no 3D models, depth data, or post-processing refinement, which makes it applicable to a wider range of scenarios.
3. **Robustness**: Existing methods perform poorly under occlusion, lack of texture, truncation, and scene clutter. CVAM-Pose improves robustness to these conditions through a conditional variational autoencoder (CVAE) and a continuous pose regression strategy.
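A variational autoencoder regularises its latent space by penalising the distance between the encoder's output distribution and a prior. The sketch below shows the standard form of that penalty and the reparameterisation step. It is a generic VAE formulation, not the paper's exact loss: the standard normal prior, the `beta` weight, and the function names are assumptions.

```python
import math
import random


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dims."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))


def reparameterise(mu, log_var, rng=random):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), one eps per dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def cvae_loss(recon_error, mu, log_var, beta=1.0):
    """Reconstruction error plus a weighted KL regulariser.

    Assumption: `beta` is a hypothetical weighting factor; in a conditional
    VAE, `mu` and `log_var` would come from an encoder that also receives
    the class label.
    """
    return recon_error + beta * kl_to_standard_normal(mu, log_var)


# A latent code that already matches the prior incurs no penalty.
print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))  # 0.0
```

The KL term pulls the codes of all objects towards a common, compact region, which is one plausible reading of how a single latent space can hold several objects.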
### Main contributions of the paper
1. **Conditional generative model**: According to the paper, this is the first use of a conditional generative model to efficiently represent the poses of multiple objects; an adaptive label embedding technique improves the learning of high-level features.
2. **Label embedding technique**: An inter-layer one-hot encoding technique embeds class labels into each layer of the network, strengthening the learning of multi-object representations.
3. **Regularized and constrained latent space**: A regularized and constrained latent space represents multiple objects, so a single latent space extends to multi-object representation without degrading pose accuracy.
4. **Continuous pose regression algorithm**: A multi-layer perceptron (MLP) regresses a continuous pose representation from the latent representation, which avoids discretization and yields fast and accurate multi-object pose estimation.
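To illustrate what a continuous pose representation can look like, the sketch below converts a 6D rotation vector into a rotation matrix by Gram-Schmidt orthogonalisation, following the parameterisation of Zhou et al. (2019). Whether CVAM-Pose uses exactly this representation is an assumption; the summary only states that the regressed representation is continuous. The function name `rotation_from_6d` is hypothetical.

```python
import math


def _normalise(v):
    """Scale a 3-vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in v]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def rotation_from_6d(rep):
    """Map six regressed numbers to a 3x3 rotation matrix (list of rows).

    The first three numbers give the direction of the first column; the
    last three are orthogonalised against it to give the second column;
    the third column is their cross product.
    """
    a1, a2 = rep[:3], rep[3:6]
    b1 = _normalise(a1)
    proj = _dot(b1, a2)
    b2 = _normalise([y - proj * x for x, y in zip(b1, a2)])
    b3 = _cross(b1, b2)
    return [[b1[i], b2[i], b3[i]] for i in range(3)]


# Unnormalised, non-orthogonal inputs still yield a valid rotation.
for row in rotation_from_6d([2.0, 0.0, 0.0, 1.0, 3.0, 0.0]):
    print(row)
```

Any pair of non-parallel, non-zero 3-vectors maps to a valid rotation, so an MLP output needs no discretisation into pose bins or lookup in a codebook.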
### Experimental results
- **Ablation experiments**: A series of ablation experiments verifies the effectiveness of the adaptive label embedding technique and latent space regularization, and examines the effect of the latent space dimension.
- **Benchmark tests**: On the Linemod-Occluded dataset, CVAM-Pose significantly outperforms other latent-representation-based methods such as AAE, AAE-ICP, and Multi-Path on multiple metrics. For example, the abstract reports that it is 25% better than AAE and 20% better than Multi-Path on the $\mathrm{AR_{VSD}}$ metric.
- **Comparison with 3D-model-dependent methods**: Although it does not depend on 3D models, CVAM-Pose achieves results comparable to those of 3D-model-dependent methods (such as EPOS, CDPN, and PVNet) on some metrics.
### Conclusion
The paper proposes CVAM-Pose, a new multi-object monocular pose estimation method. By combining a conditional generative model, a label embedding technique, and a continuous pose regression strategy, it addresses the shortcomings of existing methods in resource consumption, robustness, and real-time processing. The method performs well in challenging scenes (such as occlusion and lack of texture), offering a new approach to 6-DoF pose estimation.