Object Pose Estimation Using Implicit Representation For Transparent Objects

Varun Burde, Artem Moroz, Vit Zeman, Pavel Burget
2024-10-17
Abstract: Object pose estimation is a prominent task in computer vision. The object pose gives the orientation and translation of the object in real-world space, which enables applications such as manipulation and augmented reality. Objects interact with light in different ways, for example through reflection and absorption, which makes it challenging to understand an object's structure from RGB and depth channels. Recent research has been moving toward learning-based methods, which use deep learning to provide a more flexible and generalizable approach to object pose estimation. One such approach is the render-and-compare method, which renders the object from multiple views and compares the renders against the given 2D image; it often requires an object representation in the form of a CAD model. We reason that the synthetic texture of the CAD model may not be ideal for rendering and comparing operations. We show that if the object is represented implicitly, as a Neural Radiance Field (NeRF), the renders are more realistic with respect to the actual scene and retain the crucial spatial features, which makes the comparison more versatile. We evaluated our NeRF implementation of the render-and-compare method on datasets of transparent objects and found that it surpassed the current state-of-the-art results.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses 6D pose estimation of transparent objects. Specifically, the authors aim to improve pose estimation accuracy for transparent objects by using an implicit representation, namely a Neural Radiance Field (NeRF). Traditional methods struggle with transparent, reflective, or otherwise non-Lambertian surfaces, because the appearance of these surfaces depends on the viewing angle and the background and is difficult to represent with conventional explicit textures.

### Main contributions of the paper

1. **A 6D pose estimation pipeline for transparent objects based on a single RGB image and sparse multi-view images**: The method does not require pre-training a pose estimator for specific object instances.
2. **The classical render-and-compare method combined with NeRF view synthesis**: NeRF generates high-quality, view-dependent hypotheses of the transparent object, which improves the accuracy of pose estimation.
3. **Evaluation on four large-scale datasets**: The datasets contain transparent and reflective household items in complex environments, and the method is evaluated with multiple metrics (MSPD, MSSD, ADD, ADD-S, translation error, rotation error, and 3D IoU).

### Method overview

1. **Data collection**: Simulate realistic non-Lambertian properties by applying reflection or transmission shaders to CAD models, and render high-quality images from different viewing angles to optimize the NeRF.
2. **NeRF training**: Train the NeRF using the volume rendering equation so that it can represent scenes with complex geometry and appearance.
3. **Coarse estimation and refinement block**: First, obtain a coarse estimate by treating the choice among sampled rendered views as a classification task, then gradually refine the pose by iteratively adding small translation and rotation adjustments.
4. **Fine-tuning process**: Fine-tune MegaPose6D and NeRF view synthesis on a synthetic dataset containing transparent objects to improve performance on such objects.

### Evaluation results

The experimental results show that this method outperforms existing methods on multiple benchmark datasets. In the glass category in particular, fine-tuning significantly improves the results, even under more stringent IoU thresholds.

### Conclusion

The NeRF-based render-and-compare method proposed in this paper demonstrates the potential for pose estimation of unseen transparent objects using only RGB images. As a representation, NeRF renders more realistically across viewing angles, which improves the accuracy of pose estimation.
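The NeRF training step in the method overview relies on the volume rendering equation. As a rough illustration, the sketch below uses the discretized form common in NeRF implementations: each sample along a ray contributes its color weighted by its opacity and by the transmittance accumulated before it. The function name `render_ray` and its arguments are illustrative assumptions, not names from the paper.

```python
import numpy as np

def render_ray(sigmas, colors, deltas):
    """Composite one ray from per-sample densities and colors.

    sigmas: (N,) volume densities at the samples (assumed non-negative)
    colors: (N, 3) RGB values at the samples
    deltas: (N,) distances between consecutive samples
    """
    # Opacity of each sample: alpha_i = 1 - exp(-sigma_i * delta_i)
    alphas = 1.0 - np.exp(-sigmas * deltas)
    # Transmittance T_i: fraction of light that survives all earlier samples
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alphas[:-1])])
    # Contribution weight of each sample to the final pixel color
    weights = trans * alphas
    rgb = (weights[:, None] * colors).sum(axis=0)
    return rgb, weights
```

A ray through empty space (all densities zero) yields a black pixel with zero weights, while a very dense first sample absorbs the ray and returns that sample's color.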
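The coarse estimation and refinement block in the method overview can be pictured as a two-stage loop. The sketch below is a simplified assumption of how such a loop might look, not the MegaPose6D implementation: `score_fn` and `predict_update` are hypothetical stand-ins for the learned networks that would compare NeRF renders against the input image, and the update is applied as a plain left-multiplied rotation plus an additive translation.

```python
import numpy as np

def rodrigues(rotvec):
    """Rotation matrix for an axis-angle vector (Rodrigues' formula)."""
    theta = np.linalg.norm(rotvec)
    if theta < 1e-12:
        return np.eye(3)
    k = rotvec / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def coarse_select(hypotheses, score_fn):
    """Pick the (R, t) hypothesis with the highest score.

    score_fn is a hypothetical callable; in a real pipeline it would
    render the object at (R, t) and score the render against the image.
    """
    scores = [score_fn(R, t) for R, t in hypotheses]
    return hypotheses[int(np.argmax(scores))]

def refine(R, t, predict_update, n_iters=5):
    """Iteratively apply small rotation and translation corrections.

    predict_update is a hypothetical callable returning an axis-angle
    rotation step and a translation step for the current pose.
    """
    for _ in range(n_iters):
        d_rotvec, d_t = predict_update(R, t)
        R = rodrigues(d_rotvec) @ R  # small rotation adjustment
        t = t + d_t                  # small translation adjustment
    return R, t
```

Keeping the scorer and the update predictor as injected callables separates the loop structure from the learned components, so the same skeleton works whether the renders come from a CAD model or from a NeRF.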
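Two of the evaluation metrics listed among the contributions, ADD and ADD-S, are simple to state in code. The sketch below follows their commonly used definitions, which are assumed here to match the paper's usage: ADD averages the distance between corresponding model points under the estimated and ground-truth poses, while ADD-S averages the distance from each ground-truth point to its nearest estimated point, which makes it tolerant of object symmetries. The function names are illustrative.

```python
import numpy as np

def transform(points, R, t):
    """Apply the rigid transform x -> R x + t to an (N, 3) point array."""
    return points @ R.T + t

def add_metric(points, R_est, t_est, R_gt, t_gt):
    """ADD: mean distance between corresponding transformed model points."""
    est = transform(points, R_est, t_est)
    gt = transform(points, R_gt, t_gt)
    return float(np.linalg.norm(est - gt, axis=1).mean())

def add_s_metric(points, R_est, t_est, R_gt, t_gt):
    """ADD-S: mean distance from each ground-truth point to the closest
    estimated point. Uses an O(N^2) distance matrix, so it is only
    suitable for small point sets."""
    est = transform(points, R_est, t_est)
    gt = transform(points, R_gt, t_gt)
    dists = np.linalg.norm(gt[:, None, :] - est[None, :, :], axis=2)
    return float(dists.min(axis=1).mean())
```

For a point set that maps onto itself under a 90-degree rotation, ADD reports a large error for that rotation while ADD-S reports none. This is why ADD-S is the usual choice for symmetric objects, a category that includes many glasses and bottles.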