Abstract: Transparent objects are common in daily life and industrial production. Unlike opaque objects, they are hard to identify in RGB images, so depth information is often needed to locate them in the image. However, reflection, refraction, and other environmental factors often make the measured depth of transparent objects inaccurate. This makes it difficult for robots to grasp transparent objects, because incorrect depth information can cause grasp pose prediction to fail or to return a wrong pose. It is therefore necessary to complete the depth information of transparent objects. Previous depth completion methods for transparent objects often struggle to balance accuracy and real-time performance. To address this, we propose TCRNet, a transparent object depth completion network based on a cascade refinement structure that balances the two. First, the network incorporates a cascade refinement structure in the decoding stage to refine features multiple times, improving the accuracy of the depth information. Second, an attention module adjusts the extracted features so that the network focuses on depth features in transparent object regions. Finally, a transformer-based error module in the network’s final output stage predicts and corrects the error between the depth image and the ground truth. TCRNet is trained and tested on three datasets: ClearGrasp, Omniverse Object, and TransCG, where it outperforms previous methods. Furthermore, TCRNet is combined with existing grasp detection methods to conduct grasping experiments on transparent objects with a real Baxter robot.
Note to Practitioners: With the development of RGB-D camera technology, RGB-D cameras are now widely used in scenarios such as industrial production, autonomous driving, and robot grasping. However, when the camera faces transparent or highly reflective objects, the captured depth information is often not accurate enough, which can lead to accidents downstream. It is therefore necessary to repair and complete the depth images to obtain an accurate understanding of the scene’s depth. In recent years, with the advancement of deep learning, deep learning-based depth image processing and restoration techniques have been widely applied. In this paper, we propose a high-accuracy network for repairing depth images of transparent objects, which can restore and estimate the depth of transparent objects in a variety of scenarios. Experimental results also show that the proposed method generalizes well to unseen scenes.
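The final-stage error adjustment described in the abstract can be illustrated with a small sketch. This is not the TCRNet implementation: the function names `apply_error_correction` and `masked_rmse`, the sign convention (error taken as ground truth minus estimate), and the choice to apply the correction only inside a transparent-object mask are all assumptions made for illustration. In the paper the error is predicted by a transformer-based module; here it is simply supplied as an input.

```python
import math
from typing import List

Grid = List[List[float]]
Mask = List[List[bool]]


def apply_error_correction(depth: Grid, predicted_error: Grid, mask: Mask) -> Grid:
    """Add a predicted error map to a depth map where mask is True.

    Assumption: the error is defined as (ground truth - estimate), so adding
    it moves the estimate toward the ground truth. Pixels outside the mask
    are returned unchanged.
    """
    return [
        [d + e if m else d for d, e, m in zip(d_row, e_row, m_row)]
        for d_row, e_row, m_row in zip(depth, predicted_error, mask)
    ]


def masked_rmse(pred: Grid, truth: Grid, mask: Mask) -> float:
    """Root-mean-square error over the pixels selected by mask."""
    squared = [
        (p - t) ** 2
        for p_row, t_row, m_row in zip(pred, truth, mask)
        for p, t, m in zip(p_row, t_row, m_row)
        if m
    ]
    if not squared:
        raise ValueError("mask selects no pixels")
    return math.sqrt(sum(squared) / len(squared))


# Toy 2x2 scene (depth in meters); the right column is "transparent".
truth = [[1.0, 1.0], [1.0, 2.0]]
coarse = [[1.0, 0.5], [1.0, 2.5]]
mask = [[False, True], [False, True]]
# Hypothetical output of an error-prediction module (here: the exact error).
error = [[0.0, 0.5], [0.0, -0.5]]

corrected = apply_error_correction(coarse, error, mask)
rmse_before = masked_rmse(coarse, truth, mask)
rmse_after = masked_rmse(corrected, truth, mask)
```

Restricting the metric to the mask reflects that the depth errors of interest lie in the transparent regions; a whole-image metric would dilute them with pixels that were already correct.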

TODE-Trans: Transparent Object Depth Estimation with Transformer

A Transformer-Based Object Detector with Coarse-Fine Crossing Representations

TDCNet: Transparent Objects Depth Completion with CNN-Transformer Dual-Branch Parallel Network

To-Former: semantic segmentation of transparent object with edge-enhanced transformer

DFTR: Depth-supervised Fusion Transformer for Salient Object Detection

DFNet-Trans: An end-to-end multibranching network for depth estimation for transparent objects

ClearDepth: Enhanced Stereo Perception of Transparent Objects for Robotic Manipulation

Introducing Depth into Transformer-based 3D Object Detection

Transformer Transforms Salient Object Detection and Camouflaged Object Detection

TAMDepth: self-supervised monocular depth estimation with transformer and adapter modulation

Trans4Trans: Efficient Transformer for Transparent Object and Semantic Scene Segmentation in Real-World Navigation Assistance

TCRNet: Transparent Object Depth Completion with Cascade Refinements

TinyDepth: Lightweight Self-Supervised Monocular Depth Estimation Based on Transformer

Trans4Trans: Efficient Transformer for Transparent Object Segmentation to Help Visually Impaired People Navigate in the Real World

MonoDTR: Monocular 3D Object Detection with Depth-Aware Transformer

Depthformer : Multiscale Vision Transformer For Monocular Depth Estimation With Local Global Information Fusion

TransCG: A Large-Scale Real-World Dataset for Transparent Object Depth Completion and a Grasping Baseline

Segmenting Transparent Object in the Wild with Transformer

Depth-Guided Vision Transformer With Normalizing Flows for Monocular 3D Object Detection

Deformable DETR: Deformable Transformers for End-to-End Object Detection

TransDSSL: Transformer Based Depth Estimation via Self-Supervised Learning