ViT-MPI: Vision Transformer Multiplane Images for Surgical Single-View View Synthesis.

Chenming Han, Ruizhi Shao, Gaochang Wu, Hang Shao, Yebin Liu
DOI: https://doi.org/10.1007/978-981-99-8850-1_3
2024-01-01
Abstract: In this paper, we explore the use of a single imaging device to acquire immersive 3D perception in endoscopic surgery. To address the heavily ill-posed problem caused by unknown depth and unseen occlusions, we introduce a Vision Transformer (ViT)-based Multiplane Image (MPI) representation, termed ViT-MPI, for continuous novel view synthesis from a single-view input. The MPI representation provides layered depth images to explicitly decode positional relationships between tissues. Instead of using the existing fully convolutional network as the backbone of our MPI representation, we exploit the ViT architecture to collect the tokens output from all stages of the transformer and combine them into feature representations at different resolutions. The interactions between tokens in the ViT provide accurate predictions of local and global positional relations, ensuring reliable view synthesis of occluded regions with fine-grained details. Experiments on real-captured endoscopic surgery images from the da Vinci Surgical Robot System demonstrate that our proposed approach enables the prediction of multi-view images from a single-view input. Moreover, our method produces reasonable depth maps, further enhancing its practical applicability.
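The abstract does not spell out how an MPI is turned into an image or a depth map. As a minimal sketch, assuming the standard MPI formulation (a stack of fronto-parallel RGBA planes blended back to front with the "over" operator) rather than the authors' exact implementation, the rendering step might look like the following. The function names, array layout, and plane ordering are illustrative assumptions, and the warping of planes into a novel view is omitted.

```python
import numpy as np

def composite_mpi(colors, alphas):
    """Blend MPI planes back to front with the 'over' operator.

    colors: (D, H, W, 3) per-plane RGB, plane 0 is the farthest (assumed layout).
    alphas: (D, H, W, 1) per-plane opacity in [0, 1].
    Returns an (H, W, 3) image.
    """
    out = np.zeros_like(colors[0])
    for rgb, a in zip(colors, alphas):  # far -> near
        out = rgb * a + out * (1.0 - a)
    return out

def expected_depth(alphas, depths):
    """Alpha-weighted depth map from the same planes.

    depths: (D,) depth of each plane, in the same far-to-near order.
    Returns an (H, W, 1) depth map.
    """
    depth = np.zeros(alphas.shape[1:])
    transmittance = np.ones(alphas.shape[1:])  # light not yet absorbed by nearer planes
    for i in range(alphas.shape[0] - 1, -1, -1):  # near -> far
        depth += alphas[i] * transmittance * depths[i]
        transmittance *= 1.0 - alphas[i]
    return depth
```

In this sketch a half-transparent near plane lets half of the far plane show through, which is how a layered representation can reveal occluded tissue when the planes are re-projected to a new viewpoint.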