PanoViT: Vision Transformer for Room Layout Estimation from a Single Panoramic Image

Weichao Shen, Yuan Dong, Zonghao Chen, Zhengyi Zhao, Yang Gao, Zhu Liu
DOI: https://doi.org/10.48550/arXiv.2212.12156
2022-12-23
Abstract: In this paper, we propose PanoViT, a panorama vision transformer to estimate the room layout from a single panoramic image. Compared to CNN models, our PanoViT is more proficient in learning global information from the panoramic image for the estimation of complex room layouts. Considering the difference between a perspective image and an equirectangular image, we design a novel recurrent position embedding and a patch sampling method for the processing of panoramic images. In addition to extracting global information, PanoViT also includes a frequency-domain edge enhancement module and a 3D loss to extract local geometric features in a panoramic image. Experimental results on several datasets demonstrate that our method outperforms state-of-the-art solutions in room layout prediction accuracy.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The problem this paper addresses is room layout estimation from a single panoramic image. Specifically, the authors propose PanoViT, a Vision Transformer-based method for estimating room layouts from 360-degree panoramic images. Compared with traditional Convolutional Neural Network (CNN) models, PanoViT is better at learning global information in panoramic images, which is especially important for estimating complex room layouts.

### Main Problems and Solutions

1. **Limitations of Traditional Methods**:
   - Convolutional Neural Networks (CNNs) struggle with complex room layouts because convolutional filters are best suited to extracting local features, whereas estimating a complex layout requires global information from the whole panorama.
   - Panoramic images differ significantly from perspective images in their geometric properties, so models pre-trained on perspective image datasets do not yield satisfactory room layout estimates when applied directly.
2. **Innovations of PanoViT**:
   - **Multi-scale Feature Extraction**: PanoViT combines multi-scale features extracted by a CNN backbone network. These features are fed into the Vision Transformer together with the original panoramic image, strengthening its ability to learn global information.
   - **New Position Embedding**: A Recurrent Position Embedding is designed to suit the special properties of panoramic images, in particular their translational invariance in the horizontal direction.
   - **Frequency-domain Edge Enhancement**: A frequency-domain edge enhancement module extracts edge information from the panoramic image through a Fourier transform and its inverse, further improving model performance.
   - **3D Loss Function**: A 3D loss function based on the geometric information of panoramic images measures the error between the prediction and the ground truth more accurately, especially the error in 3D space.

### Experimental Results

- **Datasets**: Experiments were carried out on two datasets, PanoContext and Matterport3D.
- **Performance Comparison**: PanoViT outperforms existing state-of-the-art methods on multiple metrics, especially for complex room layouts. For example, on the Matterport3D dataset, PanoViT reaches 82.04% 3D IoU and 84.25% 2D IoU, outperforming the other methods compared.

### Conclusion

By combining the Vision Transformer with these techniques, PanoViT addresses the problem of estimating complex room layouts from a single panoramic image and demonstrates superior performance on this task.
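
### Illustrative Sketch: Frequency-domain Edge Enhancement

The summary above only states that edge information is extracted through a Fourier transform and its inverse; it does not give the module's details. The following is a minimal NumPy sketch of one common way to do this, assuming a simple radial high-pass mask in the frequency domain. The function name `fft_edge_enhance` and the parameters `cutoff` and `strength` are hypothetical and are not taken from the paper.

```python
import numpy as np


def fft_edge_enhance(image, cutoff=0.1, strength=1.0):
    """Return (enhanced, edges) for a 2D grayscale image.

    Assumption (not from the paper): edges are isolated with a hard
    radial high-pass mask. `cutoff` is in cycles per pixel, and
    `strength` scales how much edge signal is added back.
    """
    image = np.asarray(image, dtype=np.float64)
    height, width = image.shape

    # Forward 2D Fourier transform of the image.
    spectrum = np.fft.fft2(image)

    # Frequency coordinates (cycles per pixel) matching fft2's layout.
    fy = np.fft.fftfreq(height)[:, None]
    fx = np.fft.fftfreq(width)[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)

    # Keep only components above the cutoff; this also removes the DC term.
    high_pass = (radius > cutoff).astype(np.float64)

    # Inverse transform of the filtered spectrum gives the edge map.
    edges = np.fft.ifft2(spectrum * high_pass).real

    return image + strength * edges, edges


# Usage: a synthetic panorama-shaped image with one vertical boundary.
pano = np.zeros((16, 32))
pano[:, 16:] = 1.0
enhanced, edges = fft_edge_enhance(pano, cutoff=0.1)
```

The discrete Fourier transform treats the image as periodic, which matches the horizontal wrap-around of an equirectangular panorama but not its top and bottom borders, so a real module would likely need to handle the vertical direction separately.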