Abstract: In this paper, we propose PanoViT, a panorama vision transformer to estimate the room layout from a single panoramic image. Compared to CNN models, our PanoViT is more proficient in learning global information from the panoramic image for the estimation of complex room layouts. Considering the difference between a perspective image and an equirectangular image, we design a novel recurrent position embedding and a patch sampling method for the processing of panoramic images. In addition to extracting global information, PanoViT also includes a frequency-domain edge enhancement module and a 3D loss to extract local geometric features in a panoramic image. Experimental results on several datasets demonstrate that our method outperforms state-of-the-art solutions in room layout prediction accuracy.

What problem does this paper attempt to address?

The problem this paper addresses is room layout estimation from a single panoramic image. Specifically, the authors propose PanoViT, a Vision Transformer-based method for estimating room layouts from 360-degree panoramic images. Compared with traditional Convolutional Neural Network (CNN) models, PanoViT is better at learning global information in panoramic images, which is especially important for estimating complex room layouts.

### Main Problems and Solutions

1. **Limitations of Traditional Methods**:
   - Convolutional Neural Networks (CNNs) struggle with complex room layouts because convolutional filters are better suited to extracting local features, while estimating complex room layouts requires global information from the panoramic image.
   - Panoramic images differ significantly from perspective images in their geometric properties, so models pre-trained on perspective image datasets do not give satisfactory room layout estimates when applied directly.
2. **Innovations of PanoViT**:
   - **Multi-scale Feature Extraction**: PanoViT combines multi-scale features extracted by a CNN backbone network. These features are fed into the Vision Transformer together with the original panoramic image, enhancing the ability to learn global information.
   - **New Position Embedding**: A Recurrent Position Embedding is designed to fit the special properties of panoramic images, especially translational invariance in the horizontal direction.
   - **Frequency-domain Edge Enhancement**: A frequency-domain edge enhancement module is introduced. Edge information in the panoramic image is extracted through a Fourier transform and its inverse, further improving the performance of the model.
   - **3D Loss Function**: A 3D loss function based on the geometric information of panoramic images is designed to measure the error between the prediction and the ground truth more accurately, especially the error in 3D space.

### Experimental Results

- **Datasets**: Experiments were carried out on two datasets, PanoContext and Matterport3D.
- **Performance Comparison**: PanoViT outperforms existing state-of-the-art methods on multiple metrics, especially in the estimation of complex room layouts. For example, on the Matterport3D dataset, PanoViT reaches 82.04% 3D IoU and 84.25% 2D IoU, significantly outperforming other methods.

### Conclusion

By combining the Vision Transformer with several new techniques, PanoViT effectively addresses the problem of estimating complex room layouts from a single panoramic image and demonstrates superior performance on this task.
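
The 3D loss item in the summary says that error is measured in 3D space rather than only in the image. The summary does not give the paper's formula, so the following is only a minimal sketch of the underlying geometry, not PanoViT's loss. It assumes a standard equirectangular mapping, a flat floor, a horizon on the middle image row, and a camera height of 1.6 m; the function names `pixel_to_floor_point` and `distance_3d` are hypothetical.

```python
import math


def pixel_to_floor_point(u, v, width, height, camera_height=1.6):
    """Map an equirectangular pixel below the horizon to a 3D floor point.

    Assumptions (not from the paper): the camera is camera_height metres
    above a flat floor, and the horizon lies on the middle image row.
    """
    lon = (u / width - 0.5) * 2.0 * math.pi   # longitude, -pi at the left edge
    lat = (0.5 - v / height) * math.pi        # latitude, positive above the horizon
    if lat >= 0:
        raise ValueError("pixel must lie below the horizon to hit the floor")
    dist = camera_height / math.tan(-lat)     # horizontal distance to the floor point
    return (dist * math.sin(lon), -camera_height, dist * math.cos(lon))


def distance_3d(p, q):
    """Euclidean distance between two 3D points."""
    return math.dist(p, q)


if __name__ == "__main__":
    W, H = 1024, 512
    # The same 5-pixel vertical error, once near the horizon and once near the camera.
    near_horizon = distance_3d(pixel_to_floor_point(512, 270, W, H),
                               pixel_to_floor_point(512, 275, W, H))
    near_camera = distance_3d(pixel_to_floor_point(512, 400, W, H),
                              pixel_to_floor_point(512, 405, W, H))
    print(f"5 px error near the horizon: {near_horizon:.2f} m")
    print(f"5 px error near the camera:  {near_camera:.2f} m")
```

Under these assumptions, the same pixel error maps to a much larger 3D error for floor points close to the horizon than for points close to the camera, which is one reason a loss defined in 3D space can weight mistakes differently from a pixel-space loss. The left and right image edges (`u = 0` and `u = width`) also map to the same 3D point, reflecting the horizontal wrap-around of a panorama.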

PanoViT: Vision Transformer for Room Layout Estimation from a Single Panoramic Image

DeepPanoContext: Panoramic 3D Scene Understanding with Holistic Scene Context Graph and Relation-based Optimization

PanoContext-Former: Panoramic Total Scene Understanding with a Transformer

LayoutNet: Reconstructing the 3D Room Layout from a Single RGB Image

GLPanoDepth: Global-to-Local Panoramic Depth Estimation

PanoFormer: Panorama Transformer for Indoor 360 Depth Estimation

PanoSwin: a Pano-style Swin Transformer for Panorama Understanding

GPR-Net: Multi-view Layout Estimation via a Geometry-aware Panorama Registration Network

3D Room Layout Estimation from a Cubemap of Panorama Image Via Deep Manhattan Hough Transform

3D Orientation Estimation and Vanishing Point Extraction from Single Panoramas Using Convolutional Neural Network

Indoor Panorama Planar 3D Reconstruction via Divide and Conquer

Multi-Viewpoint Panorama Construction with Wide-Baseline Images

Local-to-Global Panorama Inpainting for Locale-Aware Indoor Lighting Prediction

Transferable End-to-end Room Layout Estimation via Implicit Encoding

DuLa-Net: A Dual-Projection Network for Estimating Room Layouts from a Single RGB Panorama

Panoramic Vision Transformer for Saliency Detection in 360° Videos

Layouts from Panoramic Images with Geometry and Deep Learning

Pano2Room: Novel View Synthesis from a Single Indoor Panorama

Manhattan Room Layout Reconstruction from a Single 360° Image: A Comparative Study of State-of-the-Art Methods

360 Layout Estimation via Orthogonal Planes Disentanglement and Multi-view Geometric Consistency Perception