Abstract: Based on the Manhattan World assumption, most existing indoor layout estimation schemes focus on recovering layouts from vertically compressed 1D sequences. However, the compression procedure confuses the semantics of different planes, yielding inferior performance with ambiguous interpretability. To address this issue, we propose to disentangle this 1D representation by pre-segmenting orthogonal (vertical and horizontal) planes from a complex scene, explicitly capturing the geometric cues for indoor layout estimation. Considering the symmetry between the floor boundary and ceiling boundary, we also design a soft-flipping fusion strategy to assist the pre-segmentation. Besides, we present a feature assembling mechanism to effectively integrate shallow and deep features with distortion distribution awareness. To compensate for the potential errors in pre-segmentation, we further leverage triple attention to reconstruct the disentangled sequences for better performance. Experiments on four popular benchmarks demonstrate our superiority over existing SoTA solutions, especially on the 3DIoU metric. The code is available at https://github.com/zhijieshen-bjtu/DOPNet.

What problem does this paper attempt to address?

### What problems does this paper attempt to solve?

This paper aims to solve two main problems in indoor panoramic room layout estimation:

1. **Severe distortion of panoramic images**: 360-degree panoramas have a wide field of view (FoV), which introduces severe distortion along the latitude direction and makes it difficult to infer 3D information from 2D images.
2. **Semantic confusion caused by compressed representation**: Most existing indoor layout estimation methods compress the extracted 2D feature maps along the height dimension to obtain 1D sequences. This compression mixes the semantic information of different planes (vertical and horizontal), which degrades performance and makes the representation hard to interpret.

To address these problems, the authors propose a new framework with the following components:

- **Disentangling orthogonal planes**: Pre-segment vertical and horizontal planes to capture geometric cues explicitly and avoid semantic confusion.
- **Soft-flipping fusion strategy**: Exploit the symmetry between the floor boundary and the ceiling boundary to assist the pre-segmentation.
- **Cross-scale distortion-aware feature assembling mechanism**: Integrate shallow geometric structures and deep semantic features while accounting for the distortion distribution.
- **Triple attention mechanism**: Reconstruct the disentangled 1D sequences to compensate for potential errors in pre-segmentation, making the sequences more discriminative and informative.

The method is evaluated on four popular benchmarks, and the results show that it outperforms existing state-of-the-art methods, especially on the 3DIoU metric.
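The semantic confusion described above can be illustrated with a toy example. The sketch below is not the paper's implementation: it assumes a single-channel feature map, average pooling as the compression, and hard binary plane masks, and the helper names `compress_height` and `masked_compress` are hypothetical. It only shows why pooling over the full height blends wall and floor/ceiling features, while pooling within pre-segmented planes keeps them apart.

```python
from typing import List

Grid = List[List[float]]  # H x W map, one channel for brevity (assumption)


def compress_height(feat: Grid) -> List[float]:
    """Average an H x W map over its height, giving a length-W sequence."""
    h, w = len(feat), len(feat[0])
    return [sum(feat[r][c] for r in range(h)) / h for c in range(w)]


def masked_compress(feat: Grid, mask: Grid) -> List[float]:
    """Mask-weighted average over height; columns with an empty mask give 0."""
    h, w = len(feat), len(feat[0])
    out = []
    for c in range(w):
        weight = sum(mask[r][c] for r in range(h))
        total = sum(feat[r][c] * mask[r][c] for r in range(h))
        out.append(total / weight if weight > 0 else 0.0)
    return out


# Toy 4 x 2 feature map: rows 0 and 3 stand for horizontal planes
# (ceiling, floor), rows 1 and 2 stand for vertical planes (walls).
feat = [[1.0, 1.0], [5.0, 7.0], [5.0, 7.0], [1.0, 1.0]]
horizontal = [[1.0, 1.0], [0.0, 0.0], [0.0, 0.0], [1.0, 1.0]]
vertical = [[1.0 - m for m in row] for row in horizontal]

mixed = compress_height(feat)              # wall and floor/ceiling blended
seq_h = masked_compress(feat, horizontal)  # horizontal-plane sequence
seq_v = masked_compress(feat, vertical)    # vertical-plane sequence

print(mixed, seq_h, seq_v)
```

In the toy map, the blended sequence sits between the wall values and the floor/ceiling values, whereas the two masked sequences recover each plane type separately.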
### Formula summary

- **Feature aggregation formula**:
  \[ f_{df}=\sum_{q = 1}^{H\times W}\sum_{k = 1}^{9}\text{Sample}(f,p_{q,k}+\Delta p_{q,k}) \]
  where \(p_{q,k}\) is the sampling coordinate and \(\Delta p_{q,k}\) is the learnable offset.
- **Multi-scale feature fusion formula**:
  \[ f'_{m\tilde{s}}=\text{CSDA}(f_{m\tilde{s}})=\sum_{l = 1}^{L}A_l\cdot\text{reshape}(f_{m\tilde{s}}) \]
  where \(A_l\) is the self-attention weight matrix.
- **Channel discriminative generation mechanism**:
  \[ f'=L_fW=(I - A)fW \]
  where \(L_f\) and \(A\) are the symmetrically normalized Laplacian matrix and the normalized adjacency matrix, respectively.
- **Self-attention formula**:
  \[ \text{Attention}(Q',K',V')=\text{softmax}\left(\frac{Q'(K')^T}{\sqrt{d_k}}\right)V' \]
- **Cross-attention formula**:
  \[ \text{Attention}(Q''_h,K''_v,V''_v)=\text{softmax}\left(\frac{Q''_h(K''_v)^T}{\sqrt{d_k}}\right)V''_v \]

Together, these operations let the model capture and exploit geometric cues while accounting for distortion, which improves indoor layout estimation.
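The two attention formulas share the standard scaled dot-product form, with the softmax applied to the scaled scores before multiplying by the values. The following is a minimal pure-Python sketch of that form, not the paper's code: the function names are hypothetical, learned projections are omitted, and the toy sequences are made up. Self-attention uses one sequence for queries, keys, and values; cross-attention takes queries from one sequence (here standing in for the horizontal-plane sequence) and keys and values from another (the vertical-plane sequence).

```python
import math
from typing import List

Matrix = List[List[float]]  # one row per sequence position


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]


def attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """softmax(Q K^T / sqrt(d_k)) V, computed row by row."""
    scale = math.sqrt(len(k[0]))  # d_k is the key dimension
    out = []
    for q_row in q:
        scores = [sum(a * b for a, b in zip(q_row, k_row)) / scale
                  for k_row in k]
        weights = softmax(scores)  # one weight per key position
        out.append([sum(w * v_row[j] for w, v_row in zip(weights, v))
                    for j in range(len(v[0]))])
    return out


# Toy sequences (assumed values): 3 horizontal positions, 2 vertical positions.
seq_h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
seq_v = [[0.5, 0.5], [2.0, -1.0]]

self_out = attention(seq_h, seq_h, seq_h)   # self-attention
cross_out = attention(seq_h, seq_v, seq_v)  # cross-attention: Q_h, K_v, V_v

# A zero query scores every key equally, so the output is the mean of V.
uniform = attention([[0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]],
                    [[2.0, 4.0], [6.0, 8.0]])
```

The output always has one row per query and as many columns as the value vectors, so the cross-attention result keeps the length of the query sequence.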

Disentangling Orthogonal Planes for Indoor Panoramic Room Layout Estimation with Cross-Scale Distortion Awareness
