Disentangling Orthogonal Planes for Indoor Panoramic Room Layout Estimation with Cross-Scale Distortion Awareness

Zhijie Shen, Zishuo Zheng, Chunyu Lin, Lang Nie, Kang Liao, Shuai Zheng, Yao Zhao
2023-03-04
Abstract: Based on the Manhattan World assumption, most existing indoor layout estimation schemes focus on recovering layouts from vertically compressed 1D sequences. However, the compression procedure confuses the semantics of different planes, yielding inferior performance with ambiguous interpretability. To address this issue, we propose to disentangle this 1D representation by pre-segmenting orthogonal (vertical and horizontal) planes from a complex scene, explicitly capturing the geometric cues for indoor layout estimation. Considering the symmetry between the floor boundary and ceiling boundary, we also design a soft-flipping fusion strategy to assist the pre-segmentation. Besides, we present a feature assembling mechanism to effectively integrate shallow and deep features with distortion distribution awareness. To compensate for the potential errors in pre-segmentation, we further leverage triple attention to reconstruct the disentangled sequences for better performance. Experiments on four popular benchmarks demonstrate our superiority over existing SoTA solutions, especially on the 3DIoU metric. The code is available at https://github.com/zhijieshen-bjtu/DOPNet.
Computer Vision and Pattern Recognition
### What problems does this paper attempt to solve?

This paper addresses two main problems in indoor panoramic room layout estimation:

1. **Severe distortion of panoramic images**: Because of their wide field of view (FoV), 360-degree panoramas are severely distorted along the latitude direction, which makes it difficult to infer 3D information from the 2D image.
2. **Semantic confusion caused by the compressed representation**: Most existing indoor layout estimation methods compress the extracted 2D feature maps along the height dimension to obtain 1D sequences. This compression mixes the semantic information of different planes (vertical and horizontal), which degrades performance and makes the representation hard to interpret.

To address these problems, the authors propose a new framework with the following components:

- **Disentangling orthogonal planes**: Vertical and horizontal planes are pre-segmented so that geometric cues are captured explicitly and the semantic confusion is avoided.
- **Soft-flipping fusion strategy**: The symmetry between the floor boundary and the ceiling boundary is exploited to assist the pre-segmentation.
- **Cross-scale distortion-aware feature assembling mechanism**: Shallow geometric structures and deep semantic features are integrated with awareness of the distortion distribution.
- **Triple attention mechanism**: To compensate for potential errors in the pre-segmentation, triple attention reconstructs the disentangled 1D sequences so that they are more discriminative and informative.

The method is evaluated on four popular benchmarks, and the results show that it outperforms existing state-of-the-art methods, especially on the 3DIoU metric.

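The semantic confusion caused by height compression can be illustrated with a small NumPy sketch. It is not the authors' implementation: mean pooling, the `horizontal_mask` input, and the function names `compress_height` and `disentangled_sequences` are all assumptions made for illustration. The sketch contrasts collapsing a whole feature column into one value with pooling horizontal-plane and vertical-plane pixels separately after a (hypothetical) pre-segmentation.

```python
import numpy as np

def compress_height(features):
    """Collapse a (C, H, W) feature map to a (C, W) sequence by averaging over height."""
    return features.mean(axis=1)

def disentangled_sequences(features, horizontal_mask, eps=1e-6):
    """Pool horizontal-plane and vertical-plane pixels into two (C, W) sequences.

    horizontal_mask is an (H, W) array in [0, 1]: 1 marks floor/ceiling pixels,
    0 marks wall pixels. It stands in for a pre-segmentation output (assumption).
    """
    h = horizontal_mask[None, :, :]   # add a channel axis for broadcasting
    v = 1.0 - h                       # complementary mask for vertical planes
    horizontal_seq = (features * h).sum(axis=1) / (h.sum(axis=1) + eps)
    vertical_seq = (features * v).sum(axis=1) / (v.sum(axis=1) + eps)
    return horizontal_seq, vertical_seq

# Toy input: 1 channel, 4 rows, 2 columns.
# Rows 0 and 3 are ceiling/floor (value 10); rows 1 and 2 are wall (value 2).
features = np.array([[[10.0, 10.0], [2.0, 2.0], [2.0, 2.0], [10.0, 10.0]]])
mask = np.array([[1.0, 1.0], [0.0, 0.0], [0.0, 0.0], [1.0, 1.0]])

mixed = compress_height(features)                        # planes blended together
horizontal_seq, vertical_seq = disentangled_sequences(features, mask)
print(mixed, horizontal_seq, vertical_seq)
```

In this toy input the plain compression blends wall and floor/ceiling responses into a single value per column, while the masked pooling keeps the two plane types in separate sequences.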
### Formula summary

- **Feature aggregation formula**:
  \[ f_{df}=\sum_{q=1}^{H\times W}\sum_{k=1}^{9}\text{Sample}(f,\,p_{q,k}+\Delta p_{q,k}) \]
  where \(p_{q,k}\) is the sampling coordinate and \(\Delta p_{q,k}\) is the learnable offset.
- **Multi-scale feature fusion formula**:
  \[ f'_{m\tilde{s}}=\text{CSDA}(f_{m\tilde{s}})=\sum_{l=1}^{L}A_l\cdot\text{reshape}(f_{m\tilde{s}}) \]
  where \(A_l\) is the self-attention weight matrix.
- **Channel discriminative generation mechanism**:
  \[ f'=L\,f\,W=(I-A)\,f\,W \]
  where \(L=I-A\) is the symmetrically normalized Laplacian matrix and \(A\) is the normalized adjacency matrix.
- **Self-attention formula**:
  \[ \text{Attention}(Q',K',V')=\text{softmax}\left(\frac{Q'(K')^{T}}{\sqrt{d_k}}\right)V' \]
- **Cross-attention formula**:
  \[ \text{Attention}(Q''_h,K''_v,V''_v)=\text{softmax}\left(\frac{Q''_h(K''_v)^{T}}{\sqrt{d_k}}\right)V''_v \]

Together, these formulas let the model capture and use geometric cues while accounting for distortion, which improves indoor layout estimation.
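The self-attention and cross-attention formulas can be sketched with standard scaled dot-product attention, in which the softmax is applied to the scaled scores and the resulting weights then multiply the values. This is a minimal NumPy illustration and not the paper's code: the learned projections that would produce the queries, keys, and values are omitted, and the names `horizontal` and `vertical` are hypothetical stand-ins for the two disentangled 1D sequences.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention: softmax(q k^T / sqrt(d_k)) v."""
    d_k = q.shape[-1]
    weights = softmax(q @ k.T / np.sqrt(d_k), axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, d_k = 8, 4
# Hypothetical sequences for the horizontal-plane and vertical-plane branches.
horizontal = rng.normal(size=(seq_len, d_k))
vertical = rng.normal(size=(seq_len, d_k))

# Self-attention: queries, keys, and values come from the same sequence.
self_out, self_w = attention(horizontal, horizontal, horizontal)

# Cross-attention: queries from one sequence, keys and values from the other.
cross_out, cross_w = attention(horizontal, vertical, vertical)
print(self_out.shape, cross_out.shape)
```

Each row of `self_w` and `cross_w` sums to one, so every output position is a convex combination of the value vectors it attends to.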