Abstract: 360 cameras capture the entire surrounding environment with a large FoV, providing comprehensive visual information from which 3D structure (e.g., depth and surface normals) and semantic information can be inferred simultaneously. Existing works predominantly specialize in a single task, leaving multi-task learning of 3D geometry and semantics largely unexplored. Achieving this objective is challenging, however, for two reasons: 1) the inherent spherical distortion of the planar equirectangular projection (ERP) and the insufficient global perception induced by the ultra-wide FoV of 360 images; 2) the difficulty of effectively merging geometry and semantics across tasks so that they benefit one another. In this paper, we propose a novel end-to-end multi-task learning framework, named Elite360M, capable of simultaneously inferring 3D structure via depth and surface normal estimation and semantics via semantic segmentation. Our key idea is to build a representation with strong global perception and less distortion while exploring the inter- and cross-task relationships between geometry and semantics. We incorporate the distortion-free and spatially continuous icosahedron projection (ICOSAP) points and combine them with ERP to enhance global perception. A Bi-projection Bi-attention Fusion module is designed to capture, at negligible cost, the semantic- and distance-aware dependencies between each pixel of the region-aware ERP feature and the ICOSAP point feature set. Moreover, we propose a novel Cross-task Collaboration module that explicitly extracts task-specific geometric and semantic information from the learned representation to produce preliminary predictions, and then integrates the spatial contextual information among tasks to realize cross-task fusion. Extensive experiments demonstrate the effectiveness and efficacy of Elite360M.
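
The ERP distortion and the idea of distance-aware weighting between ERP pixels and points on the sphere can be illustrated with a minimal sketch. This is not the Elite360M implementation: the function names, the toy point set, and the `temperature` parameter are assumptions made for illustration, and the softmax over great-circle distances is only a simple stand-in for a learned distance-aware attention.

```python
import math

def erp_pixel_to_unit_vector(u, v, width, height):
    """Map the centre of ERP pixel (u, v) to a unit vector and its latitude."""
    lon = ((u + 0.5) / width - 0.5) * 2.0 * math.pi   # longitude in [-pi, pi)
    lat = (0.5 - (v + 0.5) / height) * math.pi        # latitude in (-pi/2, pi/2)
    x = math.cos(lat) * math.sin(lon)
    y = math.sin(lat)
    z = math.cos(lat) * math.cos(lon)
    return (x, y, z), lat

def erp_area_weight(lat):
    """Relative solid angle covered by an ERP pixel at latitude lat (1 at the equator)."""
    return math.cos(lat)

def distance_aware_weights(pixel_dir, points, temperature=0.5):
    """Softmax over negative great-circle distances from a pixel direction to points.

    Hypothetical helper: a hand-written stand-in, not a learned attention module.
    """
    dists = []
    for p in points:
        dot = sum(a * b for a, b in zip(pixel_dir, p))
        dists.append(math.acos(max(-1.0, min(1.0, dot))))
    logits = [-d / temperature for d in dists]
    peak = max(logits)                                 # subtract max for stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy usage: a top-row pixel of an 8x4 ERP image and three stand-in sphere points.
width, height = 8, 4
direction, lat = erp_pixel_to_unit_vector(0, 0, width, height)
points = [(0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (0.0, -1.0, 0.0)]  # assumed, not ICOSAP vertices
weights = distance_aware_weights(direction, points)
print(erp_area_weight(lat), weights)
```

Pixels near the poles receive an area weight well below 1, which is the stretching that ERP introduces, while the weights favour the sphere points that lie closest to the pixel's viewing direction.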

Bi-projection for 360° Image Object Detection Bridged by RoI Searcher

Reprojection R-CNN: A Fast and Accurate Object Detector for 360° Images

Multi-Projection Fusion and Refinement Network for Salient Object Detection in 360° Omnidirectional Image

Spherical Criteria for Fast and Accurate 360° Object Detection

Omnidirectional Image Super-resolution Via Bi-projection Fusion

Elite360D: Towards Efficient 360 Depth Estimation via Semantic- and Distance-Aware Bi-Projection Fusion

Real-Time Object Detection for 360-Degree Panoramic Image Using CNN

3M3D: Multi-view, Multi-path, Multi-representation for 3D Object Detection

Complementary Bi-directional Feature Compression for Indoor 360° Semantic Segmentation with Self-distillation

Bidirectional Projection Network for Cross Dimension Scene Understanding

Elite360M: Efficient 360 Multi-task Learning via Bi-projection Fusion and Cross-task Collaboration

CRF360D: Monocular 360 Depth Estimation via Spherical Fully-Connected CRFs

A Versatile Multi-View Framework for LiDAR-based 3D Object Detection with Guidance from Panoptic Segmentation

Panoramic visual system for spherical mobile robots

2D-to-3D Projection for Monocular and Multi-View 3D Multi-class Object Detection in Indoor Scenes

3D Object Detector: A Multiscale Region Proposal Network Based on Autonomous Driving

BiFuse++: Self-Supervised and Efficient Bi-Projection Fusion for 360° Depth Estimation

Afnet: Asymmetric Fusion Network for Monocular Panorama Depth Estimation

360Recon: An Accurate Reconstruction Method Based on Depth Fusion from 360 Images

From Multi-View to Hollow-3D: Hallucinated Hollow-3D R-CNN for 3D Object Detection