Abstract: In 3D point cloud-based object detection, the attention mechanism in Group-Free [1] learns direct relationships between proposals and all seed points, providing each proposal with a global context in the form of a cross-attention map. However, our analysis and experimental comparison show that the attention mechanism assigns inappropriately large attention weights to certain seed points far from a proposal, which hinders correct detection. In this work, we alleviate this problem by proposing a mask method. For an initial proposal, our method first calculates a spatial distance-based mask, which measures the spatial relationship between all seed points and the proposal. We then fuse the mask into the cross-attention layers of the stacked attention modules to obtain a refined cross-attention map. In essence, our mask gives each proposal a local context; after it is fused with the global context given by the attention mechanism, the refined cross-attention map can suppress the negative impact of some distant seed points on a proposal. We present two alternative strategies for computing the mask: a hard mask and a soft mask. Experimental results demonstrate that the soft mask performs better. In the soft mask, for each initial proposal's 3D-box shape, we use a parametric approximate ellipsoid with only two learnable parameters as the basis of the mask's calculation. Experimental results show that our method outperforms Group-Free by 0.7 mAP@0.25 while increasing inference time by less than 1%. On the public SUN RGB-D dataset, our algorithm achieves 63.7 mAP@0.25 and 45.5 mAP@0.5, which is the best performance among algorithms that preserve the irregularity of seed points.
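
To make the idea of a distance-based mask fused into cross-attention concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give the mask formula or the fusion rule, so the ellipsoid scaling `alpha`, the decay rate `beta` (stand-ins for the two learnable parameters), the exponential falloff, and the additive log-mask fusion are all assumptions made for illustration.

```python
import numpy as np

def ellipsoid_distance(centers, sizes, seeds, alpha=1.0):
    """Squared normalized distance of each seed to each proposal's ellipsoid.

    centers: (K, 3) proposal box centers; sizes: (K, 3) full box extents;
    seeds: (N, 3) seed point coordinates. Values <= 1 lie inside the ellipsoid.
    alpha is a hypothetical scale on the semi-axes (assumed learnable).
    """
    semi_axes = alpha * sizes / 2.0                      # (K, 3)
    diff = seeds[None, :, :] - centers[:, None, :]       # (K, N, 3)
    return np.sum((diff / semi_axes[:, None, :]) ** 2, axis=-1)  # (K, N)

def hard_mask(d2):
    """1 for seeds inside the ellipsoid, 0 otherwise."""
    return (d2 <= 1.0).astype(np.float64)

def soft_mask(d2, beta=1.0):
    """1 inside the ellipsoid, decaying smoothly outside it.

    beta is a hypothetical decay rate (assumed learnable).
    """
    return np.exp(-beta * np.maximum(d2 - 1.0, 0.0))

def masked_cross_attention(q, k, v, mask, eps=1e-9):
    """Cross-attention whose logits are shifted by log(mask) (assumed fusion).

    q: (K, D) proposal queries; k, v: (N, D) seed keys and values;
    mask: (K, N) values in [0, 1]. Returns (output, attention weights).
    """
    logits = q @ k.T / np.sqrt(q.shape[-1])
    logits = logits + np.log(mask + eps)                 # suppress distant seeds
    logits = logits - logits.max(axis=-1, keepdims=True) # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights
```

Under these assumptions, multiplying the softmax numerator by the mask (equivalently, adding its logarithm to the logits) keeps the attention weights normalized while shrinking the contribution of seed points far outside the proposal's ellipsoid.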

Efficiently Detecting Plausible Locations for Object Placement Using Masked Convolutions

TopNet: Transformer-based Object Placement Network for Image Compositing

Image Synthesis from Layout with Locality-Aware Mask Adaption

Constrained Online Cut-Paste for Object Detection

An Efficient Ungrouped Mask Method with Two Learnable Parameters for 3D Object Detection

Exploiting Weak Mask Representation with Convolutional Neural Networks for Accurate Object Tracking

Convolutional Feature Masking for Joint Object and Stuff Segmentation

Mask Guided Gated Convolution for Amodal Content Completion

Mask6D: Masked Pose Priors for 6D Object Pose Estimation

BB8: A Scalable, Accurate, Robust to Partial Occlusion Method for Predicting the 3D Poses of Challenging Objects without Using Depth

MaskVD: Region Masking for Efficient Video Object Detection

Occlusion-Aware Object Localization, Segmentation and Pose Estimation

BlendMask: Top-Down Meets Bottom-Up for Instance Segmentation

Cascaded Sparse Spatial Bins for Efficient and Effective Generic Object Detection

MaskLab: Instance Segmentation by Refining Object Detection with Semantic and Direction Features

Boosting Convolutional Features for Robust Object Proposals

Stochastic positional embeddings improve masked image modeling

MaskInversion: Localized Embeddings via Optimization of Explainability Maps

CenterMask: Real-Time Anchor-Free Instance Segmentation

Masked Multi-Query Slot Attention for Unsupervised Object Discovery

Efficient Center Voting for Object Detection and 6D Pose Estimation in 3D Point Cloud