Abstract: Instance segmentation is a challenging task in computer vision because it requires both distinguishing individual objects and making dense, pixel-level predictions. Segmentation models built on complex designs and large parameter counts have achieved remarkable accuracy, but from a practical standpoint a balance between accuracy and speed is more desirable. To address this need, this paper presents ESAMask, a real-time segmentation model fused with efficient sparse attention that adheres to the principles of lightweight design and efficiency. We make three key contributions. First, we introduce a dynamic and sparse Related Semantic Perceived Attention mechanism (RSPA) that adaptively perceives the different semantic information of various targets during feature extraction. RSPA uses an adjacency matrix to search for regions with high semantic correlation to the same target, which reduces computational cost. Second, we design the GSInvSAM structure to reduce redundant calculations on concatenated features while enhancing interaction between channels when merging feature layers of different scales. Third, we introduce the Mixed Receptive Field Context Perception Module (MRFCPM) in the prototype branch so that targets of different scales can capture the feature representation of the corresponding area during mask generation. MRFCPM fuses information from three branches (global content awareness, large-kernel region awareness, and convolutional channel attention) to explicitly model features at different scales. In extensive experiments, ESAMask achieves a mask AP of 45.4 at 45.2 FPS on the COCO dataset, surpassing current instance segmentation methods in the accuracy–speed trade-off. In addition, the visualized segmentation outputs show high-quality results for objects of various classes and scales.
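
The abstract does not spell out how RSPA builds or uses its adjacency matrix, so the following is only a minimal NumPy sketch of one plausible reading: a region-to-region affinity matrix selects, for each region, the few regions most related to it, and token-level attention is then restricted to those regions. The function name `region_sparse_attention`, the mean-pooled region descriptors, the `topk` selection rule, and the single-head layout are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def region_sparse_attention(x, region, topk, wq, wk, wv):
    """Hypothetical region-routed sparse attention (illustrative only).

    x:      (H, W, C) feature map; H and W must be divisible by `region`.
    region: side length of each square region, in pixels.
    topk:   number of regions each region is allowed to attend to.
    wq, wk, wv: (C, C) projection matrices.
    """
    H, W, C = x.shape
    rh, rw = H // region, W // region
    R, T = rh * rw, region * region          # regions, tokens per region

    # Partition the map into regions: (R, T, C).
    t = x.reshape(rh, region, rw, region, C).transpose(0, 2, 1, 3, 4)
    t = t.reshape(R, T, C)
    q, k, v = t @ wq, t @ wk, t @ wv

    # Assumed adjacency matrix: affinity between mean-pooled region descriptors.
    adjacency = q.mean(axis=1) @ k.mean(axis=1).T        # (R, R)
    idx = np.argsort(-adjacency, axis=1)[:, :topk]       # (R, topk)

    # Gather keys and values only from the selected regions.
    k_g = k[idx].reshape(R, topk * T, C)
    v_g = v[idx].reshape(R, topk * T, C)

    # Token-level attention restricted to the gathered tokens.
    attn = softmax(q @ k_g.transpose(0, 2, 1) / np.sqrt(C), axis=-1)
    out = attn @ v_g                                     # (R, T, C)

    # Merge regions back into an (H, W, C) map.
    out = out.reshape(rh, rw, region, region, C).transpose(0, 2, 1, 3, 4)
    return out.reshape(H, W, C), idx

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 8, 4))
eye = np.eye(4)
out, routed = region_sparse_attention(feat, region=2, topk=2, wq=eye, wk=eye, wv=eye)
print(out.shape, routed.shape)
```

Under these assumptions each query token scores `topk * T` keys instead of all `R * T`, which is where the saving over dense attention would come from. Setting `topk` to the total number of regions recovers ordinary dense attention, which gives a convenient sanity check.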

SipMaskv2: Enhanced Fast Image and Video Instance Segmentation

SipMask: Spatial Information Preservation for Fast Image and Video Instance Segmentation

Delving Deeper into Mask Utilization in Video Object Segmentation

Fast Real-Time Video Object Segmentation with a Tangled Memory Network

MSN: Efficient Online Mask Selection Network for Video Instance Segmentation

Spatial Feature Calibration and Temporal Fusion for Effective One-stage Video Instance Segmentation

Mask Propagation for Efficient Video Semantic Segmentation

ESAMask: Real-Time Instance Segmentation Fused with Efficient Sparse Attention

RefineMask: Towards High-Quality Instance Segmentation with Fine-Grained Features

SiamMask: A Framework for Fast Online Object Tracking and Segmentation

STC: Spatio-Temporal Contrastive Learning for Video Instance Segmentation

FocalClick: Towards Practical Interactive Image Segmentation

PM-VIS: High-Performance Box-Supervised Video Instance Segmentation

Fast Online Object Tracking and Segmentation: A Unifying Approach

SOLOv2: Dynamic and Fast Instance Segmentation

EmbedMask: Embedding Coupling for Instance Segmentation

InstMove: Instance Motion for Object-centric Video Segmentation

CenterMask: Real-Time Anchor-Free Instance Segmentation

Spherical Mask: Coarse-to-Fine 3D Point Cloud Instance Segmentation with Spherical Representation

SG-Net: Spatial Granularity Network for One-Stage Video Instance Segmentation

Occluded Video Instance Segmentation: A Benchmark