Abstract: The segmentation-based tracking paradigm has been successfully applied to visual tracking and significantly improves tracking performance. Although segmentation-based trackers estimate target scale effectively, the need for pixel-level segmentation places high demands on the quality of the extracted target features. In this article, we therefore propose a novel multi-feature representation and aggregation network and introduce it into the tracking-by-segmentation framework to extract and integrate rich features for segmentation-based tracking. Specifically, the proposed approach first models three complementary feature representations through cross-attention, cross-correlation, and dilated involution mechanisms, respectively, and employs a simple feature aggregation network to fuse them. The fused features are then fed into a segmentation network to obtain an accurate estimate of the target state. In addition, we introduce a bounding box refinement module that further refines the target box to alleviate the effects of partial occlusion and surrounding distractors. Extensive experimental results show that the proposed tracker achieves very promising tracking performance on seven challenging visual tracking benchmarks. Code and models are available at https://github.com/Yang428/FEAST.
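
As a rough illustration of two of the mechanisms named above and of the fusion step, the following NumPy sketch computes a depth-wise cross-correlation and a single-head cross-attention between template and search features, then fuses them with a 1x1 linear projection. It is not the released FEAST implementation: the function names, the `(channels, height, width)` layout, the odd template size, the use of template features as both keys and values, and the concatenate-then-project fusion are all assumptions made for illustration, and the dilated involution branch is omitted.

```python
import numpy as np


def cross_correlation(template, search):
    """Depth-wise cross-correlation with zero padding ("same" output size).

    template: (c, th, tw) with odd th and tw (assumption of this sketch)
    search:   (c, sh, sw)
    returns:  (c, sh, sw) response map, one channel per feature channel
    """
    c, th, tw = template.shape
    _, sh, sw = search.shape
    ph, pw = th // 2, tw // 2
    padded = np.pad(search, ((0, 0), (ph, ph), (pw, pw)))
    out = np.zeros((c, sh, sw))
    for i in range(sh):
        for j in range(sw):
            window = padded[:, i:i + th, j:j + tw]
            # Each channel of the template is matched against the same
            # channel of the search window.
            out[:, i, j] = (window * template).sum(axis=(1, 2))
    return out


def cross_attention(template, search):
    """Single-head cross-attention without learned projections.

    Search positions act as queries; template positions act as keys and
    values. Returns an array with the same shape as `search`.
    """
    c = template.shape[0]
    q = search.reshape(c, -1).T          # (Ns, c)
    k = template.reshape(c, -1).T        # (Nt, c)
    scores = q @ k.T / np.sqrt(c)        # (Ns, Nt)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    attended = weights @ k               # (Ns, c)
    return attended.T.reshape(search.shape)


def aggregate(features, weight):
    """Concatenate feature maps along channels, then apply a 1x1 projection.

    features: list of (c_i, h, w) arrays with matching h and w
    weight:   (c_out, sum of c_i) projection matrix (hypothetical parameter)
    """
    stacked = np.concatenate(features, axis=0)
    return np.einsum("oc,chw->ohw", weight, stacked)
```

In a trained tracker the projection weights would be learned and the fused map would be passed on to the segmentation network; here `weight` is simply supplied by the caller.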

Multi Feature Representation and Aggregation Network for Accurate and Robust Visual Tracking