Temporally Propagated Masks and Bounding Boxes: Combining the Best of Both Worlds for Multi-Object Tracking

Tomasz Stanczyk, Francois Bremond
2024-11-23
Abstract: Multi-object tracking (MOT) involves identifying and consistently tracking objects across video sequences. Traditional tracking-by-detection methods, while effective, often require extensive tuning and lack generalizability. On the other hand, segmentation mask-based methods are more generic but struggle with tracking management, making them unsuitable for MOT. We propose a novel approach, McByte, which incorporates a temporally propagated segmentation mask as a strong association cue within a tracking-by-detection framework. By combining bounding box and propagated mask information, McByte enhances robustness and generalizability without per-sequence tuning. Evaluated on four benchmark datasets - DanceTrack, MOT17, SoccerNet-tracking 2022, and KITTI-tracking - McByte demonstrates performance gain in all cases examined. At the same time, it outperforms existing mask-based methods. Implementation code will be provided upon acceptance.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses several key challenges in multi-object tracking (MOT):

1. **Limitations of existing tracking methods**:
   - **Tracking-by-detection methods**: Although effective, these methods usually require extensive hyperparameter tuning for each dataset or sequence, which limits their generalizability.
   - **Segmentation mask-based methods**: These methods are more generic but lack robustness when handling multiple objects, and they have difficulty detecting objects that newly enter the scene. They also rely entirely on mask prediction to determine object positions, so performance degrades when the prediction is inaccurate.
2. **Combining the advantages of both approaches**:
   - The paper proposes McByte, which incorporates temporally propagated segmentation masks as a strong association cue within a tracking-by-detection framework. By combining bounding box and propagated mask information, McByte improves tracking robustness and generalizability without per-sequence tuning.
3. **Specific problems**:
   - **Long-term occlusion**: In crowded scenes, objects may be partially occluded, which makes it difficult for traditional tracking-by-detection methods to maintain consistent tracks.
   - **Generalizability**: Existing methods perform inconsistently across datasets and require dataset-specific adjustments.

### Main contributions of McByte

1. **Evaluating existing mask-based methods**: Demonstrates the shortcomings of these methods on MOT tasks (see Section 4.5).
2. **Proposing a novel use of temporally propagated masks**: For the first time, applies temporally propagated segmentation masks as an association cue in MOT (see Section 3.3).
3. **Designing an improved MOT algorithm**: Addresses the limitations of existing methods by combining temporally propagated masks with bounding box information (see Sections 3.1 and 3.2).
4. **Experimental verification**: Evaluates McByte on four different MOT datasets, showing that it outperforms existing tracking-by-detection methods without per-sequence tuning (see Section 4.4).

### Formula summary

The formulas in the paper quantify how well a propagated mask matches a bounding box. Here \( \text{pix}(\cdot) \) denotes a set of pixels, \( i \) indexes a tracklet and its propagated mask, and \( j \) indexes a detected bounding box.

- **Mask coverage ratio (Mask Match No. 1, \( \text{mm}^{(1)} \))**: the fraction of the mask that lies inside the bounding box.
  \[ \text{mm}_{i,j}^{(1)}=\frac{\vert \text{pix}(\text{mask}_i)\cap \text{pix}(\text{bbox}_j)\vert}{\vert \text{pix}(\text{mask}_i)\vert} \]
- **Bounding box filling ratio (Mask Match No. 2, \( \text{mm}^{(2)} \))**: the fraction of the bounding box that is covered by the mask.
  \[ \text{mm}_{i,j}^{(2)}=\frac{\vert \text{pix}(\text{mask}_i)\cap \text{pix}(\text{bbox}_j)\vert}{\vert \text{pix}(\text{bbox}_j)\vert} \]
- **Updating the cost matrix**:
  \[ \text{cost}_{i,j}=\text{cost}_{i,j}-\text{mm}_{i,j}^{(2)} \]

Subtracting \( \text{mm}^{(2)} \) lowers the association cost of tracklet-detection pairs whose mask fills the bounding box well, so those pairs are favored during matching.
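
To make the three formulas concrete, here is a minimal NumPy sketch of how they could be computed. It is not the authors' implementation: the function names `mask_match_ratios` and `update_cost`, the boolean-mask representation, and the `(x1, y1, x2, y2)` box convention with exclusive end coordinates are all assumptions made for illustration. The summary does not say whether the paper gates the cost update on further conditions, so the sketch applies it to every pair unconditionally.

```python
import numpy as np


def mask_match_ratios(mask, bbox):
    """Return (mm1, mm2) for one boolean mask and one box (x1, y1, x2, y2).

    Assumed convention: integer pixel coordinates, x2 and y2 exclusive.
    """
    x1, y1, x2, y2 = bbox
    h, w = mask.shape
    # Clip the box to the image so the slice and the box area agree.
    x1, x2 = max(0, x1), min(w, x2)
    y1, y2 = max(0, y1), min(h, y2)
    box_area = max(0, x2 - x1) * max(0, y2 - y1)
    mask_area = int(mask.sum())
    if box_area == 0 or mask_area == 0:
        return 0.0, 0.0
    # |pix(mask_i) ∩ pix(bbox_j)|: mask pixels that fall inside the box.
    inter = int(mask[y1:y2, x1:x2].sum())
    mm1 = inter / mask_area  # mask coverage ratio
    mm2 = inter / box_area   # bounding box filling ratio
    return mm1, mm2


def update_cost(cost, masks, bboxes):
    """Return a copy of cost with mm2 subtracted from every (i, j) entry.

    cost has shape (len(masks), len(bboxes)); rows are tracklets,
    columns are detections.
    """
    updated = np.asarray(cost, dtype=float).copy()
    for i, mask in enumerate(masks):
        for j, bbox in enumerate(bboxes):
            _, mm2 = mask_match_ratios(mask, bbox)
            updated[i, j] -= mm2
    return updated


if __name__ == "__main__":
    # One tracklet whose propagated mask is a 4x4 square of pixels.
    mask = np.zeros((10, 10), dtype=bool)
    mask[2:6, 2:6] = True
    boxes = [(2, 2, 6, 6), (4, 4, 8, 8)]  # exact fit, partial overlap
    print(mask_match_ratios(mask, boxes[0]))  # (1.0, 1.0)
    print(mask_match_ratios(mask, boxes[1]))  # (0.25, 0.25)
    print(update_cost([[0.5, 0.5]], [mask], boxes))  # entries -0.5 and 0.25
```

In the demo, the box that exactly encloses the mask receives the larger cost reduction, so an assignment step run on the updated matrix would prefer it over the partially overlapping box.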