Abstract: Training data is a critical requirement for machine learning tasks, and labeled training data can be expensive to acquire, often requiring manual or semi-automated data collection pipelines. For tracking applications, the data collection involves drawing bounding boxes around the classes of interest on each frame, and associating detections of the same "instance" over frames. In a semi-automated data collection pipeline, this can be achieved by running a baseline detection and tracking algorithm, and relying on manual correction to add/remove/change bounding boxes on each frame, as well as resolving errors in the associations over frames (track switches). In this paper, we propose a data correction pipeline to generate ground-truth data more efficiently in this semi-automated scenario. Our method simplifies the trajectories from the tracking systems and lets the annotator verify and correct the objects in the sampled keyframes. Once the objects in the keyframes are corrected, the bounding boxes in the other frames are obtained by interpolation. Our method achieves a substantial reduction in the number of frames requiring manual correction. On the MOT dataset, it reduces the number of frames by 30x while maintaining a HOTA score of 89.61%. Moreover, it reduces the number of frames by a factor of 10x while achieving a HOTA score of 79.24% on the SoccerNet dataset, and 85.79% on the DanceTrack dataset. The project code and data are publicly released at https://github.com/foreverYoungGitHub/trajectory-simplify-benchmark.
What problem does this paper attempt to address?
The paper addresses how to generate high-quality labeled data for video object tracking efficiently. Specifically, it focuses on reducing the number of frames that require manual correction in a semi-automatic data collection process by simplifying trajectories, while keeping tracking quality close to that of fully manual correction. The key question is how to select the keyframes for manual correction so that, once those keyframes are corrected, the trajectory over the whole sequence can be recovered by interpolation, thereby significantly reducing annotation workload and cost.
### Background and Problem Description of the Paper
In machine learning tasks, training data is crucial, and labeled training data is often expensive to obtain, usually requiring manual or semi-automatic data collection pipelines. For tracking applications, data collection involves drawing bounding boxes around the classes of interest in each frame and associating detections of the same "instance" across frames. In a semi-automatic data collection pipeline, this can be achieved by running baseline detection and tracking algorithms, and then relying on manual correction to add, delete, or change the bounding boxes in each frame, as well as to resolve inter-frame association errors (track switches). The paper proposes a data correction pipeline that aims to generate ground-truth data more efficiently in this semi-automatic scenario.
### Solution
The method proposed in the paper simplifies the trajectories generated by the tracking system, so that annotators only need to verify and correct objects on the sampled keyframes. Once the objects on the keyframes are corrected, the bounding boxes in the other frames are obtained by interpolation. This significantly reduces the number of frames requiring manual correction. On the MOT dataset, the method reduces the number of frames requiring manual correction by a factor of 30 while maintaining a HOTA score of 89.61%. On the SoccerNet and DanceTrack datasets, it reduces the number by a factor of 10 with HOTA scores of 79.24% and 85.79%, respectively.
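The interpolation step can be illustrated with a minimal sketch. It assumes linear interpolation and an `(x, y, w, h)` box layout; neither is confirmed by the text above, and the function name `interpolate_boxes` is hypothetical rather than taken from the project code.

```python
from typing import Dict, Tuple

# Assumed box layout: (x, y, width, height). The paper may use another one.
Box = Tuple[float, float, float, float]


def interpolate_boxes(keyframes: Dict[int, Box]) -> Dict[int, Box]:
    """Fill in every frame between consecutive keyframes.

    Linear interpolation is an assumption made for this sketch.
    """
    frames = sorted(keyframes)
    # Start with the corrected keyframes themselves.
    track = {f: keyframes[f] for f in frames}
    for start, end in zip(frames, frames[1:]):
        b0, b1 = keyframes[start], keyframes[end]
        for f in range(start + 1, end):
            t = (f - start) / (end - start)  # fraction of the way to `end`
            track[f] = tuple(a + t * (b - a) for a, b in zip(b0, b1))
    return track
```

With keyframes at frames 0 and 4, the annotator corrects two boxes and the three frames in between are filled in automatically, which is where the reduction in manual work comes from.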
### Method Details
1. **Initializing the Search Space**: First, the predicted bounding boxes are separated into high-quality boxes and low-quality outliers according to their confidence scores. This step then uses a method similar to the Douglas-Peucker algorithm, selecting keyframes by calculating the maximum error of each anchor segment.
2. **Minimizing the Integral Error**: To further optimize trajectory simplification, the paper constructs a directed acyclic graph (DAG) and selects the optimal simplified trajectory by minimizing the global integral error. Specifically, each node in the DAG stores the integral error from the root node to the current node, and the best parent node is selected by dynamic programming to minimize the integral error.
3. **Scale-Invariant Error Metric**: The paper proposes an error metric based on the synchronous IoU distance, which takes into account the scale changes of bounding boxes and combines confidence scores in a weighted manner, making the simplified trajectory more robust and reducing the need for manual correction.
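The three steps above can be sketched in a few functions. This is an illustration under stated assumptions, not the authors' implementation: boxes are `(x1, y1, x2, y2)`, interpolation is linear, the per-frame error is taken to be `conf * (1 - IoU)` between the tracked box and the box interpolated at the same frame, and the names `simplify`, `min_integral_error`, the threshold `eps`, and the keyframe budget `k` are all hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def frame_error(boxes: List[Box], conf: List[float], lo: int, hi: int, i: int) -> float:
    """Confidence-weighted IoU distance at frame i for the segment lo..hi.

    Multiplying by the confidence is an assumed weighting scheme.
    """
    t = (i - lo) / (hi - lo)
    interp = tuple(p + t * (q - p) for p, q in zip(boxes[lo], boxes[hi]))
    return conf[i] * (1.0 - iou(boxes[i], interp))


def simplify(boxes: List[Box], conf: List[float], eps: float) -> List[int]:
    """Douglas-Peucker-style split: add the worst frame until all errors <= eps."""
    def split(lo: int, hi: int) -> List[int]:
        worst, worst_err = None, eps
        for i in range(lo + 1, hi):
            err = frame_error(boxes, conf, lo, hi, i)
            if err > worst_err:
                worst, worst_err = i, err
        if worst is None:
            return []
        return split(lo, worst) + [worst] + split(worst, hi)

    if len(boxes) < 2:
        return list(range(len(boxes)))
    return [0] + split(0, len(boxes) - 1) + [len(boxes) - 1]


def min_integral_error(boxes: List[Box], conf: List[float], k: int) -> List[int]:
    """Pick k keyframes (both endpoints included) minimizing the summed error.

    Dynamic programming over a DAG whose edge (i, j) costs the total error of
    the frames strictly between keyframes i and j.
    """
    n = len(boxes)
    if not 2 <= k <= n:
        raise ValueError("k must satisfy 2 <= k <= len(boxes)")
    inf = float("inf")
    cost = [[inf] * n for _ in range(k)]   # cost[m][j]: best error ending at j with m+1 keyframes
    parent = [[-1] * n for _ in range(k)]  # best predecessor keyframe
    cost[0][0] = 0.0
    for m in range(1, k):
        for j in range(1, n):
            for i in range(j):
                if cost[m - 1][i] == inf:
                    continue
                seg = sum(frame_error(boxes, conf, i, j, f) for f in range(i + 1, j))
                if cost[m - 1][i] + seg < cost[m][j]:
                    cost[m][j], parent[m][j] = cost[m - 1][i] + seg, i
    path, j = [], n - 1
    for m in range(k - 1, -1, -1):  # walk parents back to frame 0
        path.append(j)
        j = parent[m][j]
    return path[::-1]
```

Because IoU is a ratio of areas, scaling both boxes by the same factor leaves the error unchanged, which is the sense in which such a metric is scale-invariant. The greedy `simplify` is recursive and the dynamic program is cubic in track length as written, so both are only suitable for short tracks without further optimization.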
### Experimental Results
The paper reports experiments on three datasets: MOT20, SoccerNet, and DanceTrack. The results show that the method maintains relatively high tracking quality even at high compression rates. On the DanceTrack dataset in particular, all trajectory simplification methods outperform uniform sampling at every compression rate, because DanceTrack contains a large amount of non-linear motion, and the key change points of such motion are captured more easily by trajectory simplification methods.
### Conclusion
The paper proposes a scale-invariant trajectory simplification method that significantly reduces the workload of manual correction in video object tracking while maintaining relatively high tracking quality. On multiple datasets, the method performs better than existing trajectory simplification methods, and it has broad application prospects.