MOT FCG++: Enhanced Representation of Spatio-temporal Motion and Appearance Features

Yanzhao Fang
2024-11-21
Abstract: The goal of multi-object tracking (MOT) is to detect and track all objects in a scene across frames while maintaining a unique identity for each object. Most existing methods rely on the spatial-temporal motion features and appearance embedding features of the detected objects in consecutive frames. Effectively and robustly representing the spatial and appearance features of long trajectories has become a critical factor affecting the performance of MOT. We propose a novel approach for appearance and spatial-temporal motion feature representation, improving upon the hierarchical clustering association method MOT FCG. For spatial-temporal motion features, we first propose Diagonal Modulated GIoU, which more accurately represents the relationship between the position and shape of the objects. Second, we propose Mean Constant Velocity Modeling to reduce the effect of observation noise on target motion state estimation. For appearance features, we utilize a dynamic appearance representation that incorporates confidence information, making the trajectory appearance features more robust and global. Building on the baseline model MOT FCG, we obtain further performance improvements overall: we achieve 63.1 HOTA, 76.9 MOTA and 78.2 IDF1 on the MOT17 test set, and competitive performance on the MOT20 and DanceTrack sets.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses several key problems in multi-object tracking (MOT):

1. **Representation of spatio-temporal motion features and appearance features**:
   - Existing methods have limitations in representing spatio-temporal motion features and appearance features over long trajectories. For example, IoU (Intersection over Union) describes the spatial position relationship only coarsely and does not reflect object shape well.
   - Using the median element of a trajectory as its appearance representation tends to introduce low-quality appearance features in long trajectories, which causes errors in trajectory association.
2. **Influence of observation noise on target motion state estimation**:
   - Existing methods do not adequately handle observation noise when estimating the target motion state. Over long time series, the noise accumulates and degrades tracking accuracy.
3. **Improving multi-object tracking performance**:
   - The paper aims to improve overall tracking performance, especially robustness and accuracy in complex scenarios, by improving how spatio-temporal motion features and appearance features are represented.

To this end, the author proposes the following improvements:

1. **Diagonal Modulated GIoU**:
   \[ d_{\mathrm{DGIoU}} = 1-\frac{L_2}{L_1}\cdot \mathrm{GIoU}, \]
   \[ \lambda_C=\min\left(1,\frac{d_{\mathrm{DGIoU}}}{2}+\mathrm{off}\right), \]
   The GIoU term follows the standard definition \(\mathrm{GIoU} = \mathrm{IoU} - \frac{|C| - |A\cup B|}{|C|}\), where \(A\) and \(B\) are the bounding boxes of the two objects, \(C\) is the smallest rectangle containing \(A\) and \(B\), and \(|\cdot|\) denotes area. This distance represents the positional relationship between objects more precisely than IoU and partially reflects the shape of the object bounding boxes.
2. **Dynamic Appearance**:
   \[ e_t=\beta_t e_{t - 1}+(1-\beta_t)e_{new}, \]
   \[ \beta_t=\beta_f+(1-\beta_f)\left(1-\frac{s_{det}-\sigma}{1-\sigma}\right), \]
   where \(s_{det}\) is the detection confidence and \(\sigma\) is the confidence threshold. The weight \(\beta_t\) adapts to detection quality: a detection at the threshold (\(s_{det}=\sigma\)) gives \(\beta_t=1\) and leaves the trajectory embedding unchanged, while a fully confident detection (\(s_{det}=1\)) gives \(\beta_t=\beta_f\). The trajectory embedding therefore represents global appearance better and is less affected by low-quality features caused by occlusion or background interference.
3. **Mean Constant Velocity Modeling**:
   \[ v=\frac{x_t - x_{t - N}}{N}, \]
   \[ x_{t + p}=x_t+v\times p, \]
   Estimating the future position from the velocity averaged over the past \(N\) frames reduces the influence of observation noise on the velocity calculation and stabilizes the prediction.

With these improvements, MOT FCG++ improves on the MOT FCG baseline, reaching 63.1 HOTA, 76.9 MOTA and 78.2 IDF1 on the MOT17 test set and competitive results on MOT20 and DanceTrack.
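
The Diagonal Modulated GIoU formulas can be illustrated with a minimal Python sketch. It rests on several assumptions: boxes are given as `(x1, y1, x2, y2)` with non-zero area, GIoU follows its standard definition, and the ratio \(L_2/L_1\) is passed in by the caller as `diag_ratio` because the summary does not define \(L_1\) and \(L_2\). The function names `giou`, `dgiou_distance`, and `lambda_c` are hypothetical and are not taken from the paper's code.

```python
def giou(a, b):
    """Standard GIoU of two boxes (x1, y1, x2, y2); value lies in [-1, 1]."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)

    # Intersection rectangle (zero if the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = area_a + area_b - inter

    # Smallest rectangle C enclosing both boxes.
    area_c = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))

    return inter / union - (area_c - union) / area_c


def dgiou_distance(a, b, diag_ratio):
    """d_DGIoU = 1 - (L2 / L1) * GIoU.

    diag_ratio stands in for L2 / L1 (assumption: supplied by the caller,
    since the summary does not say how L1 and L2 are measured).
    """
    return 1.0 - diag_ratio * giou(a, b)


def lambda_c(d_dgiou, off):
    """lambda_C = min(1, d_DGIoU / 2 + off), with `off` an offset parameter."""
    return min(1.0, d_dgiou / 2.0 + off)
```

With `diag_ratio = 1.0`, two identical boxes give a distance of 0, and two disjoint boxes give a distance above 1, so `lambda_c` saturates at 1 for a sufficiently large offset.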
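
The dynamic appearance update (\(e_t\), \(\beta_t\)) can be sketched directly from its two formulas. This is an illustrative sketch only: embeddings are plain Python lists, the function names are hypothetical, and any re-normalization of the embedding after the update is left out because the summary does not mention it.

```python
def dynamic_beta(s_det, sigma, beta_f):
    """beta_t = beta_f + (1 - beta_f) * (1 - (s_det - sigma) / (1 - sigma)).

    s_det:  detection confidence, assumed to satisfy sigma <= s_det <= 1
    sigma:  confidence threshold (must be < 1)
    beta_f: base smoothing factor, reached when s_det == 1
    """
    return beta_f + (1.0 - beta_f) * (1.0 - (s_det - sigma) / (1.0 - sigma))


def update_embedding(e_prev, e_new, s_det, sigma, beta_f):
    """e_t = beta_t * e_{t-1} + (1 - beta_t) * e_new, element-wise."""
    beta = dynamic_beta(s_det, sigma, beta_f)
    return [beta * p + (1.0 - beta) * n for p, n in zip(e_prev, e_new)]
```

A detection whose confidence equals the threshold yields `beta = 1`, so the trajectory embedding is kept as is; higher confidence moves `beta` toward `beta_f` and lets the new embedding contribute more.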
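
The velocity formulas \(v=(x_t-x_{t-N})/N\) and \(x_{t+p}=x_t+v\times p\) can be sketched as follows. The sketch assumes a trajectory is stored as a list of per-frame state vectors with the most recent state last and no missing frames; the function names are hypothetical. Note that \((x_t-x_{t-N})/N\) equals the mean of the \(N\) per-frame displacements, because the intermediate positions cancel in the sum.

```python
def mean_velocity(history, n):
    """v = (x_t - x_{t-N}) / N.

    history: list of state vectors, one per frame, most recent last.
    n:       window length N; history must hold at least n + 1 states.
    """
    if n < 1 or len(history) < n + 1:
        raise ValueError("need at least n + 1 states and n >= 1")
    x_t = history[-1]
    x_old = history[-1 - n]
    return [(a - b) / n for a, b in zip(x_t, x_old)]


def predict(history, n, p):
    """x_{t+p} = x_t + v * p, with v averaged over the last n frames."""
    v = mean_velocity(history, n)
    return [x + vi * p for x, vi in zip(history[-1], v)]
```

For example, `predict([[0, 0], [1, 2], [2, 4], [3, 6]], n=3, p=2)` extrapolates the last state two frames ahead using the displacement averaged over three frames.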