Abstract: The goal of multi-object tracking (MOT) is to detect and track all objects in a scene across frames while maintaining a unique identity for each object. Most existing methods rely on the spatial-temporal motion features and appearance embedding features of the detected objects in consecutive frames. Effectively and robustly representing the spatial and appearance features of long trajectories has become a critical factor affecting MOT performance. We propose a novel approach for appearance and spatial-temporal motion feature representation that improves upon the hierarchical clustering association method MOT FCG. For spatial-temporal motion features, we first propose Diagonal Modulated GIoU, which more accurately represents the relationship between the position and shape of the objects. Second, we propose Mean Constant Velocity Modeling to reduce the effect of observation noise on target motion state estimation. For appearance features, we use a dynamic appearance representation that incorporates confidence information, making the trajectory appearance features more robust and global. Building on the baseline model MOT FCG, we obtain further performance improvements overall: we achieve 63.1 HOTA, 76.9 MOTA and 78.2 IDF1 on the MOT17 test set, and competitive performance on the MOT20 and DanceTrack sets.

What problem does this paper attempt to address?

This paper addresses several key problems in multi-object tracking (MOT):

1. **Representation of spatio-temporal motion features and appearance features**:
   - Existing methods are limited in how they represent spatio-temporal motion features and appearance features over long trajectories. For example, IoU (Intersection over Union) does not describe the spatial relationship between boxes accurately enough and does not reflect object shape well.
   - Using the median element as the appearance feature representation tends to introduce low-quality appearance features into long trajectories, which causes deviations in trajectory association.
2. **The influence of observation noise on target motion state estimation**:
   - Existing methods do not adequately handle the influence of observation noise on target motion state estimation, especially over long sequences, where the noise accumulates and degrades tracking accuracy.
3. **Improving multi-object tracking performance**:
   - The paper aims to improve overall multi-object tracking performance, especially robustness and accuracy in complex scenarios, by improving how spatio-temporal motion features and appearance features are represented.

To this end, the authors propose the following improvements:

1. **Diagonal Modulated GIoU**:
   \[ d_{DGIoU} = 1-\frac{L_2}{L_1}\cdot GIoU, \]
   \[ \lambda_C=\min\left(1,\frac{d_{DGIoU}}{2}+off\right), \]
   where, in the GIoU term, \(A\) and \(B\) represent the areas of the two objects and \(C\) represents the area of the smallest rectangle containing \(A\) and \(B\). This measure represents the positional relationship between objects more precisely and partially reflects the shape of the object bounding boxes.
2. **Dynamic Appearance**:
   \[ e_t=\beta_t e_{t - 1}+(1-\beta_t)e_{new}, \]
   \[ \beta_t=\beta_f+(1-\beta_f)\left(1-\frac{s_{det}-\sigma}{1-\sigma}\right), \]
   where \(s_{det}\) is the detection confidence and \(\sigma\) is the confidence threshold. The weight adapts to detection quality: at \(s_{det}=\sigma\) the weight is \(\beta_t=1\) and the new embedding is ignored, while at \(s_{det}=1\) it falls to \(\beta_t=\beta_f\). This yields a more global appearance representation and avoids low-quality features introduced by occlusion or background interference.
3. **Mean Constant Velocity Modeling**:
   \[ v=\frac{x_t - x_{t - N}}{N}, \]
   \[ x_{t + p}=x_t+v\times p, \]
   The future position is estimated from the velocity averaged over the past \(N\) frames. The mean of the \(N\) per-frame displacements telescopes to \((x_t - x_{t-N})/N\), so noise in any single observation has less influence on the velocity estimate and the prediction is more stable.

These improvements enable the MOT FCG++ method proposed in the paper to achieve significant performance improvements on multiple datasets, especially on the MOT17 and MOT20 datasets.
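The Dynamic Appearance and constant-velocity formulas above can be sketched in a few lines of NumPy. This is only an illustrative transcription of the equations as quoted in this summary, not the authors' implementation: the function names, the default values `sigma=0.5` and `beta_f=0.95`, and the array layout of `history` are assumptions made for the example.

```python
import numpy as np


def dynamic_beta(s_det, sigma, beta_f):
    """Smoothing weight beta_t as a function of detection confidence s_det."""
    # trust is 0 at the confidence threshold and 1 at full confidence
    trust = (s_det - sigma) / (1.0 - sigma)
    return beta_f + (1.0 - beta_f) * (1.0 - trust)


def update_appearance(e_prev, e_new, s_det, sigma=0.5, beta_f=0.95):
    """e_t = beta_t * e_{t-1} + (1 - beta_t) * e_new.

    sigma and beta_f defaults are illustrative values, not taken from the paper.
    """
    beta_t = dynamic_beta(s_det, sigma, beta_f)
    return beta_t * np.asarray(e_prev, dtype=float) + (1.0 - beta_t) * np.asarray(e_new, dtype=float)


def predict_position(history, n, p):
    """Predict x_{t+p} = x_t + v * p with v = (x_t - x_{t-N}) / N.

    history: sequence of past positions, oldest first, most recent last
             (hypothetical layout; needs at least n + 1 entries).
    """
    history = np.asarray(history, dtype=float)
    if n < 1 or len(history) < n + 1:
        raise ValueError("need n >= 1 and at least n + 1 past positions")
    v = (history[-1] - history[-1 - n]) / n  # mean velocity over the last n frames
    return history[-1] + v * p


if __name__ == "__main__":
    e_prev = np.array([1.0, 0.0])
    e_new = np.array([0.0, 1.0])
    # A detection at the threshold leaves the track embedding unchanged.
    print(update_appearance(e_prev, e_new, s_det=0.5))
    # A fully confident detection is blended in with weight 1 - beta_f.
    print(update_appearance(e_prev, e_new, s_det=1.0))

    track = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
    print(predict_position(track, n=3, p=2))
```

With the example track, the mean velocity over the last three frames is (1, 2) per frame, so the position predicted two frames ahead is (5, 10).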

MOT FCG++: Enhanced Representation of Spatio-temporal Motion and Appearance Features
