Video Text Tracking with Transformer-Based Local Search

Xingsheng Zhou, Cheng Wang, Xinggang Wang, Wenyu Liu
DOI: https://doi.org/10.1016/j.neucom.2024.128420
IF: 6
2024-01-01
Neurocomputing
Abstract: Video text tracking is an important branch of multi-object tracking (MOT) that aims to detect all text instances in video frames and construct a trajectory for each one. Conventional text tracking methods typically follow the Tracking-By-Detection (TbD) paradigm, which involves two separate steps: detection and association. In complex scenarios, exposure, occlusion, motion blur, and similar factors cause many text instances to go undetected, which breaks text trajectories in TbD methods. To address this, this paper proposes video text tracking with transformer-based local search (LSTrack). We use historical trajectory information to estimate the approximate search areas where the missed texts are located. Within those areas, our local search tracker uses the text images in historical trajectories as references to directly recall text instances. In addition, our tracking framework can incorporate the explicit semantic information output by an OCR model to obtain the semantic version (SV) of LSTrack, which for the first time uses the text edit distance as the distance measure in the matching stages to achieve better text tracking results. Our method achieves strong performance on several public benchmarks. In particular, LSTrack (SV) achieves state-of-the-art performance on the Minetto benchmark.
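To make the idea of using text edit distance as a matching cost concrete, the following is a minimal sketch and not the authors' implementation. It assumes each trajectory stores its most recent OCR string and that each new detection also carries an OCR string. The function names, the length normalization, the greedy assignment, and the `max_dist` threshold are all illustrative assumptions; the abstract does not specify how LSTrack (SV) normalizes or assigns matches.

```python
from typing import Dict, List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string (an assumed choice)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def match_by_text(tracks: Dict[int, str],
                  detections: List[str],
                  max_dist: float = 0.5) -> List[Tuple[int, int]]:
    """Greedily pair track ids with detection indices, closest text first.

    Greedy assignment and the max_dist threshold are hypothetical
    simplifications chosen for this sketch.
    """
    candidates = sorted(
        (normalized_distance(text, det), track_id, det_idx)
        for track_id, text in tracks.items()
        for det_idx, det in enumerate(detections)
    )
    used_tracks, used_dets = set(), set()
    matches: List[Tuple[int, int]] = []
    for dist, track_id, det_idx in candidates:
        if dist > max_dist:
            break  # remaining candidates are even farther apart
        if track_id in used_tracks or det_idx in used_dets:
            continue
        used_tracks.add(track_id)
        used_dets.add(det_idx)
        matches.append((track_id, det_idx))
    return matches


if __name__ == "__main__":
    tracks = {1: "EXIT", 2: "COFFEE"}
    detections = ["C0FFEE", "EXlT", "PARKING"]  # OCR output with typical confusions
    print(match_by_text(tracks, detections))
```

A text-based cost of this kind tolerates single-character OCR confusions (such as `0` for `O`) while still separating unrelated words, which is one plausible reason it helps association when appearance or position cues are unreliable.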