Coordinate-Aware Thermal Infrared Tracking Via Natural Language Modeling

Miao Yan, Ping Zhang, Haofei Zhang, Ruqian Hao, Juanxiu Liu, Xiaoyang Wang, Lin Liu
2024-07-26
Abstract: Thermal infrared (TIR) tracking is pivotal in computer vision tasks due to its all-weather imaging capability. Traditional tracking methods predominantly rely on hand-crafted features, and while deep learning has introduced correlation filtering techniques, these are often constrained by rudimentary correlation operations. Furthermore, transformer-based approaches tend to overlook temporal and coordinate information, which is critical for TIR tracking that lacks texture and color information. In this paper, to address these issues, we apply natural language modeling to TIR tracking and propose a coordinate-aware thermal infrared tracking model called NLMTrack, which enhances the utilization of coordinate and temporal information. NLMTrack applies an encoder that unifies feature extraction and feature fusion, which simplifies the TIR tracking pipeline. To address the challenge of low detail and low contrast in TIR images, on the one hand, we design a multi-level progressive fusion module that enhances the semantic representation and incorporates multi-scale features. On the other hand, the decoder combines the TIR features and the coordinate sequence features using a causal transformer to generate the target sequence step by step. Moreover, we explore an adaptive loss aimed at elevating tracking accuracy and a simple template update strategy to accommodate the target's appearance variations. Experiments show that NLMTrack achieves state-of-the-art performance on multiple benchmarks. The code is publicly available at https://github.com/ELOESZHANG/NLMTrack.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses three key problems in thermal infrared (TIR) target tracking:

1. **Limitations of feature extraction and fusion**: Traditional TIR tracking methods rely mainly on hand-designed features, and the correlation filtering techniques introduced by deep learning are limited by basic correlation operations. These methods struggle with complex real-world scenarios and perform poorly when texture and color information is insufficient.
2. **Neglect of spatio-temporal information**: Transformer-based methods often ignore temporal and coordinate information, which is especially important for TIR tracking because TIR images lack texture and color information.
3. **Challenges of low-detail and low-contrast images**: TIR images usually have a low signal-to-noise ratio and lack rich color information, which makes it difficult to extract discriminative features.

To solve these problems, the authors propose a new framework named NLMTrack, with the following main innovations:

- **Redefining the TIR tracking task as a language modeling task based on coordinate sequence generation**: A natural language model is introduced so that coordinate and temporal information is fully used.
- **Multi-level progressive fusion module**: The module enhances semantic representation and introduces multi-scale features to deal with problems such as background interference and target scale changes.
- **Adaptive loss function and template update strategy**: The SIoU loss is introduced to improve tracking accuracy, and a simple template update strategy adapts to changes in target appearance.

### Specific contributions

1. **Apply natural language modeling to TIR target tracking for the first time**, treating tracking as a coordinate sequence generation task and thereby using coordinate and temporal information more effectively.
2. **Design a multi-level progressive fusion module** that gradually fuses cross-semantic features through a simple feature pyramid, enhancing the semantic understanding of the target and retaining information at different scales.
3. **Propose an adaptive loss function** that maximizes the log-likelihood of the target sequence and combines it with the SIoU loss to constrain the spatial properties of the bounding box.
4. **Simplify the tracking framework** by performing feature extraction and fusion in a unified encoder, which avoids complex information aggregation modules and improves parallelism and feature discrimination.

With these improvements, the paper reports that NLMTrack outperforms existing methods on multiple benchmark datasets, with notably higher robustness and accuracy in complex scenarios.
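
The coordinate sequence formulation summarized above can be illustrated with a minimal sketch. It assumes that each bounding box coordinate is uniformly quantized into a fixed number of bins, so that a box becomes a short sequence of discrete tokens, and that training maximizes the log-likelihood of those tokens. The bin count, the `(x, y, w, h)` pixel format, and all function names are hypothetical choices for illustration and are not taken from the NLMTrack code. The SIoU term of the adaptive loss is not implemented here.

```python
import math

NUM_BINS = 1000  # hypothetical vocabulary size for coordinate tokens


def box_to_tokens(box, image_size, num_bins=NUM_BINS):
    """Quantize pixel coordinates (x, y, w, h) into integer tokens."""
    tokens = []
    for value in box:
        ratio = min(max(value / image_size, 0.0), 1.0)  # clamp to [0, 1]
        tokens.append(min(int(ratio * num_bins), num_bins - 1))
    return tokens


def tokens_to_box(tokens, image_size, num_bins=NUM_BINS):
    """Map tokens back to pixel coordinates using the bin centers."""
    return [(t + 0.5) / num_bins * image_size for t in tokens]


def sequence_nll(step_probs, target_tokens):
    """Negative log-likelihood of the target tokens.

    step_probs holds one mapping per decoding step, from token to the
    probability the model assigned to it at that step.
    """
    return -sum(math.log(p[t]) for p, t in zip(step_probs, target_tokens))


if __name__ == "__main__":
    tokens = box_to_tokens([64, 128, 32, 256], image_size=256)
    print(tokens)
    print(tokens_to_box(tokens, image_size=256))
```

In this sketch the quantization error per coordinate is at most one bin width (`image_size / num_bins`), which is why a sequence model over a modest vocabulary can still localize a box precisely. A causal decoder would predict these tokens one at a time, each conditioned on the image features and the previously generated tokens.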