Abstract: Thermal infrared (TIR) tracking is pivotal in computer vision tasks due to its all-weather imaging capability. Traditional tracking methods predominantly rely on hand-crafted features, and while deep learning has introduced correlation filtering techniques, these are often constrained by rudimentary correlation operations. Furthermore, transformer-based approaches tend to overlook temporal and coordinate information, which is critical for TIR tracking that lacks texture and color information. In this paper, to address these issues, we apply natural language modeling to TIR tracking and propose a coordinate-aware thermal infrared tracking model called NLMTrack, which enhances the utilization of coordinate and temporal information. NLMTrack applies an encoder that unifies feature extraction and feature fusion, which simplifies the TIR tracking pipeline. To address the challenge of low detail and low contrast in TIR images, on the one hand, we design a multi-level progressive fusion module that enhances the semantic representation and incorporates multi-scale features. On the other hand, the decoder combines the TIR features and the coordinate sequence features using a causal transformer to generate the target sequence step by step. Moreover, we explore an adaptive loss aimed at elevating tracking accuracy and a simple template update strategy to accommodate the target's appearance variations. Experiments show that NLMTrack achieves state-of-the-art performance on multiple benchmarks. The code is publicly available at https://github.com/ELOESZHANG/NLMTrack.
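The abstract describes the decoder as generating the target as a coordinate sequence. One common way to make box coordinates usable as a token sequence is to quantize each coordinate into a fixed number of bins. The following is a minimal sketch of that idea only; the bin count, the `(x1, y1, x2, y2)` box layout, and the function names `box_to_tokens` and `tokens_to_box` are assumptions for illustration and are not taken from the paper or its code.

```python
from typing import List, Tuple

# Assumed box layout: (x1, y1, x2, y2) in pixels of a square search image.
Box = Tuple[float, float, float, float]


def box_to_tokens(box: Box, image_size: int, num_bins: int) -> List[int]:
    """Quantize each pixel coordinate into an integer token in [0, num_bins - 1]."""
    tokens = []
    for value in box:
        # Clamp to the image so out-of-frame coordinates still map to a valid bin.
        clamped = min(max(value, 0.0), float(image_size))
        tokens.append(min(int(clamped / image_size * num_bins), num_bins - 1))
    return tokens


def tokens_to_box(tokens: List[int], image_size: int, num_bins: int) -> Box:
    """Map tokens back to pixel coordinates at the centre of each bin."""
    x1, y1, x2, y2 = ((t + 0.5) / num_bins * image_size for t in tokens)
    return (x1, y1, x2, y2)


if __name__ == "__main__":
    tokens = box_to_tokens((64.0, 128.0, 192.0, 256.0), image_size=256, num_bins=1000)
    print(tokens)
    print(tokens_to_box(tokens, image_size=256, num_bins=1000))
```

Under this scheme the round trip loses at most one bin width per coordinate, so a larger `num_bins` trades vocabulary size for localization precision.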

What problem does this paper attempt to address?

This paper addresses several key problems in thermal infrared (TIR) target tracking:

1. **Limitations of feature extraction and fusion**: Traditional TIR tracking methods rely mainly on hand-designed features, and the correlation filtering techniques introduced by deep learning are limited by basic correlation operations. These methods struggle in complex real-world scenarios and perform poorly when texture and color information is scarce.
2. **Neglect of spatio-temporal information**: Transformer-based methods often ignore temporal and coordinate information, which is especially important for TIR tracking because TIR images lack texture and color.
3. **Challenges of low-detail and low-contrast images**: TIR images usually have a low signal-to-noise ratio and lack rich color information, which makes it difficult to extract discriminative features.

To solve these problems, the authors propose a new framework named NLMTrack, with the following main innovations:

- **Redefining TIR tracking as a language modeling task based on coordinate sequence generation**: Natural language modeling is used to make full use of coordinate and temporal information.
- **Multi-level progressive fusion module**: The module enhances semantic representation and introduces multi-scale features to cope with background interference and target scale changes.
- **Adaptive loss function and template update strategy**: The SIoU loss is introduced to improve tracking accuracy, and a simple template update strategy adapts to changes in target appearance.

### Specific contributions

1. **Applies natural language modeling to TIR target tracking for the first time**, treating tracking as coordinate sequence generation and thereby using coordinate and temporal information more effectively.
2. **Designs a multi-level progressive fusion module** that gradually fuses cross-semantic features through a simple feature pyramid, improving semantic understanding of the target and retaining information at different scales.
3. **Proposes an adaptive loss function** that maximizes the log-likelihood of the target sequence and combines it with the SIoU loss to constrain the spatial properties of the bounding box.
4. **Simplifies the tracking framework** by performing feature extraction and fusion in a unified encoder, which avoids complex information aggregation modules and improves parallelism and feature discrimination.

With these improvements, NLMTrack outperforms existing methods on multiple benchmark datasets and shows higher robustness and accuracy in complex scenarios.
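The adaptive loss described above pairs a sequence log-likelihood term with a box-overlap term. The sketch below shows only that general shape, under stated assumptions: it substitutes a plain IoU loss (`1 - IoU`) for SIoU, which adds further penalty terms not reproduced here, and the weight `box_weight` and all function names are hypothetical rather than taken from the paper.

```python
import math
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2)


def sequence_nll(step_probs: Sequence[Sequence[float]], target_tokens: Sequence[int]) -> float:
    """Mean negative log-likelihood of the target tokens under per-step distributions."""
    if len(target_tokens) == 0:
        raise ValueError("target_tokens must not be empty")
    total = 0.0
    for probs, token in zip(step_probs, target_tokens):
        # Floor the probability to avoid log(0).
        total -= math.log(max(probs[token], 1e-12))
    return total / len(target_tokens)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def combined_loss(step_probs, target_tokens, pred_box: Box, target_box: Box,
                  box_weight: float = 1.0) -> float:
    """Token NLL plus a weighted IoU loss (plain IoU stands in for SIoU here)."""
    return sequence_nll(step_probs, target_tokens) + box_weight * (1.0 - iou(pred_box, target_box))
```

Minimizing the first term is equivalent to maximizing the log-likelihood of the target sequence, while the second term penalizes predicted boxes that overlap poorly with the ground truth even when individual tokens are nearly correct.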

Coordinate-Aware Thermal Infrared Tracking Via Natural Language Modeling
