Abstract: The current mainstream vision-language (VL) tracking framework consists of three parts, \ie a visual feature extractor, a language feature extractor, and a fusion model. To pursue better performance, a natural modus operandi for VL tracking is to employ customized, heavier unimodal encoders and multi-modal fusion models. Albeit effective, existing VL trackers separate feature extraction from feature integration, so the extracted features lack semantic guidance and have limited target-aware capability in complex scenarios, \eg similar distractors and extreme illumination. In this work, inspired by the recent success of foundation models with unified architectures for both natural language and computer vision tasks, we propose an All-in-One framework, which learns joint feature extraction and interaction by adopting a unified transformer backbone. Specifically, we mix raw vision and language signals to generate language-injected vision tokens, which we then concatenate before feeding them into the unified backbone architecture. This approach achieves feature integration in a unified backbone, removing the need for carefully designed fusion modules and resulting in a more effective and efficient VL tracking framework. To further improve learning efficiency, we introduce a multi-modal alignment module based on cross-modal and intra-modal contrastive objectives, providing more reasonable representations for the unified All-in-One transformer backbone. Extensive experiments on five benchmarks, \ie OTB99-L, TNL2K, LaSOT, LaSOT$_{\rm Ext}$ and WebUAV-3M, demonstrate the superiority of the proposed tracker over existing state-of-the-art VL trackers. Code will be made publicly available.
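
As one way to picture the cross-modal and intra-modal contrastive objectives mentioned in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes an InfoNCE-style loss, a temperature of 0.07, equal weighting of the two terms, and augmented second views as intra-modal positives; none of these details are stated in the abstract, and the names `info_nce` and `alignment_loss` are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def info_nce(anchors, positives, temperature=0.07):
    """InfoNCE loss where row i of `anchors` should match row i of `positives`.

    All other rows of `positives` act as negatives for anchor i.
    """
    a = l2_normalize(anchors)
    p = l2_normalize(positives)
    logits = a @ p.T / temperature                       # (N, N) similarity matrix
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))            # matched pairs lie on the diagonal


def alignment_loss(vision, language, vision_aug, language_aug, temperature=0.07):
    """Sum of a cross-modal term and an intra-modal term (assumed equal weights)."""
    # Cross-modal: pull each vision embedding toward its paired language embedding,
    # symmetrically in both directions.
    cross = 0.5 * (info_nce(vision, language, temperature)
                   + info_nce(language, vision, temperature))
    # Intra-modal (assumption): pull each embedding toward a second view
    # of the same sample within the same modality.
    intra = 0.5 * (info_nce(vision, vision_aug, temperature)
                   + info_nce(language, language_aug, temperature))
    return cross + intra


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    vision = rng.normal(size=(8, 16))
    language = vision + 0.01 * rng.normal(size=(8, 16))  # toy, nearly aligned pairs
    vision_aug = vision + 0.01 * rng.normal(size=(8, 16))
    language_aug = language + 0.01 * rng.normal(size=(8, 16))
    print(alignment_loss(vision, language, vision_aug, language_aug))
```

With correctly paired embeddings the loss is small; shuffling the language rows so that pairs no longer match drives it up, which is the signal that such an objective would use to align the two modalities before they enter the unified backbone.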

L4D-Track: Language-to-4D Modeling Towards 6-DoF Tracking and Shape Reconstruction in 3D Point Cloud Stream

Object tracking with 3D LIDAR via multi-task sparse learning

PA-Pose: Partial Point Cloud Fusion Based on Reliable Alignment for 6D Pose Tracking

Object-Level Pseudo-3D Lifting for Distance-Aware Tracking

Online Adaptation for Implicit Object Tracking and Shape Reconstruction in the Wild

Language-Conditioned Affordance-Pose Detection in 3D Point Clouds

Joint Representation Learning for Text and 3D Point Cloud

Tracking Objects with 3D Representation from Videos

BundleSDF: Neural 6-DoF Tracking and 3D Reconstruction of Unknown Objects

ShapeLLM: Universal 3D Object Understanding for Embodied Interaction

Motion-to-Matching: A Mixed Paradigm for 3D Single Object Tracking

3D Multi-object Detection and Tracking with Sparse Stationary LiDAR

Monocular Quasi-Dense 3D Object Tracking

All in One: Exploring Unified Vision-Language Tracking with Multi-Modal Alignment

Unified Scene Representation and Reconstruction for 3D Large Language Models

Lp-slam: language-perceptive RGB-D SLAM framework exploiting large language model

Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding

Dense Multimodal Alignment for Open-Vocabulary 3D Scene Understanding

2D-3D Pose Tracking with Multi-View Constraints

MMF-Track: Multi-modal Multi-level Fusion for 3D Single Object Tracking

LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning