Text-Guided Multi-Modal Fusion for Underwater Visual Tracking

Yonathan Michael, Sajid Javed, Mohamad Alansari
DOI: https://doi.org/10.1109/AVSS61716.2024.10672591
2024-07-15
Abstract: Integrating Natural Language (NL) descriptions with contemporary tracking algorithms is a new and fast-growing field that shows no sign of slowing. However, tracking datasets, and underwater tracking datasets in particular, lack comprehensive language descriptions, which substantially impedes progress in the field. The textual descriptions that accompany these datasets are typically brief and uninformative, omit relative location and direction of movement, and sometimes differ from how a human would naturally describe the target in ordinary conversation. To address this, we propose vivid, detailed NL descriptions tailored to UVOT400, an underwater tracking dataset. These descriptions capture a wide range of factors to give as complete an understanding of the target fish as possible. When evaluated with contemporary language-based tracking systems, they yield better performance than the best-performing visual-only trackers benchmarked on the same dataset.
Environmental Science, Engineering, Computer Science