Efficient Vision Transformer with Token-Selective and Merging Strategies for Autonomous Underwater Vehicles

Yu Jiang, Yongji Zhang, Yuehang Wang, Qianren Guo, Minghao Zhao, Hongde Qin
DOI: https://doi.org/10.1109/jiot.2024.3422389
IF: 10.6
2024-01-01
IEEE Internet of Things Journal
Abstract: Underwater fine-grained classification technology is crucial for discerning subtle differences among marine life classes, playing a pivotal role in marine resource exploration and the discovery of new species. Autonomous underwater vehicles equipped with this technology can enhance their environmental interaction and perception, providing critical data for Internet of Underwater Things (IoUT) systems. However, popular vision transformer (ViT)-based methods encounter challenges in complex marine environments, particularly due to limited computational resources. In this article, we introduce an efficient ViT with token-selective and merging strategies (TSMVT), which significantly improves underwater fine-grained classification performance while reducing the number of processed tokens. TSMVT can be flexibly integrated into various IoUT systems, promoting the discovery of new species and the sustainable development of marine ecology. First, we propose a dynamic token filtering mechanism that effectively retains important tokens, merges low-information tokens, and discards irrelevant background tokens, significantly reducing computational demands. Second, we propose the multihead attention weighting token-selective (MAWTS) module, which dynamically adjusts attention weights. MAWTS enables the network to focus on key features, such as fin shape, head structure, and body proportions, thereby improving classification accuracy. With a 30% reduction in tokens, TSMVT achieves superior precision in classifying marine species, enhancing its applicability on various underwater mobile platforms. Extensive experiments conducted on four marine and three terrestrial data sets demonstrate the outstanding accuracy and efficiency of the proposed TSMVT.
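The retain/merge/discard idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state how token importance is computed or how merging is performed, so the function name `filter_tokens`, the per-token `scores` input (for example, attention received from a class token), the `keep_ratio` and `drop_ratio` parameters, and the score-weighted averaging used for merging are all assumptions made for illustration.

```python
from typing import List, Sequence


def filter_tokens(
    tokens: Sequence[Sequence[float]],
    scores: Sequence[float],
    keep_ratio: float = 0.6,   # hypothetical: fraction of tokens retained as-is
    drop_ratio: float = 0.2,   # hypothetical: fraction of tokens discarded
) -> List[List[float]]:
    """Keep high-score tokens, merge mid-score tokens into one, drop the rest.

    `scores` are assumed to be non-negative importance values, one per token.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if keep_ratio < 0 or drop_ratio < 0 or keep_ratio + drop_ratio > 1:
        raise ValueError("ratios must be non-negative and sum to at most 1")

    n = len(tokens)
    n_keep = round(n * keep_ratio)
    n_drop = min(round(n * drop_ratio), n - n_keep)

    # Rank token indices from most to least important.
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)

    keep_idx = sorted(ranked[:n_keep])        # restore original token order
    merge_idx = ranked[n_keep:n - n_drop]     # low-information tokens
    # ranked[n - n_drop:] are treated as background and discarded.

    out = [list(tokens[i]) for i in keep_idx]

    if merge_idx:
        weights = [scores[i] for i in merge_idx]
        total = sum(weights)
        if total <= 0:
            # Fall back to a plain average when all weights are zero.
            weights = [1.0] * len(merge_idx)
            total = float(len(merge_idx))
        dim = len(tokens[merge_idx[0]])
        # Assumed merge rule: score-weighted average of the merged tokens.
        merged = [
            sum(w * tokens[i][d] for w, i in zip(weights, merge_idx)) / total
            for d in range(dim)
        ]
        out.append(merged)

    return out
```

With 10 input tokens and the illustrative defaults, 6 tokens are kept, 2 are merged into a single token, and 2 are discarded, leaving 7 tokens. The ratios were chosen only so that the sketch matches the 30% token reduction mentioned in the abstract; the actual thresholds used by TSMVT are not given there.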