SiamMMF: multi-modal multi-level fusion object tracking based on Siamese networks

Zhen Yang, Peng Huang, Dunyun He, Zhongwang Cai, Zhijian Yin
DOI: https://doi.org/10.1007/s00138-022-01354-2
IF: 2.983
2022-12-06
Machine Vision and Applications
Abstract: Feature-level or pixel-level fusion is a common technique for integrating information from different modalities in RGB-T object tracking. A good fusion method between modalities can significantly improve tracking performance. In this paper, a multi-modal, multi-level fusion model based on a Siamese network (SiamMMF) is proposed. SiamMMF consists of two main subnetworks: a pixel-level fusion network and a feature-level fusion network. The pixel-level fusion network fuses the infrared and visible light images by taking the maximum of the corresponding pixel values in the two images, and the combined images are used to replace the visible light images. The infrared images and the visible light images are each input to a dual-stream backbone for processing. After deep feature extraction, the visible and infrared features from the two branches are cross-correlated to obtain a fusion result that is sent to the tracking head for tracking. Extensive experiments showed that the best tracking performance is obtained when the weighting ratio between the visible and infrared modalities is set to 6:4. Nineteen pairs of RGB-T video sequences with different attributes were used to test the model and compare it with 15 trackers. On both evaluation criteria, success rate and precision rate, the network achieved the best results.
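The two fusion operations named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `pixel_max_fusion` and `weighted_fusion` are hypothetical, images are represented as plain nested lists of grayscale values, and the assumption that the 6:4 ratio is applied as a weighted sum of the two modality maps is ours, since the abstract does not state where the weights enter.

```python
def pixel_max_fusion(visible, infrared):
    """Pixel-level fusion: element-wise maximum of two equally sized images.

    Each image is a list of rows of grayscale values (an assumption made
    for illustration; real inputs would be image tensors).
    """
    if len(visible) != len(infrared) or any(
        len(v_row) != len(t_row) for v_row, t_row in zip(visible, infrared)
    ):
        raise ValueError("images must have the same shape")
    return [
        [max(v, t) for v, t in zip(v_row, t_row)]
        for v_row, t_row in zip(visible, infrared)
    ]


def weighted_fusion(visible_map, infrared_map, w_visible=0.6, w_infrared=0.4):
    """Weighted sum of two equally sized maps.

    The defaults follow the 6:4 visible-to-infrared ratio reported in the
    abstract; applying it as a plain weighted sum is an assumption.
    """
    return [
        [w_visible * v + w_infrared * t for v, t in zip(v_row, t_row)]
        for v_row, t_row in zip(visible_map, infrared_map)
    ]


# Toy 2x2 example with made-up pixel values.
visible = [[10, 200], [30, 40]]
infrared = [[50, 20], [30, 90]]
fused_image = pixel_max_fusion(visible, infrared)
fused_map = weighted_fusion(visible, infrared)
```

In this toy example, `fused_image` keeps the brighter pixel from either modality at each location, while `fused_map` blends the two inputs with the visible modality weighted slightly higher.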
computer science, cybernetics, artificial intelligence, engineering, electrical & electronic