TALON: Improving Large Language Model Cognition with Tactility-Vision Fusion

Xinyi Jiang, Guoming Wang, Huanhuan Li, Qinghua Xia, Rongxing Lu, Siliang Tang
DOI: https://doi.org/10.1109/iciea61579.2024.10665031
2024-01-01
Abstract: Current Multimodal Large Language Models (MLLMs) focus mainly on the vision and language modalities and often overlook other senses, such as tactile perception. In this paper, we present TALON (Improving Large Language Model Cognition with Tactility-Vision Fusion), which fuses tactile and visual input. We first develop a high-density flexible array tactile sensor, Hand-Scan, and deploy it on a data glove. We collect tactile information with the glove and visual information with a camera to construct the TALON dataset, which contains both tactile and visual data. We then train the TALON model on this dataset to align the modalities. Our experiments show that the TALON model reaches a recognition accuracy of 99.45%, surpassing vision-language-only training (97.58%) and tactility-language-only training (70.47%). In complex gesture recognition tasks in particular, accuracy reaches 98.82% (+3.06% over vision-language, +18.38% over tactility-language), demonstrating the effectiveness of tactility-vision fusion.
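The reported figures can be cross-checked with simple arithmetic. The short sketch below uses only the numbers stated in the abstract; the variable and function names are illustrative, and it assumes the "+3.06%" and "+18.38%" gains are absolute percentage-point differences, which the abstract does not state explicitly.

```python
# Accuracies (%) as reported in the abstract.
overall = {
    "fusion": 99.45,
    "vision-language": 97.58,
    "tactility-language": 70.47,
}

# Complex gesture recognition: fusion accuracy and reported gains.
complex_fusion = 98.82
complex_gain = {"vision-language": 3.06, "tactility-language": 18.38}


def gains_over_baselines(acc):
    """Return the fusion model's gain over each single-pairing baseline."""
    return {
        name: round(acc["fusion"] - value, 2)
        for name, value in acc.items()
        if name != "fusion"
    }


overall_gain = gains_over_baselines(overall)

# Assumption: the reported gains are percentage points, so the implied
# baseline accuracy is the fusion accuracy minus the gain.
implied_complex_baseline = {
    name: round(complex_fusion - gain, 2)
    for name, gain in complex_gain.items()
}

print(overall_gain)
print(implied_complex_baseline)
```

Under that assumption, the implied complex-gesture baselines are derived values, not figures reported by the authors.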