Abstract: Visual learning methods for object recognition have advanced considerably. However, machine vision recognition methods lose their effectiveness when objects are visually indistinguishable. Because tactile learning can access information that is unavailable to visual learning, it provides an important alternative modality for object recognition, and methods that integrate visual and tactile learning to recognize objects have therefore been explored. There is a clear gap between visual and tactile information, and this limitation has become increasingly prominent as visual-tactile learning has developed. Most existing visual-tactile fusion learning methods lack effective fusion mechanisms for handling different types of tactile information, and their accuracy is insufficient for practical industrial needs. In this article, we propose a visual-tactile fusion network (VITO-Transformer) for object recognition to address these problems. Specifically, we design a transformer-based mechanism that fuses visual and tactile information, addressing the difficulty of fusing two modalities that differ so greatly. With this fusion mechanism, object recognition accuracy is substantially improved. Finally, extensive comparative experiments on publicly available and self-made visual-tactile datasets verify the advantages of the proposed VITO-Transformer and validate the effectiveness of the proposed fusion mechanism against currently popular network algorithms. Because its tactile fusion mechanism can process different types of tactile information, the proposed VITO-Transformer offers a new solution for visual-tactile fusion.

VTFEFN: An End-to-End Visual-Tactile Feature Extraction and Fusion Network

Fusion of Low-Illuminance Visible and Near-Infrared Images Based on Convolutional Neural Networks

VITO-Transformer: A Visual-Tactile Fusion Network for Object Recognition

TCCFusion: An Infrared and Visible Image Fusion Method based on Transformer and Cross Correlation

DTFusion: Infrared and Visible Image Fusion Based on Dense Residual PConv-ConvNeXt and Texture-Contrast Compensation

THFuse: An Infrared and Visible Image Fusion Network using Transformer and Hybrid Feature Extractor

MEEAFusion: Multi-Scale Edge Enhancement and Joint Attention Mechanism Based Infrared and Visible Image Fusion

RTFusion: A Multimodal Fusion Network with Significant Information Enhancement

Visual-Tactile Fusion for Robotic Stable Grasping

TSVFN: Two-Stage Visual Fusion Network for multimodal relation extraction

A Human-Like Siamese-Based Visual-Tactile Fusion Model for Object Recognition

HDCTfusion: Hybrid Dual-Branch Network Based on CNN and Transformer for Infrared and Visible Image Fusion

MFFNet: Multi-modal Feature Fusion Network for V-D-T Salient Object Detection

TDDFusion: A Target-Driven Dual Branch Network for Infrared and Visible Image Fusion

Efficient Spatio-Temporal Tactile Object Recognition With Randomized Tiling Convolutional Networks In A Hierarchical Fusion Strategy

HitFusion: Infrared and Visible Image Fusion for High-Level Vision Tasks Using Transformer

A Late Fusion Approach for Harnessing Multi-Cnn Model High-Level Features

TFIENet: Transformer Fusion Information Enhancement Network for Multimodel 3-D Object Detection

GTMFuse: Group-Attention Transformer-Driven Multiscale Dense Feature-Enhanced Network for Infrared and Visible Image Fusion

Visual–Tactile Fusion Object Recognition Using Joint Sparse Coding

A Vision Enhancement and Feature Fusion Multiscale Detection Network