Method for Audio-to-Tactile Cross-Modality Generation Based on Residual U-Net

Yan Zhan, Xiaoying Sun, Qinglong Wang, Weizhi Nai
DOI: https://doi.org/10.1109/tim.2023.3336453
IF: 5.6
2023-01-01
IEEE Transactions on Instrumentation and Measurement
Abstract: Multimodal modeling has been studied extensively. However, audio-to-tactile cross-modality generation has not been systematically explored in the literature, particularly for tool-surface texture interaction. Existing studies indicate a significant perceptual correlation between auditory and tactile stimuli, which opens the possibility of creating tactile experiences from sound. This article therefore proposes a cross-modal network framework that generates tactile vibration signals from audio, enabling high-fidelity tactile rendering of textured surfaces. The Residual U-Net model converts sound signals into vibrotactile signals using time-frequency representations, and the resulting signals are displayed on a preexisting vibration device. In addition, an audio-to-tactile cross-modality dataset is constructed to train the proposed deep learning architecture. Experimental results show that the generated vibrotactile signals are visually and statistically close to the ground truth; in particular, the average structural similarity index (SSIM) of the generated temporal data reaches 0.8013. Subsequent user studies on perceived texture similarity indicate that users cannot distinguish the generated signal from the ground-truth signal when both are presented on the preexisting vibration device. Moreover, user studies evaluating the realism of the generated signals for rough textures yielded scores ranging from 5.77 to 6.36. These results highlight the effectiveness of the proposed audio-to-tactile cross-modality generative model.
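The SSIM figure quoted in the abstract can be made more concrete with a small sketch. The abstract does not say how the authors compute SSIM (window size, constants, or which signal representation is compared), so the code below is only an illustration: it applies the standard SSIM formula once over a whole 1-D signal (a single global window) with the commonly used constants K1 = 0.01 and K2 = 0.03. The function name `global_ssim`, the `data_range` parameter, and the synthetic sine signals are assumptions made for this example, not part of the paper.

```python
import math


def global_ssim(x, y, data_range=2.0, k1=0.01, k2=0.03):
    """Single-window SSIM between two equal-length 1-D signals.

    Illustrative only: the paper's exact SSIM settings are not given
    in the abstract. data_range is the assumed dynamic range of the
    signals (2.0 for signals normalised to [-1, 1]).
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("signals must be non-empty and of equal length")

    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    # Population variances and covariance over the whole signal.
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n

    # Stabilising constants from the standard SSIM definition.
    c1 = (k1 * data_range) ** 2
    c2 = (k2 * data_range) ** 2

    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


# Synthetic stand-ins for a ground-truth and a generated vibration signal.
reference = [math.sin(2 * math.pi * 5 * t / 200) for t in range(200)]
generated = [0.8 * v for v in reference]  # same shape, lower amplitude

print("SSIM(reference, reference):", global_ssim(reference, reference))
print("SSIM(reference, generated):", global_ssim(reference, generated))
```

A value of 1 means the two signals agree exactly in mean, variance, and structure; the value falls as the generated signal departs from the reference. Practical SSIM implementations usually average the index over sliding local windows rather than using one global window.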