CWT-ViT: A Time-Frequency Representation and Vision Transformer-Based Framework for Automated Robotic Surgical Skill Assessment

Yiming Zhang, Ying Weng, Boding Wang
DOI: https://doi.org/10.1016/j.eswa.2024.125064
IF: 8.5
2024-01-01
Expert Systems with Applications
Abstract: Surgical skill assessment currently hinges on manual observation by senior surgeons, a process that is inherently time-consuming and subjective. Hence, there is a need for machine learning-based automated robotic surgical skill assessment. However, existing machine learning-based works are built in either the time domain or the frequency domain and have not investigated the time-frequency domain. To fill this research gap, we explore the representation of surgical motion data in the time-frequency domain. In this study, we propose a novel automated robotic surgical skill assessment framework called Continuous Wavelet Transform-Vision Transformer (CWT-ViT). We apply the continuous wavelet transform, a time-frequency representation method, to convert robotic surgery kinematic data into synthesized images. Furthermore, taking advantage of prior knowledge of the da Vinci surgical system, we design a four-branch architecture in which each branch represents one robotic manipulator. We have conducted extensive experiments and achieved comparable results on the benchmark robotic surgical skill dataset JHU-ISI Gesture and Skill Assessment Working Set (JIGSAWS). Our proposed CWT-ViT framework demonstrates the feasibility of applying a time-frequency representation to automated robotic surgical skill assessment using kinematic data. The code is available at https://github.com/yiming95/CWT-ViT-Surgery.
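To make the conversion step in the abstract concrete, the following is a minimal sketch of how a single kinematic channel could be turned into a scalogram (the magnitude of its continuous wavelet transform), which can then be rendered as an image. It is not the authors' implementation, which is in the repository linked above. The complex Morlet wavelet, the parameter w0 = 6, the choice of scales, and the function names morlet and cwt_scalogram are all assumptions made for illustration; the abstract does not state which wavelet or scales the paper uses.

```python
import cmath
import math


def morlet(t, w0=6.0):
    """Complex Morlet mother wavelet (assumed choice; small correction term omitted)."""
    return math.pi ** -0.25 * cmath.exp(1j * w0 * t) * math.exp(-0.5 * t * t)


def cwt_scalogram(signal, scales, dt=1.0, w0=6.0):
    """Return |CWT| as a list of rows: one row per scale, one column per sample.

    Implements W(s, b) = (1 / sqrt(s)) * sum_i x[i] * conj(psi((i - b) * dt / s)) * dt
    by direct summation, treating samples outside the signal as zero.
    """
    n = len(signal)
    rows = []
    for s in scales:
        # Truncate the wavelet at +/- 4 scale widths, where its Gaussian
        # envelope has decayed to a negligible value.
        half = int(math.ceil(4.0 * s / dt))
        kernel = [
            morlet(k * dt / s, w0).conjugate() * dt / math.sqrt(s)
            for k in range(-half, half + 1)
        ]
        row = []
        for b in range(n):
            acc = 0j
            for j, w in enumerate(kernel):
                i = b + j - half  # index of the sample aligned with kernel tap j
                if 0 <= i < n:
                    acc += signal[i] * w
            row.append(abs(acc))
        rows.append(row)
    return rows


if __name__ == "__main__":
    # Hypothetical stand-in for one kinematic channel: a 0.1 cycles/sample sinusoid.
    channel = [math.sin(2 * math.pi * 0.1 * i) for i in range(128)]
    scalogram = cwt_scalogram(channel, scales=[2.0, 5.0, 10.0, 20.0, 40.0])
    for s, row in zip([2.0, 5.0, 10.0, 20.0, 40.0], scalogram):
        print(f"scale {s:5.1f}: mean |W| = {sum(row) / len(row):.4f}")
```

In a pipeline along the lines the abstract describes, one such scalogram would be computed per kinematic variable and the resulting images grouped by manipulator before being passed to the corresponding branch. That grouping is left out here because the abstract does not specify it. For real data, a library routine such as PyWavelets' pywt.cwt would be a more efficient substitute for the direct summation shown.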