An End-to-End Spatial-Temporal Transformer Model for Surgical Action Triplet Recognition

Xiaoyang Zou, Derong Yu, Rong Tao, Guoyan Zheng
DOI: https://doi.org/10.1007/978-3-031-51485-2_14
2024-01-01
Abstract: Surgical activity recognition plays an important role in computer-assisted surgery. Recently, the surgical action triplet has become the representative definition of fine-grained surgical activity; it is a combination of three components in the form of ⟨instrument, verb, target⟩. In this work, we propose an end-to-end spatial-temporal transformer model trained with multi-task auxiliary supervision, establishing a powerful baseline for surgical action triplet recognition. Rigorous experiments are conducted on the publicly available CholecT45 dataset for ablation studies and comparisons with state-of-the-art methods. Experimental results show that our method outperforms state-of-the-art methods by 6.8%, achieving 36.5% mAP for triplet recognition. Our method also won 2nd place in the action triplet recognition track of the CholecTriplet 2022 Challenge, which further demonstrates its capability.