Abstract: Online surgical phase recognition plays a significant role in building contextual tools that can quantify performance and oversee the execution of surgical workflows. Current approaches are limited in two ways. First, they train spatial feature extractors with frame-level supervision, which can lead to incorrect predictions because similar frames appear in different phases. Second, they fuse local and global features poorly because of computational constraints, which can affect the analysis of the long videos commonly encountered in surgical interventions. In this paper, we present a two-stage method, called Long Video Transformer (LoViT), that emphasizes the development of a temporally-rich spatial feature extractor and a phase transition map. The temporally-rich spatial feature extractor is designed to capture critical temporal information within the surgical video frames. The phase transition map provides essential insights into the dynamic transitions between different surgical phases. LoViT combines these innovations with a multi-scale temporal aggregator consisting of two cascaded L-Trans modules based on self-attention, followed by a G-Informer module based on ProbSparse self-attention for processing global temporal information. The multi-scale temporal head then leverages the temporally-rich spatial features and phase transition map to classify surgical phases using phase transition-aware supervision. Our approach consistently outperforms state-of-the-art methods on the Cholec80 and AutoLaparo datasets. Compared to Trans-SVNet, LoViT achieves a 2.4 pp (percentage point) improvement in video-level accuracy on Cholec80 and a 3.1 pp improvement on AutoLaparo. These results demonstrate the effectiveness of our approach in achieving state-of-the-art surgical phase recognition on two datasets covering different surgical procedures and temporal sequencing characteristics. The code will be available at https://github.com/MRUIL/LoViT.
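
To make the reported metric concrete, the following is a minimal sketch of how video-level accuracy and a percentage-point (pp) gain can be computed. It assumes that video-level accuracy means the frame accuracy of each video averaged over all videos, which is a common convention but is not spelled out in the abstract. The function names and the toy labels are hypothetical and are not taken from the LoViT code.

```python
from typing import Dict, Sequence


def frame_accuracy(pred: Sequence[int], truth: Sequence[int]) -> float:
    """Fraction of frames in one video whose predicted phase is correct."""
    if len(pred) != len(truth) or len(truth) == 0:
        raise ValueError("pred and truth must be non-empty and equally long")
    correct = sum(p == t for p, t in zip(pred, truth))
    return correct / len(truth)


def video_level_accuracy(
    preds: Dict[str, Sequence[int]], truths: Dict[str, Sequence[int]]
) -> float:
    """Average the per-video accuracies (assumed definition, see lead-in)."""
    per_video = [frame_accuracy(preds[name], truths[name]) for name in truths]
    return sum(per_video) / len(per_video)


def pp_gain(acc_new: float, acc_base: float) -> float:
    """Difference between two accuracies, expressed in percentage points."""
    return 100.0 * (acc_new - acc_base)


# Toy phase labels per frame (made-up data, for illustration only).
truths = {"video_a": [0, 0, 1, 1], "video_b": [0, 1, 1, 1, 2, 2, 2, 2]}
preds_base = {"video_a": [0, 0, 1, 0], "video_b": [0, 1, 1, 1, 2, 2, 1, 1]}
preds_new = {"video_a": [0, 0, 1, 1], "video_b": [0, 1, 1, 1, 2, 2, 2, 1]}

acc_base = video_level_accuracy(preds_base, truths)  # (0.75 + 0.75) / 2
acc_new = video_level_accuracy(preds_new, truths)    # (1.0 + 0.875) / 2
gain = pp_gain(acc_new, acc_base)                    # 18.75 pp on this toy data
```

Averaging per video rather than over all frames keeps long videos from dominating the score, which matters when video lengths vary widely.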

ViT-MPI: Vision Transformer Multiplane Images for Surgical Single-View View Synthesis

Structural Multiplane Image: Bridging Neural View Synthesis and 3D Reconstruction

Hybrid-MVS: Robust Multi-View Reconstruction with Hybrid Optimization of Visual and Depth Cues

Stereo Vision Conversion from Planar Videos Based on Temporal Multiplane Images

Mastoidectomy Multi-View Synthesis from a Single Microscopy Image

DP-MVS: Detail Preserving Multi-View Surface Reconstruction of Large-Scale Scenes

MMViT: Multiscale Multiview Vision Transformers

MPViT: Multi-Path Vision Transformer for Dense Prediction

View synthesis with multiplane images from computationally generated RGB-D light fields

Stereo Magnification: Learning View Synthesis using Multiplane Images

E-DSSR: Efficient Dynamic Surgical Scene Reconstruction with Transformer-based Stereoscopic Depth Perception

MonoViT: Self-Supervised Monocular Depth Estimation with a Vision Transformer

MV-DUSt3R+: Single-Stage Scene Reconstruction from Sparse Views In 2 Seconds

SinMPI: Novel View Synthesis from a Single Image with Expanded Multiplane Images

MMMViT: Multiscale multimodal vision transformer for brain tumor segmentation with missing modalities

MultiViPerFrOG: A Globally Optimized Multi-Viewpoint Perception Framework for Camera Motion and Tissue Deformation

LoViT: Long Video Transformer for surgical phase recognition

MDViT: Multi-domain Vision Transformer for Small Medical Image Segmentation Datasets

Lightweight Multiplane Images Network for Real-Time Stereoscopic Conversion from Planar Video

RIAV-MVS: Recurrent-Indexing an Asymmetric Volume for Multi-View Stereo

Tiled Multiplane Images for Practical 3D Photography