Abstract: CNN-based object detection models that balance performance and speed are increasingly used in polyp detection tasks. Nevertheless, accurately locating polyps within complex colonoscopy video scenes remains challenging because existing methods ignore two key issues: intra-sequence distribution heterogeneity and precision-confidence discrepancy. To address these challenges, we propose a novel Temporal-Spatial self-correction detector (TSdetector), which first integrates temporal-level consistency learning and spatial-level reliability learning to detect objects continuously. Technically, we first propose a global temporal-aware convolution, assembling the preceding information to dynamically guide the current convolution kernel to focus on global features between sequences. In addition, we design a hierarchical queue integration mechanism to combine multi-temporal features in a progressive accumulation manner, fully leveraging contextual consistency information while retaining long-sequence-dependency features. Meanwhile, at the spatial level, we advance a position-aware clustering to explore the spatial relationships among candidate boxes for recalibrating prediction confidence adaptively, thus eliminating redundant bounding boxes efficiently. The experimental results on three publicly available polyp video datasets show that TSdetector achieves the highest polyp detection rate and outperforms other state-of-the-art methods. The code is available at https://github.com/soleilssss/TSdetector.

What problem does this paper attempt to address?

This paper attempts to address two major challenges in accurately detecting polyps in colonoscopy videos: intra-sequence distribution heterogeneity and precision-confidence discrepancy.

1. **Intra-sequence distribution heterogeneity**:
   - This refers to the diversity of feature distributions across a sequence of video frames, specifically the feature differences between consecutive frames that arise from the dynamic nature of the colonoscopy process. For example, one frame may be clear, while the next may be distorted or occluded due to probe movement or other factors.
   - In endoscopic videos, this heterogeneity includes not only image quality fluctuations caused by motion artifacts and specular reflections, but also changes in the appearance of objects, structures, or backgrounds due to brightness changes, angle changes, liquid interference, and instrument occlusion. These changes introduce significant uncertainty for the detection algorithm, drawing the network's attention to irrelevant areas and leading to tracking failures.
2. **Precision-confidence discrepancy**:
   - The bounding box with the highest confidence score is not necessarily the true positive closest to the ground-truth annotation box. Because the model usually selects the candidate box with the highest confidence score, this mismatch may cause the most reliable proposals to be missed, while other objects with slightly lower confidence are simply discarded.

To address these challenges, the authors propose a new Temporal-Spatial self-correction detector (TSdetector), which improves detection through the following two self-correction stages:

1. **Temporal-level consistency learning**:
   - This stage aims to generate more refined proposals by guiding feature extraction and fusion with temporal knowledge. To this end, the authors propose the Global Temporal-aware Convolution (GT-Conv), whose convolution kernel weights are no longer static but are dynamically generated according to temporal context features. This enables GT-Conv to supplement the temporal modeling ability of traditional convolution and further optimize feature encoding.
   - In addition, a Hierarchical Queue Integration Mechanism (HQIM) is introduced, which is a long-short-term memory network that can capture multi-temporal features in a progressively cumulative manner. HQIM remembers and propagates previous information to the current frame, enhancing feature relevance to adapt to data evolution.
2. **Spatial-level reliability learning**:
   - This stage aims to reduce the gap between the confidence scores of candidate bounding boxes and their actual positive probabilities. To this end, the authors propose Position-Aware Clustering (PAC), a candidate box selection method based on spatial clustering. PAC uses the relationships between candidate boxes to gain a more comprehensive perspective, adaptively recalibrate confidence, effectively suppress redundant boxes, retain the candidate boxes with the highest overlap with the ground-truth box, and reduce the risk of false positives.

In summary, TSdetector compensates for the limitations of traditional CNN detection models by combining temporal-level and spatial-level optimizations, thereby improving the accuracy and robustness of polyp detection in colonoscopy videos.
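
The temporal-level idea described above (kernel weights that depend on context from earlier frames, with that context accumulated progressively from a queue) can be illustrated with a toy 1-D sketch. This is not the authors' GT-Conv or HQIM: the class `TemporalContextConv`, the `tanh` gain, the `decay` blending rule, and the queue length are all assumptions chosen only to make the mechanism concrete.

```python
from collections import deque
import math


def global_pool(feature):
    """Mean of a 1-D feature map, a stand-in for global average pooling."""
    return sum(feature) / len(feature)


def conv1d(signal, kernel):
    """Valid 1-D cross-correlation of a signal with a kernel."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


class TemporalContextConv:
    """Hypothetical layer: a static base kernel rescaled by past-frame context."""

    def __init__(self, base_kernel, queue_len=3, decay=0.5):
        self.base_kernel = list(base_kernel)
        self.queue = deque(maxlen=queue_len)  # descriptors of recent frames
        self.decay = decay                    # assumed blending factor

    def context(self):
        """Accumulate queued descriptors oldest-first so recent frames weigh more."""
        acc = 0.0
        for descriptor in self.queue:
            acc = self.decay * acc + (1.0 - self.decay) * descriptor
        return acc

    def __call__(self, frame):
        # Assumed modulation rule: gain is exactly 1 when no history exists.
        gain = 1.0 + math.tanh(self.context())
        kernel = [w * gain for w in self.base_kernel]
        out = conv1d(frame, kernel)
        self.queue.append(global_pool(frame))  # remember this frame for later ones
        return out


layer = TemporalContextConv([1.0, 0.0, -1.0])
frame = [0.0, 1.0, 2.0, 3.0, 4.0]
first = layer(frame)   # no history yet: behaves like a static convolution
second = layer(frame)  # same input, but the kernel is now rescaled by context
```

Feeding the same frame twice gives different outputs, because the second call sees a non-empty queue. That is the property the summary attributes to dynamically generated kernels, shown here in the simplest form that still runs.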
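
The precision-confidence discrepancy and the idea of recalibrating confidence from relationships among candidate boxes can be made concrete with a small sketch. This is not the paper's PAC algorithm: the greedy IoU grouping, the "mean IoU with cluster peers" support term, and the function name `cluster_and_recalibrate` are assumptions used only to show how a box with slightly lower raw confidence can win once spatial agreement is taken into account.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def cluster_and_recalibrate(boxes, scores, iou_thr=0.5):
    """Group overlapping boxes, then keep one box per group.

    Assumed rule (not from the paper): a box's recalibrated confidence is its
    raw score times its mean IoU with the other boxes in its group.
    Returns a list of (box_index, recalibrated_score), one per group.
    """
    unassigned = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    while unassigned:
        seed = unassigned[0]
        # The seed always joins its own cluster, so the loop always progresses.
        cluster = [seed] + [
            i for i in unassigned[1:] if iou(boxes[seed], boxes[i]) >= iou_thr
        ]
        unassigned = [i for i in unassigned if i not in cluster]

        def support(i):
            peers = [j for j in cluster if j != i]
            if not peers:
                return 1.0  # an isolated box keeps its raw score
            return sum(iou(boxes[i], boxes[j]) for j in peers) / len(peers)

        recalibrated = {i: scores[i] * support(i) for i in cluster}
        best = max(cluster, key=lambda i: recalibrated[i])
        kept.append((best, recalibrated[best]))
    return kept


# Toy candidates: box 0 has the top raw score but sits off to one side,
# boxes 1 and 2 agree closely with each other, box 3 is a separate detection.
boxes = [(10, 10, 50, 50), (14, 14, 54, 54), (16, 16, 56, 56), (200, 200, 240, 240)]
scores = [0.90, 0.88, 0.85, 0.60]
kept = cluster_and_recalibrate(boxes, scores)
kept_indices = [index for index, _ in kept]  # box 1 wins the first group, not box 0
```

Plain score-ordered suppression would keep box 0 here. The sketch instead keeps box 1, whose position is better supported by its neighbors, which mirrors the motivation given for recalibrating confidence before discarding redundant boxes.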

TSdetector: Temporal-Spatial Self-correction Collaborative Learning for Colonoscopy Video Detection

TSO-DETR: A Network for Small Object Detection of Cervical Cells in TCT Smear

Real-time automatic polyp detection in colonoscopy using feature enhancement module and spatiotemporal similarity correlation unit

An end-to-end tracking method for polyp detectors in colonoscopy videos

A New Framework for Detection of Initial Flat Polyp Candidates Based on a Dual Level Set Competition Model

Real-Time Gastric Polyp Detection Using Convolutional Neural Networks

YOLO-OB: An improved anchor-free real-time multiscale colon polyp detector in colonoscopy

An Efficient Polyp Detection Framework with Suspicious Targets Assisted Training

An Adaptive Regularization Approach to Colonoscopic Polyp Detection Using a Cascaded Structure of Encoder–Decoders

Computer-Aided Colon Polyp Detection on High Resolution Colonoscopy Using Transfer Learning Techniques

ECC-PolypDet: Enhanced CenterNet with Contrastive Learning for Automatic Polyp Detection

SSTFB: Leveraging self-supervised pretext learning and temporal self-attention with feature branching for real-time video polyp segmentation

AFP-Net: Realtime Anchor-Free Polyp Detection in Colonoscopy

Artificial intelligence to improve polyp detection and screening time in colon capsule endoscopy

Probabilistic Modeling Ensemble Vision Transformer Improves Complex Polyp Segmentation

Self-supervised Representation Learning Using Feature Pyramid Siamese Networks for Colorectal Polyp Detection

A self-attention based faster R-CNN for polyp detection from colonoscopy images

An automated detection system for colonoscopy images using a dual encoder-decoder model

Development and validation of a deep-learning algorithm for the detection of polyps during colonoscopy

Colonoscopy polyp detection with massive endoscopic images

Two‐stage deep‐learning‐based colonoscopy polyp detection incorporating fisheye and reflection correction