TSdetector: Temporal-Spatial Self-correction Collaborative Learning for Colonoscopy Video Detection

Kaini Wang, Haolin Wang, Guang-Quan Zhou, Yangang Wang, Ling Yang, Yang Chen, Shuo Li
2024-09-30
Abstract: CNN-based object detection models that strike a balance between performance and speed have been gradually adopted for polyp detection tasks. Nevertheless, accurately locating polyps within complex colonoscopy video scenes remains challenging because existing methods ignore two key issues: intra-sequence distribution heterogeneity and precision-confidence discrepancy. To address these challenges, we propose a novel Temporal-Spatial self-correction detector (TSdetector), which is the first to integrate temporal-level consistency learning and spatial-level reliability learning to detect objects continuously. Technically, we first propose a global temporal-aware convolution, which assembles preceding information to dynamically guide the current convolution kernel to focus on global features between sequences. In addition, we design a hierarchical queue integration mechanism that combines multi-temporal features in a progressively accumulative manner, fully leveraging contextual consistency information while retaining long-sequence-dependency features. Meanwhile, at the spatial level, we propose position-aware clustering, which explores the spatial relationships among candidate boxes to recalibrate prediction confidence adaptively, thus eliminating redundant bounding boxes efficiently. Experimental results on three publicly available polyp video datasets show that TSdetector achieves the highest polyp detection rate and outperforms other state-of-the-art methods. The code is available at https://github.com/soleilssss/TSdetector.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses two major challenges in accurately detecting polyps in colonoscopy videos: intra-sequence distribution heterogeneity and precision-confidence discrepancy.

1. **Intra-sequence distribution heterogeneity**
   - This refers to the diversity of feature distributions across a sequence of video frames: consecutive frames differ in their features because the colonoscopy process is dynamic. For example, one frame may be clear, while the next may be distorted or occluded due to probe movement or other factors.
   - In endoscopic videos, this heterogeneity includes not only image quality fluctuations caused by motion artifacts and specular reflections, but also changes in the appearance of objects, structures, or backgrounds due to brightness changes, angle changes, liquid interference, and instrument occlusion. These changes introduce significant uncertainty into the detection algorithm, drawing the network's attention to irrelevant areas and leading to tracking failures.
2. **Precision-confidence discrepancy**
   - This problem arises because the bounding box with the highest confidence value is not necessarily the true positive closest to the ground-truth annotation box. Since the model usually selects the candidate box with the highest confidence score, it may miss the most reliable proposals, while candidates with slightly lower confidence are simply discarded.

To address these challenges, the authors propose a Temporal-Spatial self-correction detector (TSdetector), which improves detection through two self-correction stages:

1. **Temporal-level consistency learning**
   - This stage aims to generate more refined proposals by guiding feature extraction and fusion with temporal knowledge. To this end, the authors propose the Global Temporal-aware Convolution (GT-Conv), whose convolution kernel weights are no longer static but are generated dynamically from temporal context features. This lets GT-Conv supply the temporal modeling ability that traditional convolution lacks and further optimizes feature encoding.
   - In addition, a Hierarchical Queue Integration Mechanism (HQIM) is introduced, which is a long-short-term memory network that captures multi-temporal features in a progressively cumulative manner. HQIM remembers previous information and propagates it to the current frame, enhancing feature relevance to adapt to data evolution.
2. **Spatial-level reliability learning**
   - This stage aims to reduce the gap between the confidence scores of candidate bounding boxes and their actual probability of being positive. To this end, the authors propose Position-Aware Clustering (PAC), a candidate box selection method based on spatial clustering. PAC uses the relationships between candidate boxes to derive a more comprehensive, adaptively recalibrated confidence, which effectively suppresses redundant boxes, retains the candidate boxes with the highest overlap with the ground-truth box, and reduces the risk of false positives.

In summary, TSdetector compensates for the limitations of traditional CNN detection models by combining temporal-level and spatial-level optimizations, thereby improving the accuracy and robustness of polyp detection in colonoscopy videos.
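
The idea behind GT-Conv, that kernel weights are generated from temporal context rather than fixed, can be illustrated with a toy 1D example. This is only a minimal sketch under stated assumptions, not the authors' GT-Conv: it assumes the kernel is a softmax-weighted mixture of a few candidate kernels, with mixing weights driven by the globally pooled features of the previous frame. The names `temporal_kernel`, `conv1d_valid`, `candidates`, and `routing` are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_kernel(
    prev_features: Sequence[float],
    candidate_kernels: Sequence[Sequence[float]],
    routing: Sequence[float],
) -> List[float]:
    """Build the current kernel from the previous frame's features.

    Assumption (not from the paper): the temporal context is the global
    average of the previous frame, and it selects a soft mixture of
    candidate kernels through one routing coefficient per candidate.
    """
    context = sum(prev_features) / len(prev_features)  # global average pooling
    weights = softmax([r * context for r in routing])
    size = len(candidate_kernels[0])
    return [
        sum(w * k[i] for w, k in zip(weights, candidate_kernels))
        for i in range(size)
    ]


def conv1d_valid(signal: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Sliding dot product without padding (cross-correlation, as in CNNs)."""
    n = len(signal) - len(kernel) + 1
    return [
        sum(signal[i + j] * kernel[j] for j in range(len(kernel)))
        for i in range(n)
    ]


# Two hypothetical candidate kernels: an edge-like filter and a smoothing filter.
candidates = [[1.0, 0.0, -1.0], [0.25, 0.5, 0.25]]
routing = [2.0, -2.0]
current_frame = [0.0, 1.0, 2.0, 3.0, 4.0]

# The same current frame is filtered differently depending on the previous frame.
kernel_a = temporal_kernel([0.0, 0.0, 0.0], candidates, routing)
kernel_b = temporal_kernel([1.0, 1.0, 1.0], candidates, routing)
out_a = conv1d_valid(current_frame, kernel_a)
out_b = conv1d_valid(current_frame, kernel_b)
```

With a zero context the mixture is uniform, so `kernel_a` is the average of the two candidates; a non-zero context shifts the mixture toward one candidate. A real detector would apply the same principle to 2D feature maps with learned routing parameters.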
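
The precision-confidence discrepancy and a clustering-based remedy can also be shown on a small example. The sketch below is an illustration of the general idea only, not the paper's PAC algorithm: it assumes candidates are grouped greedily by IoU and that each box's confidence is rescaled by its mean IoU with the other members of its cluster, so that a box agreeing spatially with its neighbours can outrank an isolated high-confidence box. The function `cluster_and_recalibrate`, the threshold, and the sample boxes are all hypothetical.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def cluster_and_recalibrate(
    boxes: Sequence[Box], scores: Sequence[float], iou_thr: float = 0.5
) -> List[Tuple[Box, float]]:
    """Group overlapping candidates and keep one box per group.

    Assumption (not from the paper): the recalibrated confidence of a box
    is its raw score times its mean IoU with all boxes in its cluster.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    assigned = set()
    kept: List[Tuple[Box, float]] = []
    for seed in order:
        if seed in assigned:
            continue
        # The seed plus every unassigned box that overlaps it enough.
        members = [seed] + [
            j for j in order
            if j != seed and j not in assigned
            and iou(boxes[seed], boxes[j]) >= iou_thr
        ]
        assigned.update(members)
        recalibrated = {
            i: scores[i] * sum(iou(boxes[i], boxes[j]) for j in members) / len(members)
            for i in members
        }
        best = max(members, key=lambda i: recalibrated[i])
        kept.append((boxes[best], recalibrated[best]))
    return kept


# Hypothetical candidates around a polyp whose ground truth is (10, 10, 50, 50).
boxes = [
    (16.0, 16.0, 56.0, 56.0),      # shifted box with the highest raw confidence
    (10.0, 10.0, 50.0, 50.0),      # best-localized box, slightly lower confidence
    (11.0, 11.0, 51.0, 51.0),      # near-duplicate of the best-localized box
    (100.0, 100.0, 140.0, 140.0),  # unrelated detection elsewhere in the frame
]
scores = [0.90, 0.85, 0.80, 0.60]
kept = cluster_and_recalibrate(boxes, scores)
```

Selecting purely by raw confidence would keep the shifted box with score 0.90. In this sketch the best-localized box wins its cluster instead, because the near-duplicate next to it raises its spatial support, while the unrelated detection forms its own cluster and is kept unchanged.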