Abstract: Deep learning-based computer-aided diagnosis techniques have demonstrated encouraging performance in endoscopic lesion identification and detection, and have reduced the rate of missed and false detections of disease during endoscopy. However, existing methods have not adequately addressed the interpretability of model results. This shortcoming manifests directly as a significant bias in feature localization: even models with good recognition performance make severe feature localization errors, particularly for lesions with subtle morphological features, and this hinders their clinical deployment. To alleviate this problem, we propose a method to correct the localization bias in the feature representations of recognition models for cancer-related lesions, which are difficult to label and identify accurately in clinical practice. The correction is applied in the training phase through a proposed data augmentation method and an auxiliary loss function based on clinical priors. The data augmentation method, called partial jigsaw, "breaks" the spatial structure of lesion-independent image blocks and enriches the data feature space, which reduces interference from background features and focuses the model on fine-grained lesion features. The annotation-based auxiliary loss function uses class activation maps for sample distribution correction and guides the model's localization representations to converge on the gold-standard annotations. With our method, the model reached a precision averaging 92.79%, an F1-score of 92.61%, and an accuracy of 95.56% on a dataset constructed from 23 hospitals. In addition, we quantitatively evaluated the visualized feature maps. Compared with the baseline model, the improved model showed significant offset correction in the visualized feature maps: the average visualization-weighted positive coverage improved from 51.85% to 83.76%. The proposed approach does not change the deployment capability or inference speed of the original model and can be incorporated into any state-of-the-art neural network. It also shows potential to provide more accurate localization inference results and to assist clinical examinations during endoscopy.
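
The partial jigsaw idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `partial_jigsaw`, the fixed `grid` parameter, the use of a binary `lesion_mask` to decide which blocks are lesion-independent, and the choice of a uniform random permutation are all assumptions made for illustration. The sketch shuffles only those grid blocks that contain no lesion pixels and leaves every block that touches the lesion in place.

```python
import numpy as np


def partial_jigsaw(image, lesion_mask, grid=4, rng=None):
    """Shuffle lesion-free grid blocks of `image`; keep lesion blocks in place.

    Hypothetical helper for illustration only.
    image:       array of shape (H, W) or (H, W, C)
    lesion_mask: boolean array of shape (H, W), True on lesion pixels
    grid:        number of blocks per side (assumed hyperparameter)
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = lesion_mask.shape
    bh, bw = h // grid, w // grid  # block size; remainder rows/cols stay untouched
    out = image.copy()

    # Top-left corners of blocks that contain no lesion pixels.
    background = [
        (r * bh, c * bw)
        for r in range(grid)
        for c in range(grid)
        if not lesion_mask[r * bh:(r + 1) * bh, c * bw:(c + 1) * bw].any()
    ]

    # Randomly permute the background blocks among their own positions.
    order = rng.permutation(len(background))
    for (dr, dc), src in zip(background, order):
        sr, sc = background[src]
        out[dr:dr + bh, dc:dc + bw] = image[sr:sr + bh, sc:sc + bw]
    return out


# Usage: an 8x8 RGB image whose top-left 2x2 block contains the lesion.
image = np.arange(8 * 8 * 3).reshape(8, 8, 3)
mask = np.zeros((8, 8), dtype=bool)
mask[1, 1] = True
augmented = partial_jigsaw(image, mask, grid=4, rng=np.random.default_rng(0))
```

Reading source blocks from the untouched `image` and writing into the copy `out` keeps the shuffle a true permutation, so no background block is duplicated or lost. In a training pipeline, a transform of this kind would presumably be applied per sample before the usual augmentations, but the exact placement is not stated in the abstract.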

Multi-label Recognition of Cancer-Related Lesions with Clinical Priors on White-Light Endoscopy
