Abstract: Reliable models are dependable and provide predictions that are acceptable given basic domain knowledge. It is therefore critical to develop and deploy reliable models, especially for healthcare applications. However, Multiple Instance Learning (MIL) models designed for Whole Slide Image (WSI) classification in computational pathology are not evaluated in terms of reliability. Hence, in this paper we compare the reliability of MIL models using three suggested metrics and three region-wise annotated datasets. We find the mean pooling instance (MEAN-POOL-INS) model more reliable than other networks despite its naive architecture and computational efficiency. The code to reproduce the results is accessible at: https://github.com/tueimage/MILs'R

What problem does this paper attempt to address?

The problem this paper addresses is how to evaluate the reliability of Multiple Instance Learning (MIL) models for Whole Slide Image (WSI) classification. Specifically, the paper points out that current MIL models in computational pathology lack a quantitative evaluation of reliability. The authors therefore propose three quantitative metrics for the reliability of MIL models and verify them experimentally on three region-wise annotated datasets.

### Problem Background

1. **Importance of MIL Models**:
   - MIL is a weakly supervised classification method that is widely used in computational pathology, because obtaining pixel-level annotations of histopathological data is usually very time-consuming.
   - MIL models group instances into bags and predict a label for each bag.
2. **Importance of Reliability**:
   - A reliable model provides acceptable predictions that are in line with basic domain knowledge, which is crucial for medical applications.
   - In computational pathology, a reliable model should base its predictions on biologically consistent features supported by scientific evidence.
3. **Limitations of Existing Evaluation Methods**:
   - At present, most studies assess model reliability only qualitatively, for example by showing selected slides and their heat maps.
   - Qualitative evaluation cannot cover every slide in the test set, and it requires pathology expertise, which machine learning researchers typically lack.

### Main Contributions of the Paper

1. **Proposing Quantitative Evaluation Methods**:
   - Three quantitative metrics are used to evaluate the reliability of MIL models: Mutual Information (MI), Spearman's Correlation (r_s), and Area Under the Precision-Recall Curve (PR-AUC).
2. **Experimental Verification**:
   - Three public WSI datasets (Camelyon16, CATCH, and TCGA BRCA) are used in the experiments so that the evaluation is comprehensive and diverse.
3. **Findings and Conclusions**:
   - The MEAN-POOL-INS model shows high reliability despite its simple architecture and high computational efficiency.
   - Multi-head attention models (such as ACMIL and MADMIL) perform well in both classification performance and reliability, but have high computational costs.
   - The authors call for more attention to model reliability and computational cost, and recommend considering both when developing new models.

Through this work, the authors hope to promote the use of more reliable and lightweight MIL models in WSI classification.
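
To make the three metrics concrete, here is a minimal pure-Python sketch. It assumes that reliability is measured by comparing a model's per-patch scores (for example attention weights or instance predictions) with binary patch labels derived from the region annotations. The summary above does not spell out the exact protocol, so that setup, the function names, the toy data, and the equal-width binning used for MI are all illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(scores, labels):
    """Spearman's r_s: Pearson correlation of the rank-transformed inputs."""
    rx, ry = average_ranks(scores), average_ranks(labels)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def pr_auc(scores, labels):
    """Average precision, a step-wise estimate of the area under the PR curve.

    Tied scores are ordered arbitrarily here (assumption: ties are rare).
    """
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    n_pos = sum(labels)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision at each recalled positive
    return total / n_pos


def mutual_information(scores, labels, n_bins=4):
    """MI in nats between equal-width-binned scores and binary labels."""
    lo, hi = min(scores), max(scores)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all scores are equal
    bins = [min(int((s - lo) / width), n_bins - 1) for s in scores]
    n = len(scores)
    joint = Counter(zip(bins, labels))
    p_bin, p_lab = Counter(bins), Counter(labels)
    return sum(
        (c / n) * math.log((c / n) / ((p_bin[b] / n) * (p_lab[y] / n)))
        for (b, y), c in joint.items()
    )


# Hypothetical toy slide: six patches, the first three lie inside the annotated region.
patch_scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
patch_labels = [1, 1, 1, 0, 0, 0]

print("r_s   :", spearman(patch_scores, patch_labels))
print("PR-AUC:", pr_auc(patch_scores, patch_labels))
print("MI    :", mutual_information(patch_scores, patch_labels))
```

In this toy case the scores separate annotated from non-annotated patches perfectly, so PR-AUC reaches its maximum of 1 and MI equals the entropy of the labels. A model whose scores ignore the annotated regions would score lower on all three metrics.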

Quantitative Evaluation of MILs' Reliability For WSIs Classification

Learning Hybrid Negative Probability Model for Weakly-Supervised Whole Slide Image Recognition

SI-MIL: Taming Deep MIL for Self-Interpretability in Gigapixel Histopathology

Iterative multiple instance learning for weakly annotated whole slide image classification

Establishing Causal Relationship Between Whole Slide Image Predictions and Diagnostic Evidence Subregions in Deep Learning

TransMIL: Transformer based Correlated Multiple Instance Learning for Whole Slide Image Classification

Whole Slide Images based Cancer Survival Prediction using Attention Guided Deep Multiple Instance Learning Networks

IIB-MIL: Integrated Instance-Level and Bag-Level Multiple Instances Learning with Label Disambiguation for Pathological Image Analysis

Attention2Minority: A salient instance inference-based multiple instance learning for classifying small lesions in whole slide images

TPMIL: Trainable Prototype Enhanced Multiple Instance Learning for Whole Slide Image Classification

Long-MIL: Scaling Long Contextual Multiple Instance Learning for Histopathology Whole Slide Image Analysis

RetMIL: Retentive Multiple Instance Learning for Histopathological Whole Slide Image Classification

MICIL: Multiple-Instance Class-Incremental Learning for skin cancer whole slide images

FR-MIL: Distribution Re-calibration based Multiple Instance Learning with Transformer for Whole Slide Image Classification

MAMILNet: advancing precision oncology with multi-scale attentional multi-instance learning for whole slide image analysis

MamMIL: Multiple Instance Learning for Whole Slide Images with State Space Models

Advances in Multiple Instance Learning for Whole Slide Image Analysis: Techniques, Challenges, and Future Directions

Predicting clinical endpoints and visual changes with quality-weighted tissue-based renal histological features

Distilling High Diagnostic Value Patches for Whole Slide Image Classification Using Attention Mechanism

Rethinking Pre-trained Feature Extractor Selection in Multiple Instance Learning for Whole Slide Image Classification

Boosting Whole Slide Image Classification from the Perspectives of Distribution, Correlation and Magnification