Abstract:

Background: The reproducibility crisis in AI research remains a significant concern. While code sharing has been acknowledged as a step toward addressing this issue, our focus extends beyond this paradigm. In this work, we explore "federated testing" as an avenue for advancing reproducible AI research and development, especially in medical imaging. Unlike federated learning, where a model is developed and refined on data from different centers, federated testing involves models developed by one team being deployed and evaluated by others, addressing reproducibility across various implementations.

Methods: Our study follows an exploratory design aimed at systematically evaluating the sources of discrepancies in the execution of a shared medical imaging model and in its outputs on the same input data, independent of generalizability analysis. We distributed the same model code to multiple independent centers, monitoring execution in different runtime environments while considering various real-world scenarios for pre- and post-processing steps. We analyzed deployment infrastructure by comparing the impact of different computational resources (GPU vs. CPU) on model performance. To assess federated testing in AI models for medical imaging, we performed a comparative evaluation across different centers, each with distinct pre- and post-processing steps and deployment environments, specifically targeting AI-driven positron emission tomography (PET) imaging segmentation. More specifically, we studied federated testing for an AI-based model for surrogate total metabolic tumor volume (sTMTV) segmentation in PET imaging: the AI algorithm, trained on maximum intensity projection (MIP) data, segments lymphoma regions and estimates sTMTV.

Results: Our study reveals that relying solely on open-source code sharing does not guarantee reproducible results, due to variations in code execution, runtime environments, and incomplete input specifications. Deploying the segmentation model on local and virtual GPUs compared to using Docker containers showed no effect on reproducibility. However, significant sources of variability were found in data preparation and pre-/post-processing techniques for PET imaging. These findings underscore the limitations of code sharing alone in achieving consistent and accurate results in federated testing.

Conclusion: Achieving consistently precise results in federated testing requires more than just sharing models through open-source code. Comprehensive pipeline sharing, including pre- and post-processing steps, is essential. Cloud-based platforms that automate these processes can streamline AI model testing across diverse locations. Standardizing protocols and sharing complete pipelines can significantly enhance the robustness and reproducibility of AI models.
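As an illustration of the comparison described in the abstract, the sketch below checks whether segmentation outputs produced by different centers on the same input agree with a reference center. It is a minimal sketch under stated assumptions, not the study's evaluation code: the function names, the center labels, the toy masks, and the choice of Dice overlap and segmented pixel area as agreement measures are all hypothetical, and the segmented area is used here only as a stand-in for a surrogate volume estimate.

```python
def dice(mask_a, mask_b):
    """Dice overlap between two binary masks given as nested lists of 0/1."""
    flat_a = [v for row in mask_a for v in row]
    flat_b = [v for row in mask_b for v in row]
    if len(flat_a) != len(flat_b):
        raise ValueError("masks must have the same shape")
    inter = sum(1 for a, b in zip(flat_a, flat_b) if a and b)
    total = sum(1 for a in flat_a if a) + sum(1 for b in flat_b if b)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2.0 * inter / total


def segmented_area(mask, pixel_area_mm2=1.0):
    """Segmented area of a 0/1 mask; pixel_area_mm2 is an assumed parameter."""
    return sum(1 for row in mask for v in row if v) * pixel_area_mm2


def compare_centers(outputs, reference, tolerance=1e-6):
    """Compare every center's mask against the reference center's mask."""
    ref_mask = outputs[reference]
    ref_area = segmented_area(ref_mask)
    report = {}
    for center, mask in outputs.items():
        if center == reference:
            continue
        area = segmented_area(mask)
        if ref_area == 0:
            rel_diff = 0.0 if area == 0 else float("inf")
        else:
            rel_diff = (area - ref_area) / ref_area
        d = dice(ref_mask, mask)
        report[center] = {
            "dice": d,
            "area_rel_diff": rel_diff,
            # Reproduced only if overlap and area both match within tolerance.
            "reproduced": d >= 1.0 - tolerance and abs(rel_diff) <= tolerance,
        }
    return report


# Toy masks (hypothetical): center_B matches the reference exactly, while
# center_C differs by one pixel, e.g. after a different pre-processing step.
outputs = {
    "center_A": [[0, 1, 1], [0, 1, 0], [0, 0, 0]],
    "center_B": [[0, 1, 1], [0, 1, 0], [0, 0, 0]],
    "center_C": [[0, 1, 1], [0, 1, 1], [0, 0, 0]],
}
report = compare_centers(outputs, reference="center_A")
for center, result in sorted(report.items()):
    print(center, result)
```

A strict tolerance suits the question posed here, since the same model on the same input is expected to give the same output; a looser tolerance would be a separate design choice for settings where small numerical differences are acceptable.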

Beyond Knowledge Silos: Task Fingerprinting for Democratization of Medical Imaging AI

Workflow Integration of Research AI Tools into a Hospital Radiology Rapid Prototyping Environment

Unlocking biomedical data sharing: A structured approach with digital twins and artificial intelligence (AI) for open health sciences

Knowledge AI: New Medical AI Solution for Medical image Diagnosis

Evaluating Knowledge Transfer in Neural Network for Medical Images

Why does my medical AI look at pictures of birds? Exploring the efficacy of transfer learning across domain boundaries

FairDomain: Achieving Fairness in Cross-Domain Medical Image Segmentation and Classification

Making sense of radiomics: insights on human–AI collaboration in medical interaction from an observational user study

End-to-end reproducible AI pipelines in radiology using the cloud

A Textbook Remedy for Domain Shifts: Knowledge Priors for Medical Image Analysis

From code sharing to sharing of implementations: Advancing reproducible AI development for medical imaging through federated testing

AI and the democratization of knowledge

Tesseract-medical imaging: open-source browser-based platform for artificial intelligence deployment in medical imaging

Democratizing Artificial Intelligence in Healthcare: A Study of Model Development Across Two Institutions Incorporating Transfer Learning

Discovery Viewer (DV): Web-Based Medical AI Model Development Platform and Deployment Hub

The Limits of Fair Medical Imaging AI In The Wild

Objective task-based evaluation of artificial intelligence-based medical imaging methods: Framework, strategies and role of the physician

Parallel Medical Imaging for Intelligent Medical Image Analysis: Concepts, Methods, and Applications

bAIoimage analysis: elevating the rate of scientific discovery -- as a community

Mind the Gap: Federated Learning Broadens Domain Generalization in Diagnostic AI Models

Unleashing the power of advanced technologies for revolutionary medical imaging: pioneering the healthcare frontier with artificial intelligence