How can we do better? Pitfalls in biomedical challenge design and how to address them

Annika Reinke, Matthias Eisenmann, Sinan Onogur, Marko Stankovic, Patrick Scholz, Tal Arbel, Hrvoje Bogunovic, Andrew P. Bradley, Aaron Carass, Carolin Feldmann, Alejandro F. Frangi, Peter M. Full, Bram van Ginneken, Allan Hanbury, Katrin Honauer, Michal Kozubek, Bennett A. Landman, Keno März, Oskar Maier, Klaus Maier-Hein, Bjoern H. Menze, Henning Müller, Peter F. Neher, Wiro Niessen, Nasir Rajpoot, Gregory C. Sharp, Korsuk Sirinukunwattana, Stefanie Speidel, Christian Stock, Danail Stoyanov, Abdel Aziz Taha, Fons van der Sommen, Ching-Wei Wang, Marc-André Weber, Guoyan Zheng, Pierre Jannin, Lena Maier-Hein
2018-01-01
Abstract: Since the first MICCAI grand challenge was organized in 2007 [1], the impact of biomedical image analysis challenges on both the research field and individual careers has been steadily growing. For example, the acceptance of a journal article today often depends on the performance of a new algorithm being assessed against state-of-the-art work on publicly available challenge datasets. Furthermore, challenge results matter for individuals' scientific careers and for the potential of algorithms to be translated into clinical practice. Yet, while the publication of papers in scientific journals and at prestigious conferences, such as MICCAI, undergoes strict quality control, the design and organization of challenges do not. To investigate the effect of common practice, we have formed an international initiative dedicated to analyzing and improving a variety of aspects related to biomedical challenge design, execution and reporting [2]. In the first part of our abstract presentation at the LABELS workshop, we are going to present some of the major pitfalls related to biomedical image analysis challenges today. Specifically, we will look at the following research questions:

RQ1: How robust are challenge rankings? What is the effect of
– the specific test cases used?
– the specific metric variant(s) applied?
– the rank aggregation method chosen (e.g., aggregation of metric values with the mean vs. the median)?

Shared first/senior authors.
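The last point of RQ1 (mean vs. median aggregation) can be illustrated with a minimal sketch. The algorithm names and per-case metric values below are hypothetical and chosen only to make the effect visible; they are not taken from any challenge, and the helper `rank_by` is an illustrative name rather than part of any challenge toolkit.

```python
from statistics import mean, median

# Hypothetical per-test-case metric values (higher is better) for three
# made-up algorithms. These numbers are illustrative, not challenge data.
scores = {
    "A": [0.90, 0.90, 0.90, 0.10],  # strong on most cases, one severe failure
    "B": [0.75, 0.75, 0.75, 0.75],  # uniformly moderate
    "C": [0.80, 0.80, 0.65, 0.65],  # mixed
}

def rank_by(aggregate, scores):
    """Return algorithm names ordered from best to worst aggregated value."""
    return sorted(scores, key=lambda name: aggregate(scores[name]), reverse=True)

ranking_mean = rank_by(mean, scores)      # aggregate per-case values with the mean
ranking_median = rank_by(median, scores)  # aggregate per-case values with the median

print("mean:  ", ranking_mean)
print("median:", ranking_median)
```

In this toy setting the single failure case pulls algorithm A to last place under the mean, while the median ignores it and ranks A first, so the same results yield two different winners depending on the aggregation choice.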