Harmonization techniques for machine learning studies using multi-site functional MRI data

Ahmed El-Gazzar, Rajat Mani Thomas, Guido van Wingen
DOI: https://doi.org/10.1101/2023.06.14.544758
2023-06-14
Abstract: In recent years, the collection and sharing of resting-state functional magnetic resonance imaging (fMRI) datasets across multiple centers have enabled studying psychiatric disorders at scale, and prompted the application of statistically powerful tools such as deep neural networks. Yet, multi-center datasets introduce non-biological heterogeneity that can confound the biological signal of interest and produce erroneous findings. To mitigate this problem, the neuroimaging community has adopted harmonization techniques previously proposed in other domains to remove site effects from fMRI data. The reported success of these approaches in improving the generalization of the models has varied significantly. It remains unclear whether harmonization techniques could boost the final outcome of multi-site fMRI studies, to what extent, and which approaches are best suited for this task. In an attempt to objectively answer these questions, we conduct a standardized, rigorous evaluation of seven different harmonization techniques from the neuroimaging and computer vision literature on two large-scale multi-site datasets (N = 2169 and N = 2366) to diagnose autism spectrum disorder and major depression disorder from static and dynamic representations of fMRI data. Interestingly, while all harmonization techniques removed site effects from the data, they had little influence on disorder classification performance in standard k-fold and leave-one-site-out validation settings over a well-tuned baseline. Further investigation shows that the baseline model implicitly learns site-invariant features, which could well explain its competitiveness with explicit harmonization techniques and suggest orthogonality between latent disease features and site-discriminative features. However, additional experiments show that harmonization methods can be critical for reporting faithful results in settings where there is high intra-site class imbalance and the learning algorithm is prone to overfitting on spurious features, confounding the final outcome of the study.