Deriving reproducible biomarkers from multi-site resting-state data: An Autism-based example

Alexandre Abraham, Michael Milham, Adriana Di Martino, R. Cameron Craddock, Dimitris Samaras, Bertrand Thirion, Gaël Varoquaux
DOI: https://doi.org/10.1016/j.neuroimage.2016.10.045
2016-11-18
Abstract: Resting-state functional Magnetic Resonance Imaging (R-fMRI) holds the promise to reveal functional biomarkers of neuropsychiatric disorders. However, extracting such biomarkers is challenging for complex multi-faceted neuropathologies, such as autism spectrum disorders. Large multi-site datasets increase sample sizes to compensate for this complexity, at the cost of uncontrolled heterogeneity. This heterogeneity raises new challenges, akin to those faced in realistic diagnostic applications. Here, we demonstrate the feasibility of inter-site classification of neuropsychiatric status, with an application to the Autism Brain Imaging Data Exchange (ABIDE) database, a large (N=871) multi-site autism dataset. For this purpose, we investigate pipelines that extract the most predictive biomarkers from the data. These R-fMRI pipelines build participant-specific connectomes from functionally-defined brain areas. Connectomes are then compared across participants to learn patterns of connectivity that differentiate typical controls from individuals with autism. We predict this neuropsychiatric status for participants from the same acquisition sites or different, unseen, ones. Good choices of methods for the various steps of the pipeline lead to 67% prediction accuracy on the full ABIDE data, which is significantly better than previously reported results. We perform extensive validation on multiple subsets of the data defined by different inclusion criteria. This enables detailed analysis of the factors contributing to successful connectome-based prediction. First, prediction accuracy improves as we include more subjects, up to the maximum number of subjects available. Second, the definition of functional brain areas is of paramount importance for biomarker discovery: brain areas extracted from large R-fMRI datasets outperform reference atlases in the classification tasks.
Machine Learning, Neurons and Cognition
What problem does this paper attempt to address?
The main problem this paper addresses is how to extract reproducible biomarkers from multi-center resting-state functional magnetic resonance imaging (R-fMRI) data, with autism spectrum disorder (ASD) as the application. Specifically, the researchers face the following challenges:

1. **Data heterogeneity**: Large multi-center datasets increase the sample size, but they also introduce uncontrolled heterogeneity, which poses new challenges similar to those of practical diagnostic applications. For example, research centers may differ in MRI acquisition protocols, participant instructions (such as eyes open or closed), and recruitment strategies (such as age groups, IQ ranges, impairment levels, treatment histories, and acceptable comorbidities). These differences affect both the extraction of biomarkers and the accuracy of diagnosis.
2. **Reproducibility and generalization ability of biomarkers**: Although previous studies have shown that R-fMRI can be used to identify biomarkers, the reproducibility and generalization ability of these methods in research or clinical settings remain controversial. Most R-fMRI studies have small sample sizes, and the differences in data acquisition, image processing, and sampling strategies across studies have not been quantified.
3. **Robustness of the prediction model**: To evaluate how well a model generalizes, researchers need to test it on unseen data, that is, use cross-validation. However, traditional cross-validation strategies usually do not account for potential site-specific confounding factors. This study therefore also measures model performance under uncontrolled variation by holding out entire sites, which more realistically simulates the clinical setting.
4. **Selection of the data processing flow**: The individual steps of the functional connectivity pipeline (brain region definition, time-series extraction, matrix estimation, and classification) each have a large impact on the results. The lack of ground truth for the functional architecture makes it difficult to validate an R-fMRI processing pipeline. The researchers therefore evaluate different options at each step to determine the optimal parameter-free processing flow.

By addressing these problems, the researchers aim to show that cross-site biomarkers of neuropsychiatric status can be learned reliably from heterogeneous multi-center data, and to provide an effective pipeline for extracting R-fMRI neurophenotypes.
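The held-out-site evaluation described in the third challenge can be illustrated with a minimal sketch. It is not the authors' code: the function name `leave_one_site_out` and the site labels are hypothetical, chosen only for illustration. The sketch shows how participants are split so that every test fold contains exactly one acquisition site that the model never saw during training.

```python
from collections import defaultdict


def leave_one_site_out(sites):
    """Yield (held_out_site, train_indices, test_indices), one split per site."""
    # Group participant indices by the site where they were scanned.
    by_site = defaultdict(list)
    for idx, site in enumerate(sites):
        by_site[site].append(idx)
    for held_out in sorted(by_site):
        test_idx = by_site[held_out]
        # Training set: every participant scanned at any other site.
        train_idx = sorted(
            i for site, members in by_site.items() if site != held_out for i in members
        )
        yield held_out, train_idx, test_idx


# Hypothetical site label for each of seven participants (not ABIDE data).
sites = ["site_A", "site_A", "site_B", "site_B", "site_B", "site_C", "site_C"]
splits = list(leave_one_site_out(sites))

for held_out, train_idx, test_idx in splits:
    print(held_out, "train:", train_idx, "test:", test_idx)
```

A classifier would be fitted on the `train_idx` participants and scored on the `test_idx` participants of each split, so the reported accuracy reflects generalization to an unseen site. In practice, scikit-learn's `LeaveOneGroupOut` splitter offers the same kind of split when the site labels are passed as `groups`.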