Bayesian variable selection using an informed reversible jump in imaging genetics: an application to schizophrenia

Djidenou Montcho, Daiane Zuanetti, Thierry Chekouo, Luis Milan
DOI: https://doi.org/10.48550/arXiv.2307.01134
2023-07-03
Applications
Abstract: Modern attempts in providing predictive risk for complex disorders, such as schizophrenia, integrate genetic and brain information in what is known as imaging genetics. In this work, we propose inferential and predictive methods to relate the presence of a complex disorder, schizophrenia, to genetic and imaging features and predict its risk for new individuals. Given functional Magnetic Resonance Image and Single Nucleotide Polymorphisms information of healthy and people diagnosed with schizophrenia, we use a Bayesian probit model to select discriminating variables, while to estimate the predictive risk, the most promising models are combined using a Bayesian model averaging scheme. For these purposes, we propose an informed reversible jump Markov chain Monte Carlo, named data driven or informed reversible jump, which is scalable to high-dimension data when the number of covariates is much larger than the sample size.
What problem does this paper attempt to address?
The problem this paper addresses is predicting the risk of schizophrenia by integrating genetic and brain imaging information. Specifically, the researchers propose a Bayesian method for selecting the genetic features (single-nucleotide polymorphisms, SNPs) and functional brain regions (regions of interest, ROIs) associated with schizophrenia, and for using these features to predict an individual's risk of developing the disorder. The proposed models and methods aim to improve the accuracy of early detection, so that targeted treatment can begin sooner and the development of the disease can potentially be prevented or delayed.

### Research Background

Schizophrenia is a complex multifactorial disease whose etiology and pathophysiological mechanisms have not been fully elucidated. An estimated 1% of the global population is affected, and common symptoms include hallucinations, delusions, cognitive dysfunction, disorganized thinking, and reduced movement. Diagnosis currently relies mainly on observing symptoms, and effective medical tests are lacking. Developing new methods or tests to complement existing medical tools is therefore of great public health significance.

### Research Methods

The researchers used functional magnetic resonance imaging (fMRI) and single-nucleotide polymorphism (SNP) data from healthy individuals and from patients diagnosed with schizophrenia. They adopted a Bayesian probit model to select discriminating variables and used Bayesian model averaging (BMA) to estimate the predictive risk. To this end, they proposed a data-driven reversible jump (DDRJ) Markov chain Monte Carlo algorithm, which scales to high-dimensional data in which the number of covariates is much larger than the sample size.
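As an illustration of how Bayesian model averaging can turn several selected probit models into one predictive risk, the sketch below averages each model's probit probability using its posterior model probability as the weight. It is a minimal plug-in version under stated assumptions: the names `probit_prob` and `bma_risk`, the covariate names, and all numbers are hypothetical, and each model is represented by a single coefficient estimate rather than by posterior draws as a full Bayesian treatment would use.

```python
from math import erf, sqrt


def probit_prob(eta):
    """Standard normal CDF Phi(eta), i.e. P(Y = 1) under a probit link."""
    return 0.5 * (1.0 + erf(eta / sqrt(2.0)))


def bma_risk(x, models):
    """Model-averaged P(Y = 1 | x) for one individual.

    x      : dict mapping covariate name -> value
    models : list of (weight, intercept, coefs), where weight is the
             (possibly unnormalised) posterior model probability and
             coefs maps covariate name -> coefficient estimate
    """
    total = sum(w for w, _, _ in models)
    risk = 0.0
    for w, intercept, coefs in models:
        # Linear predictor uses only the covariates selected in this model.
        eta = intercept + sum(c * x[name] for name, c in coefs.items())
        risk += (w / total) * probit_prob(eta)
    return risk


# Hypothetical example: two candidate models and one new individual.
models = [
    (0.6, -0.2, {"roi_1": 0.8}),
    (0.4, 0.1, {"roi_1": 0.5, "snp_3_add": -0.4}),
]
new_person = {"roi_1": 0.25, "snp_3_add": 1.0}
risk = bma_risk(new_person, models)
print(f"model-averaged risk: {risk:.3f}")
```

Because the weights are normalised inside `bma_risk`, the averaged risk always lies between the smallest and largest single-model probabilities.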
### Model Description

Under the Bayesian framework, the researchers assumed a probit model in which an unobservable latent variable \( Y_i^* \) is normally distributed around a linear predictor:

\[ Y_i^* = \beta_0 + \sum_{p \in G} \beta_p X_{ip} + \sum_{k \in M} \alpha_k Z_{ik} + \sum_{k \in M} \delta_k (1 - |Z_{ik}|) + \xi_i, \quad \xi_i \sim N(0, 1) \]

where:

- \( Y_i^* \) is the latent variable of the \( i \)-th individual.
- \( \beta_0 \) is the intercept term.
- \( \beta_p \) is the coefficient of the \( p \)-th ROI.
- \( \alpha_k \) and \( \delta_k \) are the additive and dominance effects of the \( k \)-th SNP, respectively.
- \( X_{ip} \) is the BOLD intensity of the \( i \)-th individual in the \( p \)-th ROI.
- \( Z_{ik} \) is the genotype of the \( i \)-th individual at the \( k \)-th SNP, taking the values \( -1, 0, 1 \) for genotypes \( aa, aA, AA \), respectively.
- \( \xi_i \) is the error term, which follows a standard normal distribution.

With this coding, \( 1 - |Z_{ik}| \) equals 1 for the heterozygous genotype \( aA \) and 0 otherwise, so \( \delta_k \) captures the deviation of heterozygotes from the purely additive effect. As is standard for a probit model, the observed disease status is 1 when \( Y_i^* > 0 \) and 0 otherwise.

### DDRJ Algorithm

The core of the DDRJ algorithm is an informed proposal of the next model, which makes the search over models more efficient. The main moves are:

1. **Birth**: From the remaining candidate variables, select one that is highly correlated with the residuals of the current model and add it to the model.
2. **Death**: Select a variable to remove from the model according to its importance in the current model (for example, the size of its coefficient).

### Prediction Performance Evaluation

The researchers used 5-fold cross-validation to evaluate predictive performance, measured by the misclassification error (MCE) and the area under the ROC curve (AUC). The reported results show that DDRJ performed well in all scenarios, accurately selected all relevant variables, and outperformed random forests in predictive performance.
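The informed Birth move described in the DDRJ section can be sketched as follows. This is only an illustration under stated assumptions: the summary does not give the exact proposal distribution, so the sketch assumes proposal weights proportional to the absolute correlation between each candidate variable and the current residuals. The function name `birth_probabilities` and the simulated data are hypothetical, and the reversible jump acceptance step that would follow the proposal is not shown.

```python
import numpy as np


def birth_probabilities(residuals, candidates):
    """Proposal probabilities for adding each candidate column.

    residuals  : array of shape (n,), residuals of the current model
    candidates : array of shape (n, q), variables not yet in the model

    Assumption for illustration: weights are proportional to
    |corr(residuals, candidate)|.
    """
    r = residuals - residuals.mean()
    c = candidates - candidates.mean(axis=0)
    denom = np.linalg.norm(r) * np.linalg.norm(c, axis=0)
    # Constant columns (zero norm) get correlation 0 instead of NaN.
    corr = np.abs(c.T @ r) / np.where(denom > 0, denom, np.inf)
    total = corr.sum()
    if total == 0:
        # No candidate carries any signal: fall back to a uniform proposal.
        return np.full(candidates.shape[1], 1.0 / candidates.shape[1])
    return corr / total


# Simulated example: candidate 2 drives the residuals.
rng = np.random.default_rng(0)
n = 200
candidates = rng.normal(size=(n, 5))
residuals = 1.5 * candidates[:, 2] + rng.normal(scale=0.5, size=n)

probs = birth_probabilities(residuals, candidates)
chosen = rng.choice(len(probs), p=probs)  # index of the variable proposed for birth
print(np.round(probs, 3), chosen)
```

Sampling from `probs` rather than always taking the top-ranked candidate keeps every variable reachable, which a valid Markov chain over models requires; a Death move could be weighted analogously, for example by the inverse of each coefficient's magnitude.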