Statistical and Network Analysis of Metabolomics Data
Ehsan Ullah, Raghvendra Mall, Reda Rawi, Halima Bensmail
DOI: https://doi.org/10.1145/2975167.2985683
2016-10-02
Abstract: Metabolomics encompasses the analysis of metabolites using profiling techniques such as mass spectrometry (MS) and nuclear magnetic resonance (NMR). Statistical analysis is performed on the profiled data to determine variations in the levels of metabolites. The goal is to reveal relationships between variations in metabolite concentrations and specific pathophysiological conditions, such as diseases, or external factors. Metabolomics has been widely used to characterize metabolites in body fluids such as saliva, serum and urine in many fields of medical research, including cancer [3], cardiology [6], diabetes [5], human infections [12], neurology [7], neonatology [4] and respiratory diseases [2], to name a few. The methods used in the statistical analysis of metabolomics data can be categorized as univariate or multivariate. Univariate methods are very commonly applied because they are easy to use and interpret. These methods consider metabolomic features (variables) one at a time, independently of each other, and thus ignore correlations with other features. Moreover, as pointed out by Alonso et al. [1], these methods ignore confounding variables such as age, gender and body mass index (BMI), which may lead to incorrect results [13, 15]. Multivariate methods, on the other hand, consider all the features and their correlations during data analysis. They include unsupervised methods such as principal component analysis (PCA) and supervised methods such as partial least squares (PLS) and support vector machines (SVM). Alonso et al. [1] have provided a review of univariate and multivariate methods used in metabolomics. To the best of our knowledge, many state-of-the-art statistical methods have not yet been used for metabolomic data analysis. A significant advantage of these methods over commonly used methods is their ability to process high-dimensional data.
Along with state-of-the-art statistical methods, we have used differential network analysis to identify variations at the system level. In this work we have analyzed urine samples from the Qatar Metabolomics Study on Diabetes (QMDiab) to identify potential biomarkers. QMDiab was conducted by Hamad Medical Corporation, Qatar (HMC) and Weill Cornell Medical College, Qatar in 2012 with approval from the Institutional Review Boards of HMC and Weill Cornell Medical College-Qatar (Research Protocol number 11131/11). Written informed consent was obtained from all participants. Subjects in the study included males and females of Arab and Asian ethnicities aged 17-81 years. Urine samples were sent to Chenomx Inc., Alberta, Canada for proton nuclear magnetic resonance (1H NMR) analysis. Although the original study targeted type 2 diabetes, in this paper we also focus on obesity, using BMI as a representative measure of obesity. The regularization models we have used are the elastic net, glinternet, the lasso projection and high-dimensional inference. The elastic net combines L1 and L2 penalties, resulting in a mix of ridge and lasso regression. The glinternet is a group-lasso based method developed by Lim and Hastie [9]. It learns pairwise interactions between variables in linear regression models that satisfy strong hierarchy. The lasso projection (lasso proj), or de-sparsified lasso, is a regularization-based method that performs statistical inference on low-dimensional parameters with high-dimensional data [17]. It uses a low-dimensional projection approach to construct confidence intervals for the estimated regression parameters. The high-dimensional inference computes P-values of variables and the associated confidence intervals in high-dimensional data [10].
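The mix of ridge and lasso regression in the elastic net can be illustrated with a minimal sketch. It assumes the common parametrization (1/(2n))·||y − Xβ||² + λ·(α·||β||₁ + (1 − α)/2·||β||²), which is not stated in this work, and fits it by plain coordinate descent. The function names, the parameters `lam` and `alpha`, and the toy data are hypothetical and are not taken from QMDiab or from the software used in the study.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the lasso part of the update)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_elastic_net(X, y, lam, alpha, n_iter=200):
    """Coordinate descent for the assumed elastic net objective (no intercept).

    alpha = 1 gives the lasso, alpha = 0 gives ridge regression.
    X is a list of rows, y a list of responses.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # covariance of feature j with the partial residual
            z = 0.0    # mean squared value of feature j
            for i in range(n):
                pred_without_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            rho /= n
            z /= n
            # L1 penalty soft-thresholds, L2 penalty inflates the denominator.
            beta[j] = soft_threshold(rho, lam * alpha) / (z + lam * (1.0 - alpha))
    return beta


# Toy data (illustrative only): y depends on the first feature and not the second.
X_toy = [[1, 0], [0, 1], [1, 1], [-1, 0], [0, -1], [-1, -1]]
y_toy = [2 * row[0] for row in X_toy]

lasso_like = fit_elastic_net(X_toy, y_toy, lam=0.5, alpha=1.0)
ridge_like = fit_elastic_net(X_toy, y_toy, lam=0.5, alpha=0.0)
```

With `alpha=1.0` the coefficient of the irrelevant second feature is set exactly to zero, whereas with `alpha=0.0` both coefficients are only shrunk. This is why the L1 part of the penalty is the one that performs variable selection.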
Further, we performed differential network analysis to identify variable interactions that differentiate between diabetic and non-diabetic, or obese and lean, subjects. The network for each group of samples is constructed using the mutual information between the variables. We applied the differential network analysis algorithm dGHD, proposed by Ruan et al. [14], to detect interaction patterns that differentiate two networks. The algorithm uses the Generalised Hamming Distance (GHD) to calculate topological differences between the networks and computes their statistical significance. Notably, the proposed methods, which to the best of our knowledge have not yet been applied in the field, identify potential biomarkers proposed in the literature by previous studies, even in a small dataset. The results for the elastic net, the glinternet and the lasso proj are summarized in Table 1. For the diabetes analysis, the significant variables identified include age, betaine, glycolate and glucose, which are well-known biomarkers for diabetes [8, 11]. For the obesity analysis, the significant variables identified include age, dimethylamine, succinate and cis-aconitate, previously identified by [16]. The high-dimensional inference identified only age and betaine for the diabetes study. We conclude that state-of-the-art statistical and network analysis methods can be used for metabolomics data analysis on datasets with a limited number of samples. The number of metabolomic features is increasing with the advancement of technologies. The ability of these methods to handle high-dimensional data makes them suitable for settings where the number of samples is smaller than the number of features. These methods can help in the identification of potential biomarkers in future studies.
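The distance underlying the differential network analysis can be sketched as follows. The sketch assumes that the GHD between two networks on the same N nodes is the mean squared difference of their mean-centered edge weights over all ordered node pairs, which is our reading of Ruan et al. [14] and not a formula given in this work. It applies that form to raw edge weights, whereas the original algorithm works on a topological mapping of each network, and it omits the significance test. The function names and the toy adjacency matrices are hypothetical.

```python
def centered_weights(adj):
    """Off-diagonal edge weights of adj, shifted to have mean zero."""
    n = len(adj)
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    mean = sum(adj[i][j] for i, j in pairs) / len(pairs)
    return {(i, j): adj[i][j] - mean for i, j in pairs}


def ghd(adj_a, adj_b):
    """Assumed GHD: mean squared difference of centered edge weights.

    adj_a and adj_b are square matrices (lists of rows) over the same nodes,
    for example mutual-information networks built from two sample groups.
    """
    if len(adj_a) != len(adj_b):
        raise ValueError("networks must have the same number of nodes")
    a = centered_weights(adj_a)
    b = centered_weights(adj_b)
    return sum((a[pair] - b[pair]) ** 2 for pair in a) / len(a)


# Toy networks (illustrative only): same three nodes, one edge rewired.
net_group_1 = [[0, 1, 0],
               [1, 0, 1],
               [0, 1, 0]]
net_group_2 = [[0, 0, 1],
               [0, 0, 1],
               [1, 1, 0]]

distance = ghd(net_group_1, net_group_2)
```

Identical networks give a distance of zero, and the value grows as more edges differ between the two groups. In a full analysis the observed distance would be compared against a null distribution to judge significance, a step left out of this sketch.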