Gearing up software systems for genome data : a case study with SAS
J. Zhao,Q. Tan,J. Luan,Shengxu Li,F. Xue,W. Qian,J. Ruth,F. Loos,N. Wareham
Abstract:Summary: We described implementation of analysis for genomewide association studies (GWASs) followed by a brief comparison with alternative implementation and software. We expect they will be useful for GWASs and studies with many variables. Availability: The software and supplementary information are available at http://www.mrc-epid.cam.ac.uk/~jinghua.zhao. Contact: jinghua.zhao@mrc-epid.cam.ac.uk Supplementary information: Supplementary materials are available at Bioinformatics online. Large-scale genome data have underlain the recent discoveries of genes or single nucleotide polymorphisms (SNPs) associated with diseases or other traits in humans. While the major scientific aspects in these genome-wide association studies (GWASs) are well recognized (Couzin and Kaiser, 2007; Altshuler, et al., 2008; Donnelly, 2008; Pearson and Manolio 2008), we believe that many practical issues are wide open. We previously argued the case for analysis of genomic data in general software systems such as SAS or R for data management, programming, validation and development of analytical tools (Zhao and Tan, 2006a). SAS has been widely used by both the industries and academic institutions and R is a free, open source statistical and programming environment that is increasingly popular (Vance, 2009a,b). Here we will focus on SAS, for we have used it for genetic data before most of the recent tools for GWASs became available. Despite its advantage, our experience (Zhao, et al., 2007; Loos, et al., 2008) with a study of ~4000 population-based individuals each with Affymetrix 500K GeneChip data has shown that additional challenges present themselves when tackling genomic analysis. In particular, we found that data partition (i.e., into smaller subsets of SNPs) was necessary for calculation of per SNP summary statistics and association testing. We here present procedures that can be used to overcome this problem along with other developments. These include 1. converting from an individual-by-SNP format (one column per SNP) to a more desirable SNP-byindividual format (one column per individual), with the option to include allele labels from a map file for direct coding of SNPs as will be detailed below, 2. placing the phenotype and covariates in a *To whom correspondence should be addressed. separate file for greater flexibility, 3. using established procedures for analysis involving haplotypes, imputed genotypes, metaanalysis, and 4. obtaining individual data or summary statistics efficiently obtained from HapMap (The International HapMap Consortium, 2007). These procedures simplify the implementation of analyses a great deal and allow researchers to undertake a wide range of analysis effectively. We will elaborate the data organization first. With many SNPs and the prospect of involving the whole genome, the SNP-byindividual format is preferable but nevertheless non-standard for most general statistical systems. This is because per SNP association model would involve specification of a particular SNP name, it would be laborious to literally lay out all models from data with an individual-by-SNP format. To get around this we have implemented an algorithm such that for each SNP, we read map information and iteratively proceed through individuals’ genotype column-wise together with a record in the trait file containing phenotype and covariates corresponding to the particular column in the genotype file, and all three sources of information are stored in a long format file. We can keep track of successful genotype calls per SNP. The long file can be processed by SAS/GENETICS with a BY statement plus the notsorted option, which really instructs a per SNP analysis. With additional indicators in the genotype and phenotype files, one can focus on a subset of SNPs and individuals without much overhead. Although it is a common practice to designate alleles to be minor or major according to their population frequencies, and to code alleles of SNPs according the type or number of effect alleles, we found this to be a bottleneck for our previous implementation because we needed to count the number of alleles, pick up the minor allele and compare against the original genotypes. We now used an operational rule of keeping allele labels (i.e., A, C, G, T) in alphabetical order and treating the second allele to be effective so that additive or other types of coding can be done per genotype at runtime when the long file is generated. It is easily seen that for interpretation one only needs to swap the roles of minor and major alleles whenever appropriate, although minor allele is more an indication of the amount of information in our sample concerned about a particular SNP. This also allows for analysis via standard procedures other than those available from SAS/GENETICS, such Gearing up software systems for genome data. 2 as REG and LOGISTIC for linear and logistic regression analysis. We note that a typical SAS program divides tasks into data preparation (DATA step) and analysis (PROC step). Although it is possible to mix them in order to avoid the extra storage, the programming would be more involved and less efficient than what we have just described. While the implementation is surprisingly short for most per SNP analyses, it is also computationally fast, memory efficient and SAS/BASE alone can furnish the necessary step for following analysis. Example using our EPIC-Norfolk data is provided as Supplementary materials. This suggested that it is viable to script tools such as awk yet more transparent and flexible. The long file uses more disk space when many variables are included in the analysis but a temporary use is often acceptable, especially for a subset of SNPs of interest. In addition, our example extends to family data by keeping identifiers for pedigrees, individuals, and parents in the phenotype file as key components in pedigree data analysis. The remaining aspects we would like to mention are in regard to haplotypes, imputed genotypes, meta-analysis and publicly available data. Through SURVEYREG and SURVEYLOGISTIC procedures we can use posterior probability of probabilities of haplotype assignments, imputed genotypes, while meta-analysis can be facilitated by procedures such as MIXED, whether or not to account for heterogeneity or covariates. Data as available from HapMap in compressed format can be obtained through a short DATA step. On the other hand, it would not be so difficult though not as straightforward to wrap up our algorithms via SAS/IML, an interactive matrix language, to process data without apparent use of the extra storage. We also use SAS in conjunction with independent programs as are listed at the Rockefeller linkage server (http://linkage.rockefeller.edu), for it provides the much needed validation and it would generally be a slow process to tune the comprehensive procedures for large data but to fast implement specific tasks, e.g., PLINK (Purcell, et al., 2007) in a similar spirit and the C++ programming language, although it might be unnecessary or a matter of convenience with its feature to split tasks such as frequency calculation and association modeling. In an ongoing study of lung function for a consortium involves 38 analyses at the first phase running SNPTEST on imputed genotypes produced by IMPUTE (Marchini, et al., 2007), all input files were orchestrated by SAS. A collection of R packages is under development (Zhao and Tan, 2006b, and for GWASs these include SNPassoc, GenABEL, pbatR and snpMatrix (Gonzalez, et al., 2007; Aulchenko, et al., 2007; Clayton and Leung, 2007), all available from the Comprehensive R Archive Network (http:///cran.r-project.org) and/or BioConductor (http://www.bioconductor.org). They customarily couple R language with calls to C/C++/Fortran routines or independent programs. Since a data object in R can often be directly treated as a matrix, it renders flexible specification of data in statistical modeling. Moreover, R has been our software of choice for graphical presentation. Epidemiological cohorts eventually play a big role in gene characterization and discoveries (Longholz, et al., 1999; Bodmer and Bonilla, 2008; Manolio, 2009). Our work is a good example of how to reframe a problem in order to arrive at a better solution. We have only covered elementary analysis with a lot of non-genetic variables and a naïve method for data compression, but for sequence data it would be worthwhile to consider better approaches (e.g., Christley, et al., 2009) or faster algorithms (e.g., Wigginton, et al., 2005 ). If general software systems themselves could go beyond comprehensiveness and correctness to be fast, they would be more in line with the pace of GWASs or studies with data of similar kind.