Decision Tree Classifier Makes Genotyping More Intuitive and More Efficient
H. Lee, B. Wang, X. Wu, H. Zhang, F. Xu
DOI: https://doi.org/10.1111/j.1399-0039.2012.01901.x
2012-01-01
Tissue Antigens
Abstract: With the rapid development of life science research, a large number of species have been sequenced and ever more genomic sequence variants have been discovered, which has led to an increasing demand for genotyping in functional DNA sequence research. Genotyping refers to the process of determining the genetic constitution of an individual by examining its DNA sequence with a biological assay. The majority of assays require three basic steps: polymerase chain reaction (PCR), allele discrimination and allele detection. There are four popular allele discrimination methods: primer extension, hybridization, ligation and enzymatic cleavage. The major methods of allele detection include mass spectrometry, fluorescence and chemiluminescence (1).
The human major histocompatibility complex is a genomic region on chromosome 6p21.3, and the highly polymorphic human leukocyte antigen (HLA) genes in this locus perform the crucial function of antigen presentation. Medical research has found that more precise HLA matching between donor and recipient reduces immunological complications and increases the survival rate in transplantation, particularly in bone marrow transplantation (2). In addition, HLA typing results can be used as important forensic evidence in paternity identification and criminal identification (3). PCR-SSP (sequence-specific primer) and PCR-SBT (sequence-based typing) are at present the most commonly used genotyping methods in this locus. Compared with PCR-SSP, PCR-SBT can genotype at high resolution and can be used to discover new alleles by sequence alignment. However, the conventional PCR-SBT method has limited capability for resolving the sequences of heterozygous samples in diploid genomes, which results in ambiguous genotyping results. Some special sequencing methods, such as pyrosequencing (4) and sequencing after cloning (5), can effectively solve the problem.
A single-nucleotide polymorphism (SNP) is a base substitution of one nucleotide with another nucleotide (A, T, C or G). Genotypes of samples can be determined by detecting a set of SNPs, so genotyping based on SNPs is widely used. With the development of artificial intelligence methods, many classification algorithms, such as artificial neural networks (6), support vector machines (7) and the decision tree classifier (DTC; 8-11), have been applied in the biomedical data mining domain. DTC is a nonparametric classification method that takes a multistage approach, breaking up a complex decision into a union of several simpler decisions. Among representative DTC algorithms, ID3 is widely used for discrete attributes, whereas C4.5 handles both discrete and continuous attributes. The faster C5.0 algorithm, based on a binary tree, gradually made DTC more specific (12-15). Figure 1A shows an example of building a genotyping DTC: multiple sequence alignment (MSA) is used to extract the feature SNPs from labeled sequences, and a decision tree is then built to classify genotypes. Complex structures such as the HLA genes, which comprise hundreds of SNPs, can also be used to build a genotyping DTC. The ID3 algorithm, first proposed by Quinlan, selects as the decision attribute the one that yields minimal entropy (equivalently, maximal information gain). In this study, the ID3 algorithm was used to build the DTC. To apply ID3, a node structure was constructed that contains the branch attributes (A, T, C, G and ‘-’), the column position of the SNP in the MSA file and the level of the node. Calling the ID3 procedure recursively on each node generates the complete decision tree. A MATLAB-based package, the Genotyping Decision Tree Classifier Builder (GDTCB), was developed (http://code.google.com/p/dtbg).
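The extraction of feature SNPs and the recursive ID3 construction described above can be illustrated with a minimal Python sketch. It rests on several assumptions that are not part of the original work: the alignment is a plain dictionary of equal-length strings, each allele is treated as its own class, a node is a nested dictionary holding the column, the level and one branch per observed base, and all names (`feature_columns`, `split_entropy`, `build_tree`, the toy alleles) are hypothetical.

```python
import math
from collections import Counter


def feature_columns(alignment):
    """Indices of alignment columns in which the sequences differ (candidate SNPs)."""
    length = len(next(iter(alignment.values())))
    return [i for i in range(length)
            if len({seq[i] for seq in alignment.values()}) > 1]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def split_entropy(alignment, names, col):
    """Weighted entropy of the allele labels after splitting on one column."""
    groups = {}
    for name in names:
        groups.setdefault(alignment[name][col], []).append(name)
    return sum(len(g) / len(names) * entropy(g) for g in groups.values())


def build_tree(alignment, names, columns, level=0):
    """Recursive ID3. A leaf is a list of allele names; an inner node is a dict."""
    # Keep only columns that still separate the alleles reaching this node.
    informative = [c for c in columns
                   if len({alignment[n][c] for n in names}) > 1]
    if len(names) == 1 or not informative:
        return sorted(names)
    # ID3 criterion: pick the column with minimal entropy after the split.
    best = min(informative, key=lambda c: split_entropy(alignment, names, c))
    branches = {}
    for name in names:
        branches.setdefault(alignment[name][best], []).append(name)
    rest = [c for c in informative if c != best]
    return {"column": best, "level": level,
            "branches": {base: build_tree(alignment, group, rest, level + 1)
                         for base, group in sorted(branches.items())}}


# Toy alignment (made-up sequences, not HLA data); '-' marks a gap.
toy = {
    "allele_1": "ACGTA",
    "allele_2": "ACGTC",
    "allele_3": "ATGTA",
    "allele_4": "AT-TC",
}
snp_columns = feature_columns(toy)
tree = build_tree(toy, list(toy), snp_columns)
```

Because every allele is its own class here, a column that splits the alleles into many small groups scores lowest and is chosen first. The sketch is a Python illustration only and does not reproduce the MATLAB code of the GDTCB package.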
The GDTCB package has three main functions: extracting feature SNPs from the MSA file of the training allelic variants, building the DTC from the feature SNPs, and identifying the allelic variant of a target sequence by comparing its bases at the feature SNPs with the DTC from the root node down to the leaves. Before genotyping, a heterozygous sequence should be transformed into homozygous sequence pairs; a program in the package performs this transformation. The base types of the feature SNPs in the target sequence are obtained by pairwise alignment of the target sequence with the reference consensus sequence, which is generated by MSA of the HLA allelic variant database. The tree structure of the genotyping DTC is saved in MATLAB's MAT format, so it can be loaded and viewed directly in the MATLAB workspace. A practical drawing program for the genotyping decision tree structure was also developed; it depicts the decision tree in detail in DWG, AutoCAD's standard file format. The drawing process and the final drawing result can be found in the supplementary files S1, S2 and S3.
The EBI IMGT/HLA database (http://www.ebi.ac.uk/imgt/hla/) provides the latest HLA genotyping sequences. We downloaded the MSA files (with the filename extension ‘.msf’) of the highly polymorphic HLA class I genes HLA-A, HLA-B and HLA-C from the database (ftp://ftp.ebi.ac.uk/pub/databases/imgt/mhc/hla/, release 3.6.0). After the feature SNPs had been extracted from the MSA files, we trained the genotyping decision trees with the SNP data. The decision trees can then be used to genotype sequences. The decision trees and their parameters for the three genes are shown in Table 1, and the tree structure for HLA-C genotyping is shown in Figure 1B. For each typing, the DTC makes at most X comparisons (X = depth × K, where the node branch number K is between 2 and 5), so the typing efficiency is very high. In addition, hidden information contained in the relationships between alleles, and the SNP differences between alleles, can be mined and captured in the decision tree.
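The typing step can be sketched in Python as follows, under stated assumptions: the tree is a hand-written nested dictionary (not the MAT structure used by the package), the target has already been aligned to the reference so that its positions match the MSA columns, and heterozygous positions are written as two-base IUPAC codes. How the package itself splits heterozygous sequences is not described in the text, so `haplotype_pairs` is only one plausible approach, and all names are hypothetical. The comparison counter illustrates the bound X = depth × K.

```python
from itertools import product

# Two-base IUPAC ambiguity codes (assumed encoding of heterozygous positions).
IUPAC = {"R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC"}


def haplotype_pairs(seq):
    """Enumerate candidate homozygous sequence pairs for a heterozygous sequence."""
    het = [i for i, base in enumerate(seq) if base in IUPAC]
    if not het:
        return [(seq, seq)]
    pairs = []
    # Fix the orientation of the first heterozygous site to avoid mirrored pairs.
    for flips in product([0, 1], repeat=len(het) - 1):
        h1, h2 = list(seq), list(seq)
        for pos, flip in zip(het, (0,) + flips):
            a, b = IUPAC[seq[pos]]
            h1[pos], h2[pos] = (a, b) if flip == 0 else (b, a)
        pairs.append(("".join(h1), "".join(h2)))
    return pairs


def classify(tree, aligned_target):
    """Walk from the root to a leaf; return (allele names or None, comparisons)."""
    node, comparisons = tree, 0
    while isinstance(node, dict):
        base = aligned_target[node["column"]]
        matched = None
        for branch_base, child in node["branches"].items():
            comparisons += 1  # one base comparison per branch examined
            if branch_base == base:
                matched = child
                break
        if matched is None:
            # Base never seen at this SNP in training: possibly a new variant.
            return None, comparisons
        node = matched
    return node, comparisons


# Toy tree of depth 2 with K = 2 branches per node (made-up alleles).
toy_tree = {
    "column": 1, "level": 0,
    "branches": {
        "C": {"column": 4, "level": 1,
              "branches": {"A": ["allele_1"], "C": ["allele_2"]}},
        "T": {"column": 2, "level": 1,
              "branches": {"-": ["allele_4"], "G": ["allele_3"]}},
    },
}

pairs = haplotype_pairs("AYGTA")  # Y = C or T at position 1
typed = [(classify(toy_tree, h1)[0], classify(toy_tree, h2)[0])
         for h1, h2 in pairs]
```

In this toy case the walk never needs more than depth × K = 4 comparisons. A sequence with n heterozygous positions yields 2^(n-1) candidate pairs, which is one way to see why conventional sequencing of heterozygous samples can give ambiguous typing results.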
In conclusion, DTC makes genotyping more intuitive and more efficient. Genotyping based on DNA sequencing is undoubtedly the most accurate method, particularly for sequences rich in polymorphic sites. When it is used for frequent genotyping analysis, however, efficiency is hard to improve because the sequencing results are complex. In this study, sequences of known subtypes were used to extract the polymorphic sites and to train and construct the decision tree. As a result, only a small number of the most informative SNPs were included in the decision tree used for genotyping. Using this method, decision trees for the HLA-A, HLA-B and HLA-C genes were constructed; they are capable of high-resolution genotyping and can be used for the discovery of new genotypes. The method can reduce the experimental workload, because only the SNPs included in the decision tree need to be tested experimentally. It can also be applied to other genomic sequence research to reduce the number of genotyping experiments. Furthermore, it can assist other applications that require genotyping, such as organ transplantation, paternity testing and DNA evidence verification.
This study was supported by the National Natural Science Foundation of China (No. 81102085/H2401, 81071299/H1103) and the Scientific Research Foundation of Shaanxi Provincial Office of Health, P.R. China (No. 2010E06, 2010D21). The authors have declared no conflicting interests.