Abstract:

Background: Single nucleotide polymorphisms (SNPs) constitute more than 90% of genetic variation and hence can account for most trait differences among individuals of a given species. The polymorphism detection programs PolyBayes and PolyPhred produce many false-positive SNP predictions even with stringent parameter values. We developed a machine learning (ML) method that augments PolyBayes to improve its prediction accuracy. ML methods have also been applied successfully to other bioinformatics problems, including the prediction of genes, promoters, transcription factor binding sites and protein structures.

Results: The ML program C4.5 was applied to a set of features to build a SNP classifier from training data labeled True or False by human experts. The training data were 27,275 candidate SNPs generated by sequencing 1,973 sequence-tagged sites (STS; 12 Mb) in both directions from 6 diverse homozygous soybean cultivars, followed by PolyBayes analysis. Test data of 18,390 candidate SNPs were generated in the same way from 1,359 additional STS (8 Mb). SNPs from both sets were classified by experts. After training, the ML classifier agreed with the experts on 97.3% of the test data, compared with 7.8% agreement between PolyBayes and the experts. The PolyBayes positive predictive values (PPV, i.e., the fraction of candidate SNPs that are real) were 7.8% for all predictions and 16.7% for those with a 100% posterior probability of being real. Using ML improved the PPV to 84.8%, a 5- to 10-fold increase. While ML and PolyBayes produced a similar number of true positives, the ML program generated only 249 false positives compared with 16,955 for PolyBayes. The complexity of the soybean genome may have contributed to the high number of false SNP predictions by PolyBayes, so results may differ for other genomes.

Conclusion: A machine learning (ML) method was developed as a supplementary feature for polymorphism detection software to improve prediction accuracy. The results of this study indicate that a trained ML classifier can significantly reduce human intervention; in this case it achieved a 5- to 10-fold gain in productivity. The optimized feature set and ML framework can also be applied to all polymorphism discovery software. The ML support software is written in Perl and can be easily integrated into an existing SNP discovery pipeline.
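The PPV figures reported in the Results can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not part of the authors' Perl software. It assumes that all 18,390 test candidates count as positive PolyBayes calls, which is consistent with the 7.8% PPV quoted "for all predictions". The ML true-positive count is not stated in the abstract, so the value computed here is back-calculated from the reported PPV and false-positive count and should be read as an estimate. All function and variable names are hypothetical.

```python
def ppv(true_pos, false_pos):
    """Positive predictive value: fraction of positive calls that are real."""
    return true_pos / (true_pos + false_pos)


def implied_true_positives(ppv_value, false_pos):
    """Solve PPV = TP / (TP + FP) for TP, given PPV and FP."""
    return ppv_value * false_pos / (1.0 - ppv_value)


# Figures taken from the abstract.
test_candidates = 18_390
polybayes_fp = 16_955
ml_fp = 249
ml_ppv_reported = 0.848

# Assumption: every test candidate is a positive PolyBayes call,
# so true positives are simply the candidates that are not false positives.
polybayes_tp = test_candidates - polybayes_fp
polybayes_ppv = ppv(polybayes_tp, polybayes_fp)

# Estimate only: the abstract does not report the ML true-positive count.
ml_tp_estimate = implied_true_positives(ml_ppv_reported, ml_fp)

# Improvement relative to the PolyBayes PPV over all predictions.
fold_increase = ml_ppv_reported / polybayes_ppv

print(f"PolyBayes true positives: {polybayes_tp}")
print(f"PolyBayes PPV: {polybayes_ppv:.1%}")
print(f"Estimated ML true positives: {ml_tp_estimate:.0f}")
print(f"Fold increase in PPV: {fold_increase:.1f}")
```

Under these assumptions the PolyBayes PPV comes out at about 7.8%, matching the reported value, and the back-calculated ML true-positive count is close to the PolyBayes count, in line with the statement that both methods produced a similar number of true positives.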
