Abstract: Genotype to phenotype prediction is a central problem in biology and medicine. Machine learning is a natural tool to address this problem. However, a person’s genotype is usually represented by a few million single-nucleotide polymorphisms and most datasets only have a few thousand patients. Thus, this problem typically has many more predictors than the number of samples (patients), making it unsuitable for machine learning. The objective of this paper is to examine the efficacy of a compact genotype representation, which employs a limited number of predictors, in predicting a person’s phenotype through the application of machine learning. We characterized a person’s genotype using chromosome-scale length variation, a measure that is computed as the average value of reported log R ratios across a portion of a chromosome. We computed these numbers from data collected by the NIH All of Us program. We used the AutoML function (h2o.ai) in binary classification mode to identify the best models to differentiate between male/female, Black/white, white/Asian, and Black/Asian. We also used the AutoML function in regression mode to predict the height of people based on their age and genotype. Our results showed that we could effectively classify a person, using only information from chromosomes 1-22, as male/female (AUC = 0.9988 ± 0.0001), white/Black (AUC = 0.970 ± 0.002), Asian/white (AUC = 0.877 ± 0.002), and Black/Asian (AUC = 0.966 ± 0.002). This approach also effectively predicted height. In conclusion, we have shown that this compact representation of a person’s genotype, along with machine learning, can effectively predict a person’s phenotype.
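To make the definition of chromosome-scale length variation concrete, here is a minimal sketch of averaging log R ratios over portions of a chromosome. The function name, the input layout (tuples of chromosome, position, log R ratio), the equal-width split, and the number of parts per chromosome are all assumptions for illustration; the abstract only states that the measure is an average of reported log R ratios across a portion of a chromosome.

```python
from collections import defaultdict
from statistics import mean


def chromosome_scale_length_variation(probes, chrom_lengths, parts_per_chrom=4):
    """Average log R ratios within equal-width portions of each chromosome.

    probes: iterable of (chromosome, position, log_r_ratio) tuples (assumed layout).
    chrom_lengths: dict mapping chromosome name to its length in base pairs.
    Returns a dict {(chromosome, part_index): mean log R ratio}.
    """
    buckets = defaultdict(list)
    for chrom, pos, lrr in probes:
        width = chrom_lengths[chrom] / parts_per_chrom
        # Clamp so a probe at the very end of the chromosome lands in the last part.
        part = min(int(pos / width), parts_per_chrom - 1)
        buckets[(chrom, part)].append(lrr)
    return {key: mean(values) for key, values in sorted(buckets.items())}


# Toy data: two probes in each half of a hypothetical 1,000 bp chromosome.
probes = [
    ("chr1", 100, 0.25),
    ("chr1", 400, 0.75),
    ("chr1", 600, -0.5),
    ("chr1", 900, 0.0),
]
features = chromosome_scale_length_variation(probes, {"chr1": 1000}, parts_per_chrom=2)
print(features)
```

Under this assumed scheme, splitting each of chromosomes 1-22 into four parts would give 88 features per person, a few dozen predictors rather than millions of SNPs. Whether the authors split chromosomes in exactly this way is not stated here.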

What problem does this paper attempt to address?

The core problem this paper addresses is genotype-to-phenotype prediction, a central issue in biology and medicine. Specifically, the authors look for an effective way to use machine learning to predict phenotypic characteristics (such as sex, race, and height) from genotype data, that is, an individual's genetic information. The main background and objectives of the study are as follows.

### Research Background

1. **Challenges in Genotype-to-Phenotype Prediction**:
   - Traditional methods usually focus on functional changes in a single gene and its protein, which ignores the effect of combinations of multiple gene variants on the phenotype.
   - The number of genomic variables is huge while the number of samples is relatively small. This is the "large p, small n" problem: the number of predictor variables \( p \) is much larger than the number of samples \( n \), which makes it difficult to apply machine learning directly.
2. **Limitations of Existing Methods**:
   - Single-nucleotide polymorphisms (SNPs) are a commonly used data source, but each individual may have millions of SNPs, while a typical dataset contains only a few thousand samples.
   - This imbalance makes it difficult to train machine-learning models that generalize well.

### Research Objectives

1. **Propose a Compact Genotype Representation**:
   - The authors characterize the genotype with chromosome-scale length variation (CSLV), computed as the average log R ratio across a portion of a chromosome.
   - This reduces the genotype to a small number of predictor variables (e.g., 88 CSLV values), which addresses the "large p, small n" problem.
2. **Verify the Effectiveness of the Method**:
   - Genetic data from the NIH All of Us program are used in classification and regression tasks, with models selected by H2O AutoML, to evaluate how well the CSLV representation predicts sex, race, and height.
   - The experiments include binary classification tasks (male versus female, and pairs of racial groups) and a regression task (predicting height from age and genotype).

### Main Contributions

- **Efficient Prediction**: The study shows that the CSLV representation combined with machine-learning models can effectively predict phenotypes, performing especially well in sex classification (AUC = 0.9988 ± 0.0001) and race classification (AUC ranging from 0.877 to 0.970).
- **Height Prediction**: The approach also predicted height effectively, which suggests the method may apply to other traits.

In conclusion, this paper addresses a key obstacle in genotype-to-phenotype prediction by combining a compact genotype representation with machine learning, and demonstrates its potential value in the biomedical field.
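The classification results above are reported as AUC values. As a reminder of what that number measures, the following sketch computes AUC directly from its pairwise definition: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name and the toy labels and scores are illustrative assumptions. This is not the H2O AutoML pipeline used in the paper, only a plain-Python illustration of the metric.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: sequence of 0/1 class labels.
    scores: sequence of model scores, higher meaning "more likely positive".
    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical scores for six people; one positive (0.4) ranks below one negative (0.5).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
auc = roc_auc(labels, scores)
print(round(auc, 4))
```

In this toy example, eight of the nine positive-negative pairs are ordered correctly, so the AUC is 8/9. An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which puts the reported values of 0.877 to 0.9988 in context.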
