GP-ML-DC: An Ensemble Machine Learning-Based Genomic Prediction Approach with Automated Two-Phase Dimensionality Reduction via Divide-and-Conquer Techniques

Quanzhong Liu, Haofeng Ma, Zhuangbiao Zhang, Zhunhao Hu, Xihong Wang, Ran Li, Yudong Cai, Yu Jiang
DOI: https://doi.org/10.1101/2024.12.26.630443
2024-12-26
Abstract: Traditional machine learning (ML) and deep learning (DL) methods for genome prediction often face challenges due to the imbalance between the limited number of samples and the large number of single nucleotide polymorphisms (SNPs), where the number of samples is much smaller than the number of SNPs. To address this, we propose GP-ML-DC, a genome predictor that combines traditional ML and DL models with a two-phase, parameter-free dimensionality reduction technique. In the first phase, GP-ML-DC reduces feature dimensionality by characterizing genes as features. In the second phase, building on big data methodologies, it employs a divide-and-conquer approach that segments gene regions into multiple haplotypes, further decreasing dimensionality. Each haplotype segment is processed by a sub-task based on traditional ML, and a neural network then integrates the results of all sub-tasks. Our experiments, conducted on four cattle milk-related traits using ten-fold cross-validation and independent testing, show that GP-ML-DC significantly surpasses current state-of-the-art genome predictors in prediction performance.
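The divide-and-conquer idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not name the sub-task learners or the integrating network, so ridge regression stands in for both, and fixed-size contiguous column blocks stand in for haplotype segments. All function names and parameters (`split_segments`, `fit_divide_and_conquer`, `segment_size`, `alpha`) are hypothetical.

```python
import numpy as np

def split_segments(n_features, segment_size):
    """Divide step: contiguous column slices covering all features.
    (Assumption: fixed-size blocks stand in for haplotype segments.)"""
    return [slice(start, min(start + segment_size, n_features))
            for start in range(0, n_features, segment_size)]

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression with an intercept.
    Returns (weights, intercept)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    gram = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    weights = np.linalg.solve(gram, Xc.T @ (y - y_mean))
    return weights, y_mean - x_mean @ weights

def predict_ridge(model, X):
    weights, intercept = model
    return X @ weights + intercept

def sub_task_outputs(segments, sub_models, X):
    """One column of predictions per segment-level sub-task."""
    return np.column_stack([predict_ridge(m, X[:, s])
                            for m, s in zip(sub_models, segments)])

def fit_divide_and_conquer(X, y, segment_size=50, alpha=1.0):
    """Fit one small model per segment, then a combiner on their outputs.
    (Assumption: a ridge combiner replaces the paper's neural network.
    In practice the combiner should be trained on out-of-fold sub-task
    predictions to avoid leakage; in-sample outputs keep the sketch short.)"""
    segments = split_segments(X.shape[1], segment_size)
    sub_models = [fit_ridge(X[:, s], y, alpha) for s in segments]
    Z = sub_task_outputs(segments, sub_models, X)
    combiner = fit_ridge(Z, y, alpha)
    return segments, sub_models, combiner

def predict_divide_and_conquer(model, X):
    """Conquer step: run every sub-task, then integrate the outputs."""
    segments, sub_models, combiner = model
    return predict_ridge(combiner, sub_task_outputs(segments, sub_models, X))
```

Each sub-task sees only `segment_size` columns, so no single model has to handle the full SNP matrix, and the combiner works on one feature per segment rather than one per SNP.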
Bioinformatics