Abstract: The rapid advancement of high-throughput sequencing technologies has revolutionised genomic research by providing access to large amounts of genomic data. However, the most important disadvantage of Whole Genome Sequencing (WGS) data is statistical: the number of features far exceeds the number of samples, the so-called p>>n problem. This study aimed to compare three feature selection approaches that circumvent the p>>n problem, one of which is a novel modification of Supervised Rank Aggregation (SRA). The three methods were demonstrated by classifying 1,825 individuals from the 1000 Bull Genomes Project into 5 breeds, based on 11,915,233 SNP genotypes from WGS. In the first step, we applied three feature (i.e. SNP) selection methods: a mechanistic approach (SNP tagging) and two approaches that consider biological and statistical contexts by fitting a multiclass logistic regression model followed by either 1-dimensional clustering (1D-SRA) or multi-dimensional feature clustering (MD-SRA), the latter being proposed in this study. Next, we performed the classification with a Deep Learning architecture composed of Convolutional Neural Networks. Classification quality on the test data set was expressed by the macro F1-Score. The SNPs selected by SNP tagging yielded the least satisfactory results (86.87%). Still, this approach offered rapid computing times by focussing only on pairwise LD between SNPs and disregarding the effects of SNPs on classification. 1D-SRA was less suitable for ultra-high-dimensional applications due to computational, memory and storage limitations; however, the SNP set selected by this approach provided the best classification quality (96.81%). MD-SRA provided a very good balance between classification quality (95.12%) and computational efficiency (17x lower analysis time and 14x lower data storage), outperforming other methods.
Moreover, unlike SNP tagging, both SRA-based approaches are universal and not limited to feature selection for genomic data. Our work addresses the urgent need for computational techniques that are both effective and efficient in the analysis and interpretation of large-scale genomic datasets. We offer a model suitable for the classification of ultra-high-dimensional data that fuses feature selection and deep learning techniques.

What problem does this paper attempt to address?

The main problem that this paper attempts to solve is the statistical challenge posed by whole-genome sequencing (WGS) data from high-throughput sequencing technology, especially the so-called "\( p \gg n \)" problem. Here, \( p \) refers to the number of features (such as single-nucleotide polymorphisms, SNPs), and \( n \) refers to the number of samples. When \( p \) is much larger than \( n \), standard statistical methods cannot estimate model parameters effectively, which leads to inaccurate statistical and biological inferences and makes false-positive associations more likely. In addition, such high-dimensional data is difficult to store, process and analyse.

To address these challenges, the paper compares three feature selection methods, aiming to find one that effectively reduces the number of features while maintaining classification performance. These three methods are:

1. **SNP tagging based on linkage disequilibrium**: This method only reduces the correlation between SNPs and does not consider their biological context. TagSNPs representing specific genomic regions are selected with the PLINK software, and the selection is based on local linkage disequilibrium (LD).
2. **One-dimensional supervised rank aggregation (1D-SRA)**: This method combines biological and statistical contexts. It evaluates the importance of SNPs for classification by fitting a multi-class logistic regression model. A linear mixed model (LMM) is then used for rank aggregation, and finally SNPs are divided into two groups, relevant and irrelevant, by 1D K-means clustering.
3. **Multi-dimensional supervised rank aggregation (MD-SRA)**: This is a new method proposed in this paper, aiming to reduce the computational complexity of 1D-SRA. It performs multi-dimensional K-means clustering directly on multiple model performance matrices instead of using an LMM for rank aggregation.
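The difference between the two SRA variants can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the SNP names and scores are hypothetical, `two_means` is a helper written only for this sketch, and a plain mean stands in for the LMM-based rank aggregation.

```python
from math import dist
from statistics import fmean


def two_means(points, max_iter=100):
    """Split points (equal-length tuples) into two clusters with Lloyd's algorithm.

    Returns one boolean per point: True if the point falls in the cluster
    whose centroid has the larger coordinate sum (treated here as "relevant").
    """
    ordered = sorted(points, key=sum)
    centroids = [ordered[0], ordered[-1]]  # deterministic start, sketch only
    labels = None
    for _ in range(max_iter):
        new_labels = [
            0 if dist(p, centroids[0]) <= dist(p, centroids[1]) else 1
            for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = tuple(fmean(col) for col in zip(*members))
    high = 0 if sum(centroids[0]) > sum(centroids[1]) else 1
    return [lab == high for lab in labels]


# Hypothetical per-SNP importance scores from three reduced models.
snp_scores = {
    "snp1": (0.90, 0.80, 0.95),
    "snp2": (0.85, 0.90, 0.80),
    "snp3": (0.10, 0.05, 0.20),
    "snp4": (0.15, 0.10, 0.05),
    "snp5": (0.20, 0.10, 0.15),
}
names = list(snp_scores)

# 1D route: collapse each SNP's scores to one number, then cluster.
# The mean is an assumed stand-in for the LMM aggregation step.
aggregated = [(fmean(snp_scores[n]),) for n in names]
relevant_1d = {n for n, keep in zip(names, two_means(aggregated)) if keep}

# MD route: cluster the full score vectors directly, skipping aggregation.
vectors = [snp_scores[n] for n in names]
relevant_md = {n for n, keep in zip(names, two_means(vectors)) if keep}

print(sorted(relevant_1d), sorted(relevant_md))
```

With these toy values both routes select `snp1` and `snp2`. The point of the multi-dimensional route is that it skips the aggregation step, which is where the summary above locates the computational cost of 1D-SRA.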
The paper applies these three methods to 1,825 individuals from the 1000 Bull Genomes Project, genotyped at 11,915,233 SNPs, and uses a convolutional neural network (CNN) for multi-class classification to evaluate the performance of the different feature selection methods. The main evaluation metrics include the macro F1-Score and the area under the receiver operating characteristic curve (AUC). Through these methods, the paper aims to provide an effective solution for feature selection in ultra-high-dimensional data while maintaining computational efficiency and classification performance.
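For readers unfamiliar with the metric, the macro F1-Score is the unweighted mean of the per-class F1 scores, so each breed counts equally regardless of how many animals it contains. A minimal sketch follows; the function name `macro_f1` and the breed labels are hypothetical and chosen only for illustration.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == c and p == c for t, p in pairs)  # true positives
        fp = sum(t != c and p == c for t, p in pairs)  # false positives
        fn = sum(t == c and p != c for t, p in pairs)  # false negatives
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical labels for six animals across three breeds.
y_true = ["A", "A", "A", "B", "B", "C"]
y_pred = ["A", "A", "B", "B", "B", "C"]
score = macro_f1(y_true, y_pred)
print(round(score, 4))
```

Here breeds A and B each score 0.8 and breed C scores 1.0, so the macro average is 2.6 / 3, or about 0.8667.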
