Abstract: The rapid advancement of high-throughput sequencing technologies has revolutionised genomic research by providing access to large amounts of genomic data. However, the most important disadvantage of Whole Genome Sequencing (WGS) data is statistical: the number of features (p) vastly exceeds the number of samples (n), the so-called p>>n problem. This study aimed to compare three feature selection approaches that circumvent the p>>n problem, one of which is a novel modification of Supervised Rank Aggregation (SRA). The three methods were demonstrated in the classification of 1,825 individuals from the 1000 Bull Genomes Project into 5 breeds, based on 11,915,233 SNP genotypes from WGS. In the first step, we applied three feature (i.e. SNP) selection methods: a mechanistic approach (SNP tagging) and two approaches that consider biological and statistical contexts by fitting a multiclass logistic regression model followed by either 1-dimensional clustering (1D-SRA) or multi-dimensional feature clustering (MD-SRA), the latter originally proposed in this study. Next, we performed the classification using a Deep Learning architecture composed of Convolutional Neural Networks. Classification quality on the test data set was expressed by the macro F1-Score. The SNPs selected by SNP tagging yielded the least satisfactory results (86.87%). Still, this approach offered rapid computing times by focussing only on pairwise linkage disequilibrium (LD) between SNPs and disregarding the effects of SNPs on classification. 1D-SRA was less suitable for ultra-high-dimensional applications due to computational, memory and storage limitations; however, the SNP set selected by this approach provided the best classification quality (96.81%). MD-SRA provided a very good balance between classification quality (95.12%) and computational efficiency (17x lower analysis time and 14x lower data storage), and in this respect outperformed the other methods. Moreover, unlike SNP tagging, both SRA-based approaches are universal and not limited to feature selection for genomic data. Our work addresses the urgent need for computational techniques that are both effective and efficient in the analysis and interpretation of large-scale genomic datasets. We offer a model suitable for the classification of ultra-high-dimensional data that fuses feature selection with deep learning techniques.
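
For readers unfamiliar with the quality measure reported above, the following is a minimal sketch of how a macro F1-Score can be computed for a multiclass problem such as breed classification. It is an illustration only, not the authors' evaluation code: the function name `macro_f1`, the placeholder breed labels and the choice to score a class with no true or predicted members as 0 are all assumptions made here.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (hypothetical helper)."""
    # Classes are taken from the union of true and predicted labels.
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1      # correct prediction for class t
        else:
            fp[p] += 1      # p was predicted but is wrong
            fn[t] += 1      # t was the truth but was missed
    scores = []
    for c in labels:
        # F1 = 2*TP / (2*TP + FP + FN); assumed 0.0 when undefined.
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example with three placeholder breed labels.
y_true = ["A", "A", "B", "B", "C"]
y_pred = ["A", "B", "B", "B", "C"]
print(round(macro_f1(y_true, y_pred), 4))  # 0.8222
```

Because every class contributes equally to the average, a macro F1-Score does not let a well-classified large breed mask poor performance on a small one.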
