Abstract: The rapid advancement of high-throughput sequencing technologies has revolutionised genomic research by providing access to large amounts of genomic data. However, the most important disadvantage of Whole Genome Sequencing (WGS) data is statistical: the number of features (p) far exceeds the number of observations (n), the so-called p>>n problem. This study aimed to compare three feature selection approaches that circumvent the p>>n problem, one of which is a novel modification of Supervised Rank Aggregation (SRA). The three methods were demonstrated by classifying 1,825 individuals from the 1000 Bull Genomes Project into 5 breeds, based on 11,915,233 SNP genotypes from WGS. In the first step, we applied three feature (i.e. SNP) selection methods: a mechanistic approach (SNP tagging) and two approaches that consider biological and statistical contexts by fitting a multiclass logistic regression model followed by either 1-dimensional clustering (1D-SRA) or multi-dimensional feature clustering (MD-SRA), the latter being proposed in this study. Next, we performed the classification using a Deep Learning architecture composed of Convolutional Neural Networks. Classification quality on the test data set was expressed by the macro F1-Score. The SNPs selected by SNP tagging yielded the least satisfactory results (86.87%). Still, this approach offered rapid computing times by focussing only on pairwise linkage disequilibrium (LD) between SNPs and disregarding the effects of SNPs on classification. 1D-SRA was less suitable for ultra-high-dimensional applications due to computational, memory and storage limitations; however, the SNP set selected by this approach provided the best classification quality (96.81%). MD-SRA provided a very good balance between classification quality (95.12%) and computational efficiency (17x lower analysis time and 14x lower data storage), outperforming other methods. Moreover, unlike SNP tagging, both SRA-based approaches are universal and not limited to feature selection for genomic data. Our work addresses the urgent need for computational techniques that are both effective and efficient in the analysis and interpretation of large-scale genomic datasets. We offer a model suitable for the classification of ultra-high-dimensional data that fuses feature selection and deep learning techniques.
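
The abstract reports classification quality as a macro F1-Score but does not say how it was computed. As a minimal sketch, assuming the standard definition (the unweighted mean of per-class F1 scores, so each breed counts equally regardless of its sample size), the metric can be written in plain Python as follows. The function name `macro_f1` and the toy breed labels are illustrative assumptions, not taken from the study.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (standard macro averaging).

    Hypothetical helper for illustration; not the authors' code.
    """
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[pred] += 1    # predicted class gains a false positive
            fn[truth] += 1   # true class gains a false negative
    scores = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example with three made-up breed labels (not data from the study).
y_true = ["A", "A", "A", "B", "B", "C"]
y_pred = ["A", "A", "B", "B", "B", "C"]
score = macro_f1(y_true, y_pred)
```

Because every class contributes equally, a rare breed that is classified poorly lowers the macro F1-Score as much as a common one would, which is why it is often preferred over accuracy for imbalanced multiclass problems. In practice the same quantity is available as `sklearn.metrics.f1_score(y_true, y_pred, average="macro")`.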

Learning the optimal scale for GWAS through hierarchical SNP aggregation

Principles for the Post-GWAS Functional Characterisation of Risk Loci

Identifying Disease-Associated SNP Clusters Via Contiguous Outlier Detection

Large-scale Genotyping of Complex DNA

HAPRAP: a haplotype-based iterative method for statistical fine mapping using GWAS summary statistics

Identifying Genetic Risk Factors via Sparse Group Lasso with Group Graph Structure

Reaching the End-Game for GWAS: Machine Learning Approaches for the Prioritization of Complex Disease Loci

Genome-wide association testing beyond SNPs

On Combining Data From Genome-Wide Association Studies to Discover Disease-Associated SNPs

Approaches to dimensionality reduction for ultra-high dimensional models

In search of causal variants: refining disease association signals using cross-population contrasts

Maximal Conditional Chi-Square Importance in Random Forests

Alternative Methods for H1 Simulations in Genome Wide Association Studies

Small-group originating model: Optimized individual-level GWAS simulation featured by SLiM and using open-access data

Genome-wide association studies with high-dimensional phenotypes

Searching Genome-Wide Multi-Locus Associations for Multiple Diseases Based on Bayesian Inference

GPA: A statistical approach to prioritizing GWAS results by integrating pleiotropy information and annotation data

GPA: A Statistical Approach to Prioritizing GWAS Results by Integrating Pleiotropy and Annotation

Bayesian Hierarchical Hypothesis Testing in Large-Scale Genome-Wide Association Analysis

multi-GPA-Tree: Statistical Approach for Pleiotropy Informed and Functional Annotation Tree Guided Prioritization of GWAS Results

Hierarchical inference for genome-wide association studies: a view on methodology with software