Abstract: The rapid advancement of high-throughput sequencing technologies has revolutionised genomic research by providing access to large amounts of genomic data. However, the most important disadvantage of Whole Genome Sequencing (WGS) data is statistical: the so-called p>>n problem, in which the number of features (p) greatly exceeds the number of observations (n). This study aimed to compare three feature selection approaches that circumvent the p>>n problem, one of which is a novel modification of Supervised Rank Aggregation (SRA). The three methods were demonstrated by classifying 1,825 individuals from the 1000 Bull Genomes Project into 5 breeds, based on 11,915,233 SNP genotypes from WGS. In the first step, we applied three feature (i.e. SNP) selection methods: a mechanistic approach (SNP tagging) and two approaches that consider biological and statistical contexts by fitting a multiclass logistic regression model followed by either 1-dimensional clustering (1D-SRA) or multi-dimensional feature clustering (MD-SRA), the latter being proposed for the first time in this study. Next, we performed the classification using a Deep Learning architecture composed of Convolutional Neural Networks. Classification quality on the test data set was expressed by the macro F1-Score. The SNPs selected by SNP tagging yielded the least satisfactory results (86.87%). Still, this approach offered rapid computing times because it focuses only on pairwise linkage disequilibrium (LD) between SNPs and disregards the effects of SNPs on classification. 1D-SRA was less suitable for ultra-high-dimensional applications due to computational, memory and storage limitations; however, the SNP set selected by this approach provided the best classification quality (96.81%). MD-SRA provided a very good balance between classification quality (95.12%) and computational efficiency (17x lower analysis time and 14x lower data storage), outperforming the other methods in this respect. Moreover, unlike SNP tagging, both SRA-based approaches are universal and not limited to feature selection for genomic data. Our work addresses the urgent need for computational techniques that are both effective and efficient in the analysis and interpretation of large-scale genomic datasets. We offer a model suitable for the classification of ultra-high-dimensional data that fuses feature selection with deep learning techniques.
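The macro F1-Score used above to express classification quality is the unweighted mean of the per-class F1 scores, so each of the 5 breeds counts equally regardless of how many individuals it contains. The following is a minimal sketch of that metric only, not the study's evaluation code; the function name `macro_f1` and the toy labels are assumptions made for illustration.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (illustrative helper)."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[pred] += 1    # predicted class gains a false positive
            fn[truth] += 1   # true class gains a false negative
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy labels standing in for breed assignments (hypothetical data).
    y_true = ["A", "A", "B", "B", "C"]
    y_pred = ["A", "B", "B", "B", "C"]
    print(round(macro_f1(y_true, y_pred), 4))
```

Because every class contributes equally to the average, a method that misclassifies a small breed is penalised as heavily as one that misclassifies a large breed, which is presumably why a macro-averaged score suits a multi-breed comparison.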
