Approaches to dimensionality reduction for ultra-high dimensional models

Krzysztof Kotlarz, Dawid Slomian, Joanna Szyda
DOI: https://doi.org/10.1101/2024.08.20.608783
2024-08-20
Abstract: The rapid advancement of high-throughput sequencing technologies has revolutionised genomic research by providing access to large amounts of genomic data. However, the most important disadvantage of using Whole Genome Sequencing (WGS) data is statistical in nature: the so-called p>>n problem. This study aimed to compare three feature selection approaches that circumvent the p>>n problem, one of which is a novel modification of Supervised Rank Aggregation (SRA). The use of the three methods was demonstrated in the classification of 1,825 individuals from the 1000 Bull Genomes Project into 5 breeds, based on 11,915,233 SNP genotypes from WGS. In the first step, we applied three feature (i.e. SNP) selection methods: the mechanistic approach (SNP tagging) and two approaches that consider biological and statistical contexts by fitting a multiclass logistic regression model followed by either 1-dimensional clustering (1D-SRA) or multi-dimensional feature clustering (MD-SRA), the latter being proposed in this study. Next, we performed the classification with a Deep Learning architecture composed of Convolutional Neural Networks. Classification quality on the test data set was expressed by the macro F1-Score. The SNPs selected by SNP tagging yielded the least satisfactory results (86.87%). Still, this approach offered rapid computing times by focussing only on pairwise LD between SNPs and disregarding the effects of SNPs on classification. 1D-SRA was less suitable for ultra-high-dimensional applications due to computational, memory and storage limitations; however, the SNP set selected by this approach provided the best classification quality (96.81%). MD-SRA provided a very good balance between classification quality (95.12%) and computational efficiency (17x lower analysis time and 14x lower data storage), outperforming other methods.
Moreover, unlike SNP tagging, both SRA-based approaches are universal and not limited to feature selection for genomic data. Our work addresses the urgent need for computational techniques that are both effective and efficient in the analysis and interpretation of large-scale genomic datasets. We offer a model suitable for the classification of ultra-high-dimensional data that fuses feature selection with deep learning techniques.
Bioinformatics
What problem does this paper attempt to address?
The main problem this paper addresses is the statistical challenge posed by whole-genome sequencing (WGS) data from high-throughput sequencing technologies, in particular the so-called "\( p \gg n \)" problem. Here, \( p \) is the number of features (such as single-nucleotide polymorphisms, SNPs) and \( n \) is the number of samples. When \( p \) is much larger than \( n \), standard statistical methods struggle to estimate model parameters reliably, which leads to inaccurate statistical and biological inferences and makes false-positive associations more likely. Such high-dimensional data also poses substantial challenges for storage, processing and analysis. To address these challenges, the paper compares three feature selection methods, looking for one that reduces the number of features while maintaining classification performance:

1. **SNP tagging based on linkage disequilibrium**: A method that only reduces the correlation between SNPs and does not consider their biological context. Tag SNPs representing specific genomic regions are selected with the PLINK software, based on local linkage disequilibrium (LD).
2. **One-dimensional supervised rank aggregation (1D-SRA)**: A method that combines biological and statistical contexts. It evaluates the importance of SNPs for classification by fitting a multi-class logistic regression model. A linear mixed model (LMM) is then used for rank aggregation, and finally 1D K-means clustering divides the SNPs into two groups, relevant and irrelevant.
3. **Multi-dimensional supervised rank aggregation (MD-SRA)**: A new method proposed in this paper that aims to reduce the computational complexity of 1D-SRA. Instead of using an LMM for rank aggregation, it performs multi-dimensional K-means clustering directly on multiple model performance matrices.

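The clustering step shared by the two SRA variants can be illustrated with a minimal sketch. It is not the authors' implementation: it skips the logistic regression and LMM steps entirely, assumes two clusters in both the 1D and the multi-dimensional case, and uses made-up per-SNP scores. The function names `two_means` and `select_relevant` and the SNP identifiers are hypothetical.

```python
import math


def two_means(points, max_iter=100):
    """Plain Lloyd's K-means with k=2 on a list of equal-length tuples.

    Returns (labels, centroids). Initialisation is deterministic: the
    points with the smallest and largest coordinate sums are the seeds.
    """
    centroids = [list(min(points, key=sum)), list(max(points, key=sum))]
    labels = None
    for _ in range(max_iter):
        # Assign every point to its nearest centroid (Euclidean distance).
        new_labels = [
            0 if math.dist(p, centroids[0]) <= math.dist(p, centroids[1]) else 1
            for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        # Move each centroid to the mean of its members.
        for c in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def select_relevant(scores):
    """Keep the SNPs in the cluster whose centroid has the larger scores.

    `scores` maps a SNP id to a tuple of importance scores: length 1 for a
    1D-style split, length > 1 for a multi-dimensional split.
    """
    ids = list(scores)
    points = [tuple(scores[i]) for i in ids]
    labels, centroids = two_means(points)
    relevant = max((0, 1), key=lambda c: sum(centroids[c]))
    return [i for i, lab in zip(ids, labels) if lab == relevant]


# Hypothetical aggregated scores, one value per SNP (1D case).
one_d = {"snp1": (0.91,), "snp2": (0.05,), "snp3": (0.88,),
         "snp4": (0.10,), "snp5": (0.02,)}

# Hypothetical scores from three models, one column per model (MD case).
multi_d = {"snp1": (0.90, 0.80, 0.95), "snp2": (0.10, 0.05, 0.00),
           "snp3": (0.85, 0.90, 0.70), "snp4": (0.00, 0.10, 0.05),
           "snp5": (0.05, 0.00, 0.10)}

print(select_relevant(one_d))    # ['snp1', 'snp3']
print(select_relevant(multi_d))  # ['snp1', 'snp3']
```

The same routine handles both cases because only the length of the score tuple changes. In the multi-dimensional case the per-model scores are clustered as they are, which is the step that lets MD-SRA avoid the separate rank aggregation model.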
The paper applies these three methods to 1,825 individuals from the 1000 Bull Genomes Project, genotyped at 11,915,233 SNPs, and uses a convolutional neural network (CNN) for multi-class classification to evaluate the different feature selection methods. The main evaluation metrics are the macro F1-Score and the area under the receiver operating characteristic curve (AUC). In this way, the paper aims to provide an effective feature selection solution for ultra-high-dimensional data that preserves both computational efficiency and classification performance.
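For readers unfamiliar with the macro F1-Score, the sketch below shows how it is computed for a multi-class problem: an F1 score is calculated for each class separately and the per-class scores are averaged with equal weight. The function name `macro_f1`, the breed labels and the predictions are hypothetical and are not taken from the paper.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)  # true positives
        fp = sum(t != c and p == c for t, p in pairs)  # false positives
        fn = sum(t == c and p != c for t, p in pairs)  # false negatives
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical true and predicted breed labels for six animals.
y_true = ["A", "A", "A", "B", "B", "C"]
y_pred = ["A", "A", "B", "B", "B", "C"]

# Per-class F1: A = 0.8, B = 0.8, C = 1.0, so the macro average is about 0.867.
print(round(macro_f1(y_true, y_pred), 3))
```

Because every class contributes equally to the average, a small breed counts as much as a large one, which makes the metric a reasonable choice when class sizes differ.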