An Autoencoder and Generative Adversarial Networks Approach for Multi-Omics Data Imbalanced Class Handling and Classification

Ibrahim Al-Hurani, Abedalrhman Alkhateeb, Salama Ikki
2024-05-16
Abstract: In the ongoing effort to improve medical diagnostics, the integration of state-of-the-art machine learning methods has emerged as a promising research area. In molecular biology, multi-omics sequencing has produced an explosion of data. Advanced sequencing equipment can provide a large number of complex measurements per experiment, so traditional statistical methods struggle with such high-dimensional data. However, most of the information contained in these datasets is redundant or unrelated and can be reduced to significantly fewer variables without losing much information. Dimensionality reduction techniques are the mathematical procedures that allow for this reduction; they have largely been developed in the statistics and machine learning disciplines. The other challenge in medical datasets is an imbalanced number of samples across classes, which leads to biased results in machine learning models. This study tackles both challenges with a neural network that incorporates an autoencoder to extract a latent space of the features and a Generative Adversarial Network (GAN) to generate synthetic samples. The latent space is the reduced-dimensional space that captures the meaningful features of the original data. Our model starts with feature selection to choose the discriminative features before feeding them to the neural network. Then, the model predicts the cancer outcome for different datasets. The proposed model outperformed existing models, achieving an accuracy of 95.09% on the bladder cancer dataset and 88.82% on the breast cancer dataset.
Machine Learning, Neural and Evolutionary Computing, Genomics
What problem does this paper attempt to address?
The paper aims to address two major issues that multi-omics data poses for cancer prediction:

1. **Dimensionality Reduction of High-Dimensional Data**: With the development of next-generation sequencing technology, multi-omics data has grown explosively, making it difficult for traditional statistical methods to handle such high-dimensional data. The paper proposes using an autoencoder for feature extraction, transforming the raw data into a low-dimensional latent space that retains key information while discarding redundant features.
2. **Class Imbalance Problem**: Medical datasets often have an uneven class distribution, where majority-class samples greatly outnumber minority-class samples. This biases machine learning models towards the majority class and makes their results unreliable. To address this, the paper introduces Generative Adversarial Networks (GANs) to generate synthetic samples, increasing the number of minority-class samples and balancing the dataset.

Combining these methods, the authors built a model that handles multi-omics data and improves the accuracy of cancer classification. The model was validated on breast cancer (BRCA) and bladder cancer (BLCA) datasets, reaching 95.09% accuracy on BLCA and 88.82% on BRCA, higher than the existing models it was compared against.
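
The balancing step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It only shows the bookkeeping: count the samples per class, work out how many synthetic samples each class needs to match the largest class, and fill that deficit from a generator. The function names (`synthetic_deficit`, `jitter_generator`, `balance`) and the toy data are hypothetical. The generator here is a simple placeholder that adds Gaussian noise to real samples; in the paper's setting a trained GAN generator would take its place.

```python
import random
from collections import Counter
from typing import Callable, Dict, List, Sequence, Tuple

Sample = List[float]


def synthetic_deficit(labels: Sequence[str]) -> Dict[str, int]:
    """Synthetic samples each class needs to match the largest class."""
    counts = Counter(labels)
    target = max(counts.values())
    return {cls: target - n for cls, n in counts.items()}


def jitter_generator(real: List[Sample], rng: random.Random,
                     scale: float = 0.05) -> Sample:
    """Placeholder generator (assumption): a random real sample plus
    Gaussian noise. A trained GAN generator would be substituted here."""
    base = rng.choice(real)
    return [x + rng.gauss(0.0, scale) for x in base]


def balance(features: Sequence[Sample], labels: Sequence[str],
            generate: Callable[[List[Sample], random.Random], Sample] = jitter_generator,
            seed: int = 0) -> Tuple[List[Sample], List[str]]:
    """Return the original data followed by enough synthetic samples
    to give every class the same number of samples."""
    rng = random.Random(seed)
    out_x = [list(row) for row in features]
    out_y = list(labels)
    for cls, missing in synthetic_deficit(labels).items():
        # Real samples of this class are the generator's reference set.
        real = [list(x) for x, y in zip(features, labels) if y == cls]
        for _ in range(missing):
            out_x.append(generate(real, rng))
            out_y.append(cls)
    return out_x, out_y


if __name__ == "__main__":
    # Toy two-feature data: three majority samples, one minority sample.
    X = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.25], [0.9, 0.8]]
    y = ["majority", "majority", "majority", "minority"]
    Xb, yb = balance(X, y)
    print(Counter(yb))
```

Passing the generator as a callable keeps the balancing logic independent of how the synthetic samples are produced, so the noise-based placeholder can be swapped for a learned generator without changing `balance`.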