Abstract:

Background: Gene expression analysis is one of the most promising approaches for understanding and uncovering the mechanisms underlying the development and spread of cancer. Next Generation Sequencing technologies, such as RNA-Seq, currently lead the market due to their precision and cost. Nevertheless, there is still an enormous amount of unanalyzed data obtained from older technologies, such as microarrays, from which relevant knowledge could still be extracted.

Methods: This research describes and implements a complete machine learning methodology to cross-evaluate the compatibility between the RNA-Seq and microarray transcriptomic technologies. To show a real application of the designed pipeline, a lung cancer case study is addressed by considering two subtypes: adenocarcinoma and squamous cell carcinoma. The transcriptomic datasets considered in our study were obtained from the public repositories NCBI/GEO, ArrayExpress and GDC-Portal. From them, several gene expression experiments were carried out with the aim of finding gene signatures for these lung cancer subtypes, linked to both transcriptomic technologies. With the selected differentially expressed genes (DEGs), predictive models capable of classifying new samples belonging to these cancer subtypes were developed.

Results: The predictive models built using one technology are capable of classifying samples obtained with the other technology. The classification results are evaluated in terms of accuracy, F1-score and ROC curves along with AUC. Finally, the biological information of the gene sets obtained and their relationship with lung cancer are reviewed, finding strong biological evidence linking them to the disease.

Conclusion: Our method is capable of finding strong gene signatures that are also independent of the transcriptomic technology used to carry out the analysis. In addition, our article highlights the potential of using heterogeneous transcriptomic data to increase the number of samples available for such studies, thereby increasing the statistical significance of the results.
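
The cross-technology evaluation described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the data are synthetic, the gene ranking by difference of class means is a crude stand-in for a proper DEG analysis, the nearest-centroid classifier is an assumed choice, and every function name below is hypothetical. The sketch selects genes and trains on one simulated platform, then scores accuracy and F1 on a second platform with a different scale and offset. ROC/AUC is omitted for brevity.

```python
import random
from statistics import mean, pstdev

def make_cohort(n_per_class, n_genes, informative, shift, scale, offset, rng):
    """Simulate expression rows; scale/offset mimic a platform-specific bias."""
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = []
            for g in range(n_genes):
                signal = shift if (label == 1 and g in informative) else 0.0
                row.append(offset + scale * (rng.gauss(0.0, 1.0) + signal))
            X.append(row)
            y.append(label)
    return X, y

def standardize(X):
    """Z-score each gene within one cohort so both platforms share a scale."""
    cols = list(zip(*X))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in X]

def class_means(X, y, label):
    rows = [r for r, l in zip(X, y) if l == label]
    return [mean(c) for c in zip(*rows)]

def select_genes(X, y, k):
    """Rank genes by absolute difference of class means (crude DEG stand-in)."""
    m0, m1 = class_means(X, y, 0), class_means(X, y, 1)
    ranked = sorted(range(len(m0)), key=lambda g: abs(m1[g] - m0[g]), reverse=True)
    return ranked[:k]

def fit_centroids(X, y, genes):
    """Store one centroid per class, restricted to the selected genes."""
    centroids = {}
    for label in (0, 1):
        m = class_means(X, y, label)
        centroids[label] = [m[g] for g in genes]
    return centroids

def predict(centroids, X, genes):
    """Assign each sample to the class with the nearest centroid."""
    preds = []
    for row in X:
        v = [row[g] for g in genes]
        dist = {l: sum((a - b) ** 2 for a, b in zip(v, c))
                for l, c in centroids.items()}
        preds.append(min(dist, key=dist.get))
    return preds

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score(y_true, y_pred, positive=1):
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

rng = random.Random(0)
informative = set(range(10))  # assumption: 10 of 200 genes separate the subtypes

# Platform A (training) and platform B (testing) differ in scale and offset.
X_a, y_a = make_cohort(40, 200, informative, 2.0, 1.0, 0.0, rng)
X_b, y_b = make_cohort(40, 200, informative, 2.0, 0.5, 8.0, rng)
X_a, X_b = standardize(X_a), standardize(X_b)

genes = select_genes(X_a, y_a, k=10)        # signature found on platform A only
centroids = fit_centroids(X_a, y_a, genes)  # model trained on platform A only
y_pred = predict(centroids, X_b, genes)     # evaluated on platform B

acc = accuracy(y_b, y_pred)
f1 = f1_score(y_b, y_pred)
print(f"cross-platform accuracy={acc:.3f}, F1={f1:.3f}")
```

Standardizing each cohort separately is one simple way to make the two simulated platforms comparable. It assumes similar class proportions in both cohorts, which holds here by construction but would need checking on real data.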

A Comparative Analysis of Gene Expression Profiling by Statistical and Machine Learning Approaches

Studying Limits of Explainability by Integrated Gradients for Gene Expression Models

Deep-Learning-Based Cancer Profiles Classification Using Gene Expression Data Profile

Feature (gene) Selection in Gene Expression-Based Tumor Classification

Heterogeneous Gene Expression Cross-Evaluation of Robust Biomarkers Using Machine Learning Techniques Applied to Lung Cancer

Analysis of Gene Expression Profiles of Lung Cancer Subtypes with Machine Learning Algorithms.

Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data

Classification of human cancer diseases by gene expression profiles

Cancer prediction with gene expression profiling and differential evolution

Machine Learning Methods for Cancer Classification Using Gene Expression Data: A Review

Tumor-Specific Gene Expression Patterns With Gene Expression Profiles

Exploring Prognostic Gene Factors in Breast Cancer via Machine Learning

Identification of Gene Expression in Different Stages of Breast Cancer with Machine Learning

Explainable Machine Learning Models Using Robust Cancer Biomarkers Identification from Paired Differential Gene Expression

Tumor-specific Gene Expression Patterns with Gene Expression Profiles

Towards precise classification of cancers based on robust gene functional expression profiles

Comparative Study of Cancer Classification by Analysis of RNA-seq Gene Expression Levels

Subtype Dependent Biomarker Identification and Tumor Classification from Gene Expression Profiles.

Signature Genes Selection and Functional Analysis of Phenotypes: A Comparative Study

Advancing regulatory genomics with machine learning

Identification of Differentially Expressed Genes Between Original Breast Cancer and Xenograft Using Machine Learning Algorithms