MOCAT: multi-omics integration with auxiliary classifiers enhanced autoencoder

Xiaohui Yao, Xiaohan Jiang, Haoran Luo, Hong Liang, Xiufen Ye, Yanhui Wei, Shan Cong
DOI: https://doi.org/10.1186/s13040-024-00360-6
2024-03-07
BioData Mining
Abstract: Integrating multi-omics data is emerging as a critical approach in enhancing our understanding of complex diseases. Innovative computational methods capable of managing high-dimensional and heterogeneous datasets are required to unlock the full potential of such rich and diverse data.
mathematical & computational biology
What problem does this paper attempt to address?
This paper addresses the challenges of multi-omics data integration and complex disease classification. Specifically, the authors aim to overcome the limitations of existing methods in handling high-dimensional, heterogeneous multi-omics data by developing a new computational framework, thereby improving the understanding of complex diseases, classification accuracy, and biomarker discovery.

### Specific background of the problem

In recent years, the development of omics technologies has enabled large-scale data acquisition at multiple biological levels (such as genomics, transcriptomics, proteomics, and metabolomics). Each type of omics data captures a different level of biological information, so data integration has become an effective tool in multi-omics research. It helps to build a comprehensive picture of complex biological phenomena and strengthens the analysis of disease mechanisms and the identification of disease biomarkers.

However, the main challenges in multi-omics data analysis include:

1. **High dimensionality and heterogeneity**: Multi-omics data are high-dimensional and heterogeneous, which complicates both data integration and downstream tasks.
2. **Complexity of feature extraction**: Traditional statistical methods often require manual intervention in feature extraction and make it difficult to explain the importance of global features.
3. **Model overfitting**: High-dimensional data are prone to cause model overfitting, which reduces prediction accuracy.
4. **Classifier confidence calibration**: The traditional maximum class probability (MCP) criterion tends to be over-confident in wrong predictions when used to estimate prediction confidence.
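The over-confidence issue with MCP can be made concrete with a small numeric sketch. The logits and the true label below are hypothetical values chosen for illustration, not data from the paper. The snippet compares MCP with the true class probability (TCP), i.e. the probability the classifier assigns to the correct class, for a single misclassified sample.

```python
import math


def softmax(logits):
    """Convert raw scores into class probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mcp(probs):
    """Maximum class probability: confidence = probability of the predicted class."""
    return max(probs)


def tcp(probs, true_label):
    """True class probability: confidence = probability of the correct class."""
    return probs[true_label]


# Hypothetical logits for one sample with three classes; the true class is 2.
logits = [4.0, 0.5, 1.0]
true_label = 2

probs = softmax(logits)
predicted = probs.index(max(probs))
mcp_score = mcp(probs)
tcp_score = tcp(probs, true_label)

print(f"predicted={predicted}, true={true_label}")
print(f"MCP={mcp_score:.3f}  TCP={tcp_score:.3f}")
```

Here the prediction is wrong, yet MCP is high while TCP is low, so TCP would flag the sample as unreliable. Computing TCP directly requires the true label, which is unknown at prediction time; approaches built on TCP generally train a model to estimate it, and this summary does not describe how MOCAT does so.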
### The method proposed in the paper

To solve the above problems, the authors propose a framework named MOCAT (Multi-Omics integration with auxiliary Classifiers-enhanced AuToencoders), which contains the following key components:

1. **Omics-specific feature extraction**: Use autoencoder networks to reduce the dimensionality of the data and extract representative features, with auxiliary classifiers enhancing the feature representation.
2. **Cross-omics fusion and trustworthy prediction**: Use an attention mechanism to fuse data from different omics, and introduce a credibility evaluation mechanism to improve the confidence calibration of the classifiers.
3. **Biomarker identification**: Identify important biomarkers through feature ablation analysis and evaluate their interactions across different omics.

### Main contributions

- **Omics-specific feature optimization**: Introduce auxiliary classifiers specific to each type of omics data to identify the biomarkers most relevant to disease states, improving the quality of the feature representation.
- **Enhanced classifier confidence calibration**: Adopt the true class probability (TCP) criterion to adjust the confidence of the classifiers, avoid overfitting, and improve prediction accuracy.
- **Explainability**: Provide transparency in the model's decision-making process through an integrated explanation mechanism, supporting biomarker discovery and understanding.
- **State-of-the-art (SOTA) performance**: Evaluate the framework on four independent datasets, showing that it outperforms existing methods in disease classification tasks.

In summary, this paper applies deep learning methods to the problems of multi-omics data integration, with the goal of improving the accuracy and reliability of disease classification and biomarker discovery.
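The feature ablation analysis mentioned under biomarker identification can be sketched generically as follows. This is a minimal illustration, not MOCAT's implementation: the toy classifier, the data, and the choice to ablate a feature by replacing it with zero are all assumptions, since the summary does not say how features are ablated or how the impact is measured. Each feature is scored by the drop in accuracy when it is removed.

```python
def accuracy(model, X, y):
    """Fraction of samples for which the model predicts the correct label."""
    correct = sum(1 for row, label in zip(X, y) if model(row) == label)
    return correct / len(y)


def ablation_importance(model, X, y):
    """Score each feature by the accuracy drop when it is zeroed out.

    Zero replacement is an assumed ablation strategy for this sketch.
    """
    baseline = accuracy(model, X, y)
    drops = []
    for j in range(len(X[0])):
        X_ablated = [row[:j] + [0.0] + row[j + 1:] for row in X]
        drops.append(baseline - accuracy(model, X_ablated, y))
    return drops


def toy_model(row):
    """Hypothetical classifier: relies mostly on feature 0, ignores feature 2."""
    return 1 if 2.0 * row[0] + 0.1 * row[1] > 1.0 else 0


# Hypothetical samples (three features each) and labels.
X = [[1.0, 0.2, 5.0], [0.9, 0.1, 3.0], [0.1, 0.3, 4.0], [0.0, 0.5, 1.0]]
y = [1, 1, 0, 0]

drops = ablation_importance(toy_model, X, y)
ranking = sorted(range(len(drops)), key=lambda j: drops[j], reverse=True)

print("accuracy drop per feature:", drops)
print("features ranked by importance:", ranking)
```

Features whose removal causes the largest drop are the strongest biomarker candidates. In practice the same loop would run against a trained network and a held-out set rather than a hand-written rule.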