MOSAIC: A Pipeline for MicrobiOme Studies Analytical Integration and Correction

Chenlian Fu, Jiuyao Lu, Ni Zhao, Wodan Ling
DOI: https://doi.org/10.1101/2024.11.07.622561
2024-11-11
Abstract: Large-scale and consortium microbiome studies have enabled identification of reliable population-level biomedical signals, wherein integration is essential to eliminate unwanted variation between batches or studies and retain biological signals. Many strategies, each with distinct advantages and limitations, have been adapted or developed for microbiome data. The optimal strategy for a given study needs to be determined on a data-specific, case-by-case basis. Here, we develop the first-of-its-kind MicrobiOme Studies Analytical Integration and Correction (MOSAIC) pipeline to enable a convenient, fair, and comprehensive comparison of integration strategies. It includes modules for pre-processing, integration, and evaluation of artifact removal and signal preservation, using metrics relevant to common microbiome analyses, including alpha and beta diversities, disease prediction, and differential abundance analysis. We applied MOSAIC to extensive real-world and simulated data and found that although no single strategy excels in all aspects, certain strategies, namely the ComBat and ConQuR families, perform better overall.
Bioinformatics
What problem does this paper attempt to address?
This paper addresses how to integrate data from different batches or studies in large-scale and consortium microbiome research so that batch effects are removed while biological signals are retained. When large microbiome data sets are pooled, differences in sample collection and processing, study design, experimental protocols, and technologies introduce batch effects or between-study heterogeneity. This unwanted variation can mask true biological signals and lead to false findings, so choosing an appropriate integration strategy is crucial for improving statistical power and robustness.

To address this, the authors developed MOSAIC (MicrobiOme Studies Analytical Integration and Correction), which they describe as the first pipeline of its kind for comparing integration and correction strategies on microbiome data. MOSAIC provides a convenient, fair, and comprehensive framework for evaluating integration strategies, helping researchers select the method best suited to their specific data and analysis goals. It comprises three modules: pre-processing, integration, and evaluation. The evaluation module assesses the integrated data on both artifact removal and signal preservation, using metrics that include α-diversity, β-diversity, disease prediction, and differential abundance analysis. Applying MOSAIC to a wide range of real-world and simulated data, the authors found that although no single method excels on every evaluation criterion, some strategies, such as the ComBat and ConQuR families, generally perform better.
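To make the diversity-based evaluation more concrete, the following is a minimal Python sketch of two standard quantities named above: Shannon α-diversity and Bray-Curtis β-diversity. It is not MOSAIC's code or API. The function names and the `alpha_batch_gap` summary (the spread of per-batch mean Shannon diversity, where a smaller value would suggest less batch-driven variation) are hypothetical, introduced here only for illustration.

```python
import math
from statistics import mean


def relative_abundance(counts):
    """Convert raw taxon counts for one sample into proportions."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample has zero total count")
    return [c / total for c in counts]


def shannon(counts):
    """Shannon alpha diversity (natural log) of one sample."""
    return -sum(p * math.log(p) for p in relative_abundance(counts) if p > 0)


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two samples, on relative abundances."""
    pa, pb = relative_abundance(a), relative_abundance(b)
    return sum(abs(x - y) for x, y in zip(pa, pb)) / (sum(pa) + sum(pb))


def alpha_batch_gap(samples, batches):
    """Hypothetical summary (not from the paper): the difference between
    the largest and smallest per-batch mean Shannon diversity."""
    by_batch = {}
    for counts, batch in zip(samples, batches):
        by_batch.setdefault(batch, []).append(shannon(counts))
    means = [mean(values) for values in by_batch.values()]
    return max(means) - min(means)


# Toy data (invented for illustration): four samples over three taxa.
samples = [[10, 10, 10], [12, 9, 9], [28, 1, 1], [25, 3, 2]]
batches = ["A", "A", "B", "B"]
gap_before = alpha_batch_gap(samples, batches)
```

In a comparison of integration strategies, one would compute such a summary on the data before and after each correction method and check whether the batch gap shrinks while differences tied to the biological variable of interest remain. The metrics MOSAIC actually reports may be defined differently.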