Enabling pan-repository reanalysis for big data science of public metabolomics data

Yasin El Abiead, Michael Strobel, Thomas Payne, Eoin Fahy, Claire O’Donovan, Shankar Subramaniam, Juan Antonio Vizcaino, Simone Zuffa, Shipei Xing, Helena Mannochio-Russo, Ipsita Mohanty, Haoqi Nina Zhao, Andres Mauricio Caraballo-Rodriguez, Paulo Wender Portal Gomes, Nicole Elizabeth Avalon, Pieter C Dorrestein, Mingxun Wang
DOI: https://doi.org/10.26434/chemrxiv-2024-jt46s
2024-04-16
Abstract: Public untargeted metabolomics data is a growing resource for metabolite and phenotype discovery; however, accessing and utilizing these data across repositories poses significant challenges. Therefore, we've developed pan-repository universal identifiers and harmonized cross-repository metadata. This novel ecosystem facilitates discovery by integrating diverse data sources from public repositories including MetaboLights, Metabolomics Workbench, and GNPS/MassIVE. Our approach simplifies data handling and unlocks previously inaccessible reanalysis workflows, fostering unmatched research opportunities.
Chemistry
What problem does this paper attempt to address?
The paper addresses the challenges of reanalyzing public metabolomics data across repositories. The authors developed a system called Pan-ReDU to tackle three key issues:

1. **Cross-repository data integration**: As public metabolomics data grows, accessing and using data dispersed across different repositories has become difficult. Repositories such as MetaboLights (MTBLS), the National Metabolomics Data Repository (NMDR) of Metabolomics Workbench, and GNPS/MassIVE hold vast amounts of metabolomics data, but these datasets are not effectively integrated with one another.
2. **Metadata standardization and interoperability**: The repositories adopt different metadata standards, which makes it difficult to search for relevant raw data across them. For example, MTBLS uses the standardized ISA model, NMDR follows the standards recommended by the Metabolomics Standards Initiative, and GNPS allows users to submit custom metadata.
3. **Simplification of data processing workflows**: For each project, collecting and standardizing metadata, indexing file paths, converting data formats, and integrating multiple data processing pipelines require significant effort and expertise.

To overcome these challenges, the authors developed the Pan-ReDU system, which comprises a set of Python tools, reusable Nextflow workflows, application programming interfaces (APIs), and a user interface for data integration and reanalysis across the three major repositories: MTBLS, NMDR, and GNPS/MassIVE. By using a unified metadata format and a standard identifier mechanism, the MS Run Identifier (MRI), Pan-ReDU facilitates the discovery, access, interoperability, and reuse of data, enabling effective use of large amounts of public metabolomics data.
Additionally, Pan-ReDU enhances the integration of data analysis tools, further lowering the barrier for researchers to utilize these data for their studies.
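
To make the idea of harmonized metadata and a shared run identifier concrete, the following is a minimal Python sketch, not Pan-ReDU's actual code. All field names in `FIELD_MAP`, the `HarmonizedRecord` class, the accessions, and the identifier layout in `make_run_id` (a prefix, dataset accession, and file path joined by colons) are assumptions made for illustration; the real repository schemas and the exact MRI syntax may differ.

```python
from dataclasses import dataclass

# Hypothetical mapping from each repository's own field names to a shared
# schema. The real MTBLS, NMDR, and GNPS/MassIVE metadata fields differ.
FIELD_MAP = {
    "MTBLS": {"organism": "organism", "file": "raw_file_name"},
    "NMDR": {"organism": "species", "file": "file"},
    "GNPS": {"organism": "taxonomy", "file": "filename"},
}


@dataclass(frozen=True)
class HarmonizedRecord:
    """One raw data file described in a single, repository-independent schema."""
    repository: str
    accession: str
    file_path: str
    organism: str
    run_id: str


def make_run_id(accession: str, file_path: str) -> str:
    # Assumed identifier layout for illustration only.
    return f"mzspec:{accession}:{file_path}"


def harmonize(repository: str, accession: str, raw: dict) -> HarmonizedRecord:
    """Translate one repository-specific metadata row into the shared schema."""
    fields = FIELD_MAP[repository]
    file_path = raw[fields["file"]]
    # Normalize free-text values so the same organism matches across repositories.
    organism = str(raw.get(fields["organism"], "")).strip().lower() or "missing value"
    return HarmonizedRecord(
        repository, accession, file_path, organism, make_run_id(accession, file_path)
    )


def search(records: list, organism: str) -> list:
    """Return all harmonized records for an organism, regardless of repository."""
    wanted = organism.strip().lower()
    return [r for r in records if r.organism == wanted]


# Placeholder accessions and file names, not real datasets.
records = [
    harmonize("MTBLS", "MTBLS_EXAMPLE", {"organism": "Homo sapiens", "raw_file_name": "a.mzML"}),
    harmonize("NMDR", "ST_EXAMPLE", {"species": " homo sapiens ", "file": "b.mzML"}),
    harmonize("GNPS", "MSV_EXAMPLE", {"taxonomy": "Mus musculus", "filename": "c.mzML"}),
]

for record in search(records, "Homo sapiens"):
    print(record.run_id)
```

Once every row carries the same field names and a repository-independent identifier, a single query can select files from all three sources, which is the kind of cross-repository search the summary describes.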