Application of the RDF framework to integrate heterogenous experimental data of a large chemo- and biodiverse collection from a research collaborative project

Frédéric Burdet, Luis-Manuel Quiros-Guerrero, Pierre-Marie Allard, Louis-Felix Nothias, Olivier Kirchhoffer, Arnaud Gaudry, Sebastien Moretti, Robin Engler, Emerson Ferreira Queiroz, Jahn Nitschke, Nabil Hanna, Chunyan Wu, Antonio Grondin, Bruno David, Thierry Soldati, Christian Wolfrum, Erick Carreira, Jean-Luc Wolfender, Marco Pagni, Florence Mehl
DOI: https://doi.org/10.26434/chemrxiv-2023-4hlgd
2023-11-28
Abstract: Plants have a complex chemo-diversity and represent a reservoir of potential new therapeutic agents. Within a Swiss research project, six scientific research groups from different disciplines are collaborating to investigate a collection of more than 17’000 unique dried plant extracts. The project aims to find new bioactive molecules and their modes of action, for example with anti-infective or pro-metabolic activities. One of the main challenges of this enterprise is the management, integration and sharing of the highly heterogeneous data produced by the different research groups. These include (i) massive high-resolution mass spectrometry data, (ii) the numerical results of innovative chemo-informatics methods, (iii) bioassay results from experimental models of tuberculosis and obesity, and (iv) data from organic synthetic chemistry. Additionally, the requirements of a data management plan and of open-source science under the FAIR principles must be met. We have established an agile pipeline to capture and structure these heterogeneous data into an RDF graph. The gradual expansion and evolution of the data content throughout the project presented considerable challenges, particularly in terms of data modeling. Although many collaborators were not RDF experts, most were technically adept enough to produce the RDF triples relevant to their contributions. We have deployed multiple instances of a triplestore and developed a custom in-house tool, KGSteward, to synchronize their content based on a configuration file that is centrally managed and version-controlled using Git. This strategy gave us the flexibility required to address the project's common data management challenges effectively.
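As a rough illustration of what "producing RDF triples" can look like for a collaborator who is not an RDF expert, the following minimal sketch serializes one plant-extract record as N-Triples using only the Python standard library. The namespaces under example.org, the class and property names (PlantExtract, fromSpecies, inhibitionPercent), and the function names are all hypothetical; the abstract does not describe the project's actual vocabulary or tooling, and the sample values are placeholders, not project data.

```python
# Minimal sketch: turn one tabular record into N-Triples lines.
# All IRIs under example.org are hypothetical placeholders.
EX = "https://example.org/extract/"
VOC = "https://example.org/vocab#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD = "http://www.w3.org/2001/XMLSchema#"


def iri(value):
    """Wrap an IRI string in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(value, datatype=None):
    """Build an N-Triples literal, escaping backslashes, quotes and newlines."""
    escaped = (
        str(value)
        .replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )
    suffix = f"^^<{datatype}>" if datatype else ""
    return f'"{escaped}"{suffix}'


def extract_triples(extract_id, species, inhibition_percent):
    """Describe one extract with a type, a species label and one assay value."""
    subject = iri(EX + extract_id)
    return [
        (subject, iri(RDF_TYPE), iri(VOC + "PlantExtract")),
        (subject, iri(VOC + "fromSpecies"), literal(species)),
        (
            subject,
            iri(VOC + "inhibitionPercent"),
            literal(inhibition_percent, XSD + "decimal"),
        ),
    ]


def to_ntriples(triples):
    """Serialize (subject, predicate, object) tuples, one statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(to_ntriples(extract_triples("E0001", "Example species", 42.5)))
```

A file produced this way could be loaded into any triplestore that accepts N-Triples. In practice a dedicated RDF library would handle escaping and validation more thoroughly than this sketch does.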
Chemistry