Methods for Linking Data to Online Resources and Ontologies with Applications to Neurophysiology

Matthew Avaylon, Ryan Ly, Andrew Tritt, Benjamin Dichter, Kristofer E. Bouchard, Christopher J. Mungall, Oliver Ruebel
2024-05-30
Abstract: Across many domains, large swaths of digital assets are being stored across distributed data repositories, e.g., the DANDI Archive [8]. The distribution and diversity of these repositories impede researchers from formally defining terminology within experiments, integrating information across datasets, and easily querying, reusing, and analyzing data that follow the FAIR principles [15]. As such, it has become increasingly important to have a standardized method to attach contextual metadata to datasets. Neuroscience is an exemplary use case of this issue due to the complex multimodal nature of experiments. Here, we present the HDMF External Resources Data (HERD) standard and related tools, enabling researchers to annotate new and existing datasets by mapping external references to the data without requiring modification of the original dataset. We integrated HERD closely with Neurodata Without Borders (NWB) [2], a widely used data standard for sharing and storing neurophysiology data. By integrating with NWB, our tools provide neuroscientists with the capability to more easily create and manage neurophysiology data in compliance with controlled sets of terms, enhancing rigor and accuracy of data and facilitating data reuse.
Databases
What problem does this paper attempt to address?
The main problem this paper addresses is: **how to link, manage, and query contextual metadata for neurophysiology data so that the data remain reusable, accurate, and consistent, especially when they are spread across multiple distributed repositories**. Specifically, the paper targets three problems:

1. **Non-uniform term definitions**: Different laboratories and researchers may use different terms for the same concept. For example, a species may be recorded as "human" or as "homo sapiens", which makes data sharing and integration harder.
2. **Lack of a standardized mechanism for linking to external resources**: Existing data standards such as NWB support data storage and sharing, but they lack a mechanism to uniquely identify metadata entities and link them to external resources (ontologies, brain atlases, etc.), which limits data interpretability and reusability.
3. **Efficient management of large-scale data**: As data volumes grow, external resource links need to be added to existing data dynamically, without modifying the original data.

To address these problems, the paper introduces the **HDMF External Resources Data (HERD)** standard and related tools, which aim to:

- Provide a standardized method to link data and metadata to external resources without modifying the original dataset.
- Integrate with the Neurodata Without Borders (NWB) data standard, helping neuroscientists more easily create and manage data that conform to controlled sets of terms, enhancing data rigor and accuracy.
- Support flexible definition and validation of term sets, to adapt to different experimental stages and requirements.

Together, these methods are intended to improve the FAIRness (Findability, Accessibility, Interoperability, Reusability) of neuroscience data and to promote data sharing and collaboration across laboratories and fields.
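
The core idea summarized above, keeping links to external resources in a separate table rather than editing the dataset, can be illustrated with a small sketch. This is a hypothetical, plain-Python illustration and not the HDMF or HERD API: the `Entity` and `LinkTable` classes, their methods, and the `subject-001`/`subject-002` identifiers are all assumptions made for the example. It shows two records that spell the same species differently being mapped to a single NCBI Taxonomy entity, after which both can be found with one query while the original metadata stays untouched.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    """An entry in an external resource, e.g. an ontology term (hypothetical class)."""
    entity_id: str   # compact identifier of the term
    entity_uri: str  # address where the term can be looked up


@dataclass
class LinkTable:
    """Hypothetical sidecar table of links; NOT the real HERD implementation."""
    # Each row: (object_id, attribute, key as written in the data, Entity)
    rows: list = field(default_factory=list)

    def add_ref(self, object_id, attribute, key, entity):
        # Record the link outside the dataset; the dataset itself is not edited.
        self.rows.append((object_id, attribute, key, entity))

    def entities_for(self, object_id, attribute):
        # All external entities linked to one attribute of one object.
        return [e for (o, a, _, e) in self.rows if o == object_id and a == attribute]

    def objects_for(self, entity_id):
        # All objects linked to a given external entity, regardless of spelling.
        return sorted({o for (o, _, _, e) in self.rows if e.entity_id == entity_id})


# 9606 is the NCBI Taxonomy identifier for Homo sapiens; the URI form is illustrative.
HUMAN = Entity(
    "NCBI_TAXON:9606",
    "https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=9606",
)

# Two existing records (assumed for the example) with inconsistent terminology.
datasets = {
    "subject-001": {"species": "human"},
    "subject-002": {"species": "Homo sapiens"},
}
snapshot = {k: dict(v) for k, v in datasets.items()}  # copy, to check nothing changes

table = LinkTable()
for object_id, metadata in datasets.items():
    table.add_ref(object_id, "species", metadata["species"], HUMAN)

matches = table.objects_for("NCBI_TAXON:9606")
print(matches)  # ['subject-001', 'subject-002']
```

Because the table stores the original key alongside the entity, the free-text spelling used by each lab is preserved, and the link can be added or revised at any experimental stage without rewriting the data files.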