Abstract: Metadata management for distributed data sources is a long-standing but ever-growing problem. To counter this challenge in a research-data and library-oriented setting, this work constructs a data architecture, derived from the data-lake: the metadata-lake. A proof-of-concept implementation of this proposed metadata system is presented and evaluated as well.
What problem does this paper attempt to address?
This paper addresses the problem of metadata management for distributed data sources. As the amount of data grows, the accompanying metadata also becomes large and complex, and managing and using it effectively becomes a pressing challenge. To address this challenge, particularly in research-data and library settings, the paper proposes a new data architecture, the metadata-lake, and presents and evaluates a proof-of-concept implementation.
### Problem Background
1. **Challenges in Metadata Management**: With the advent of the big data era, the volume of metadata has grown substantially. Because metadata describes other data, its management and use call for dedicated treatment.
2. **Importance of Metadata**: The existence, detail level, and accuracy of metadata determine the discoverability, use value, and overall value of related data.
3. **Limitations of Existing Solutions**: Traditional data warehouses and data lakes have limitations when dealing with metadata: they cannot meet interdisciplinary data discovery needs, nor do they provide a unified access interface.
### Proposed Solution
1. **Concept of Metadata Lake**: A metadata lake is a system that centrally stores and manages metadata from multiple distributed data sources. It provides a single access point and improves the discoverability and usability of data by converting metadata into a unified format.
2. **Implementation Method**: The paper proposes an open-source implementation named "DatAasee", aiming to provide a centralized metadata management platform for universities and research institutions. This platform is especially suitable for processing scientific research data and bibliographic metadata.
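
The core idea of the metadata lake (several sources, one unified format, one access point) can be illustrated with a minimal Python sketch. This is not DatAasee's actual API or schema: the `UnifiedRecord` fields, the two converter functions, the raw field names, and the `MetadataLake` class are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class UnifiedRecord:
    """Hypothetical common schema; the field names are assumptions."""
    source: str
    identifier: str
    title: str
    creators: Tuple[str, ...]
    year: Optional[int]


def from_bibliographic(raw: dict) -> UnifiedRecord:
    # Assumed bibliographic export: "id", "title",
    # semicolon-separated "author", ISO-style "date".
    authors = raw.get("author", "").split(";")
    return UnifiedRecord(
        source="library",
        identifier=raw["id"],
        title=raw["title"].strip(),
        creators=tuple(a.strip() for a in authors if a.strip()),
        year=int(raw["date"][:4]) if raw.get("date") else None,
    )


def from_research_data(raw: dict) -> UnifiedRecord:
    # Assumed research-data repository export: "doi", "name",
    # a list of "contributors", integer "published" year.
    return UnifiedRecord(
        source="repository",
        identifier=raw["doi"],
        title=raw["name"].strip(),
        creators=tuple(raw.get("contributors", [])),
        year=raw.get("published"),
    )


class MetadataLake:
    """Toy in-memory store acting as the single access point."""

    def __init__(self) -> None:
        self._records: List[UnifiedRecord] = []

    def ingest(self, raw_records: Iterable[dict],
               convert: Callable[[dict], UnifiedRecord]) -> None:
        # Each source supplies its own converter into the unified schema.
        for raw in raw_records:
            self._records.append(convert(raw))

    def search(self, term: str) -> List[UnifiedRecord]:
        # Case-insensitive substring match on the unified title field.
        needle = term.lower()
        return [r for r in self._records if needle in r.title.lower()]


lake = MetadataLake()
lake.ingest(
    [{"id": "b1", "title": "Ocean Currents Atlas",
      "author": "Doe, J.; Roe, R.", "date": "2019-05-01"}],
    from_bibliographic,
)
lake.ingest(
    [{"doi": "10.0000/example", "name": "Ocean temperature measurements",
      "contributors": ["Poe, E."], "published": 2021}],
    from_research_data,
)
hits = lake.search("ocean")
```

Because both sources are converted on ingest, a single `search` call spans bibliographic and research-data records without the caller knowing either source format. A real system would add persistence, richer schemas, and a network-facing query interface.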
### Main Objectives
- **Centralize Metadata**: Centralize metadata from multiple sources into a unified repository for easy management and query.
- **Interdisciplinary Data Discovery**: Convert metadata from different disciplines into a unified format so that non-specialists can discover and use data more easily.
- **Act as a Metadata Hub**: Provide metadata supply services for other systems, simplifying the implementation and operation of local metadata processing systems.
- **Repository in Line with FAIR Principles**: Support scientific practice in which research data is findable, accessible, interoperable, and reusable.
- **Funding Compliance**: Meet the requirements of funding agencies for the discoverability of research data.
### Conclusion
By designing and implementing the metadata lake, this paper aims to reduce the complexity of metadata management for distributed data sources, offering both a new architectural approach and a tool for the efficient management and use of scientific research data.