Semantic Modelling of Organizational Knowledge as a Basis for Enterprise Data Governance 4.0 -- Application to a Unified Clinical Data Model

Miguel AP Oliveira, Stephane Manara, Bruno Molé, Thomas Muller, Aurélien Guillouche, Lysann Hesske, Bruce Jordan, Gilles Hubert, Chinmay Kulkarni, Pralipta Jagdev, Cedric R. Berger
2023-11-24
Abstract: Individuals and organizations cope with an always-growing amount of data, which is heterogeneous in its contents and formats. An adequate data management process yielding data quality and control over its lifecycle is a prerequisite to getting value out of this data and minimizing inherent risks related to multiple usages. Common data governance frameworks rely on people, policies, and processes that fall short of the overwhelming complexity of data. Yet, harnessing this complexity is necessary to achieve high-quality standards. The latter will condition any downstream data usage outcome, including generative artificial intelligence trained on this data. In this paper, we report our concrete experience establishing a simple, cost-efficient framework that enables metadata-driven, agile and (semi-)automated data governance (i.e. Data Governance 4.0). We explain how we implement and use this framework to integrate 25 years of clinical study data at an enterprise scale in a fully productive environment. The framework encompasses both methodologies and technologies leveraging semantic web principles. We built a knowledge graph describing avatars of data assets in their business context, including governance principles. Multiple ontologies articulated by an enterprise upper ontology enable key governance actions such as FAIRification, lifecycle management, definition of roles and responsibilities, lineage across transformations and provenance from source systems. This metadata model is the keystone to data governance 4.0: a semi-automatised data management process that considers the business context in an agile manner to adapt governance constraints to each use case and dynamically tune it based on business changes.
Artificial Intelligence, Information Retrieval
What problem does this paper attempt to address?
The paper addresses the challenges of implementing traditional Data Governance (DG) in large, complex organizations and proposes a new approach: Data Governance 4.0 (DG4.0).

### Main Issues

1. **Inherent Contradictions**:
   - Data governance brings constraints, but it is unclear who is willing to bear the cost.
   - Data governance is an organization-wide task, but partial implementation within the organization is ineffective.
   - There is a discrepancy between the overall benefits of data governance and the specific constraints it imposes locally.
2. **Challenges Related to Digital Awareness**:
   - In theory, everyone agrees on the benefits of data governance, but those benefits are difficult to quantify.
   - Overly optimistic product marketing raises expectations, but satisfaction in actual application is low.
   - Managers need to be educated on the management methods specific to data governance projects.
   - The results of data governance are difficult to present intuitively, which makes the work less attractive in a results-based reward system.
3. **Operational Challenges**:
   - The data landscape is very complex, difficult to fully grasp, and needs to be updated constantly.
   - IT departments and business departments in large enterprises have different mindsets and incentive logics.
   - As the amount of data grows, the technical infrastructure and operational costs of data governance also increase.

### Solution

The paper proposes the concept of "Data Governance 4.0," an agile, business-adaptive, and semi-automated way of governing data assets. The core idea is to describe data assets and their business context through metadata, so that data governance strategies can be adjusted dynamically as the business changes. Specifically:

- **Enterprise Knowledge Graph**: Using semantic web standards to build a knowledge graph that describes data assets and their business environment. This knowledge graph includes not only information about the datasets themselves but also information about the enterprise architecture and business processes in which the datasets are located.
- **Semantic Models**: Using semantic web technologies (such as RDF and OWL) to build enterprise ontologies that describe data assets, business context, and data governance requirements.
- **Application of FAIR Principles**: Ensuring data is Findable, Accessible, Interoperable, and Reusable, which helps improve the quality and usability of data.
- **Agility and Automation**: Achieving agile data governance through a metadata-driven approach and automating the data governance process as much as possible.

Through these measures, the authors demonstrate how to apply the new data governance framework in actual projects, particularly in the integration of 25 years of clinical research data. The approach aims to overcome many of the limitations of traditional data governance frameworks and thereby improve the effectiveness and efficiency of data governance.
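
The metadata-driven idea summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the paper relies on semantic web standards such as RDF and OWL, whereas this example stands in for a knowledge graph with plain subject-predicate-object tuples. Every identifier (`ds:study_042_labs`, `ex:containsDataCategory`, the rule table, the constraint names) is a hypothetical placeholder chosen for illustration.

```python
from typing import Iterator, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical metadata graph: data assets described in their business context.
# None of these identifiers come from the paper.
GRAPH: Set[Triple] = {
    ("ds:study_042_labs", "rdf:type", "ex:Dataset"),
    ("ds:study_042_labs", "ex:producedBy", "proc:clinical_trial"),
    ("ds:study_042_labs", "ex:containsDataCategory", "ex:PatientData"),
    ("ds:study_042_labs", "ex:hasOwner", "role:clinical_data_steward"),
    ("ds:site_catalog", "rdf:type", "ex:Dataset"),
    ("ds:site_catalog", "ex:producedBy", "proc:site_management"),
    ("ds:site_catalog", "ex:containsDataCategory", "ex:ReferenceData"),
}

# Assumed rule table: which governance constraints each data category implies.
RULES = {
    "ex:PatientData": {"restricted_access", "retention_review"},
    "ex:ReferenceData": {"open_internal_access"},
}


def match(
    graph: Set[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield triples matching the pattern; None acts as a wildcard."""
    for triple in graph:
        if (
            (s is None or triple[0] == s)
            and (p is None or triple[1] == p)
            and (o is None or triple[2] == o)
        ):
            yield triple


def governance_constraints(graph: Set[Triple], dataset: str) -> Set[str]:
    """Derive the constraints for a dataset from its metadata alone."""
    constraints: Set[str] = set()
    for _, _, category in match(graph, s=dataset, p="ex:containsDataCategory"):
        constraints |= RULES.get(category, set())
    # A dataset without a declared owner triggers a governance action.
    if not any(match(graph, s=dataset, p="ex:hasOwner")):
        constraints.add("assign_owner")
    return constraints


if __name__ == "__main__":
    for ds in sorted({s for s, _, _ in match(GRAPH, p="rdf:type", o="ex:Dataset")}):
        print(ds, sorted(governance_constraints(GRAPH, ds)))
```

Because the constraints are computed from the graph rather than hard-coded per dataset, adding or changing a triple (for example, declaring an owner for `ds:site_catalog`) changes the derived governance actions without touching the code. A production setup would presumably express the same pattern with an RDF store and query language rather than Python sets.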