Humboldt: Metadata-Driven Extensible Data Discovery

Alex Bäuerle, Çağatay Demiralp, Michael Stonebraker
2024-08-21
Abstract: Data discovery is crucial for data management and analysis and can benefit from better utilization of metadata. For example, users may want to search data using queries like "find the tables created by Alex and endorsed by Mike that contain sales numbers." They may also want to see how the data they view relates to other data, its lineage, or the quality and compliance of its upstream datasets, all metadata. Yet, effectively surfacing metadata through interactive user interfaces (UIs) to augment data discovery poses challenges. Constantly revamping UIs with each update to metadata sources (or providers) consumes significant development resources and lacks scalability and extensibility. In response, we introduce Humboldt, a new framework enabling interactive data systems to effectively leverage metadata for data discovery and rapidly evolve their UIs to support metadata changes. Humboldt decouples metadata sources from the implementation of data discovery UIs that support search and dataset visualization using metadata fields. It automatically generates interactive data discovery interfaces from declarative specifications, avoiding costly metadata-specific (re)implementations.
Databases, Human-Computer Interaction
What problem does this paper attempt to address?
This paper addresses a key challenge in data discovery: how to use metadata more effectively in large-scale data management and analysis. Specifically, it proposes a new framework named Humboldt to address the following issues:

1. **Effective use of metadata**:
   - Users want to search for data with queries such as "find the tables created by Alex and endorsed by Mike that contain sales numbers". They also want to know how the data they are viewing relates to other data, its lineage, and the quality and compliance of its upstream datasets, all of which depend on metadata.
   - However, existing systems struggle to surface metadata effectively in an interactive user interface (UI), especially when metadata sources are updated frequently. Constantly rewriting UI code consumes significant development resources and lacks scalability and extensibility.
2. **Rapid evolution of the UI and support for metadata changes**:
   - When metadata sources are updated, an existing UI usually requires expensive and error-prone code changes, which makes it hard for the UI to keep up with new metadata requirements.
   - Humboldt decouples metadata sources from the UI implementation and automatically generates an interactive data discovery interface. This avoids metadata-specific reimplementation and lets the UI evolve quickly as metadata changes.
3. **Improving the user experience of data discovery**:
   - Users need contextual views (e.g., "Which dashboards are my teammates working on?"), exploration tools (e.g., "Show data that can be joined with the data I am currently viewing"), and filters (e.g., "Show only the analyses of specific users").
   - Existing interactive data systems offer limited support for metadata-based data discovery, and their UIs are usually hard-coded and difficult to customize and extend to user needs.

### Solutions

The Humboldt framework addresses these problems in the following ways:

- **Generating the UI from declarative specifications**: Humboldt automatically generates the data discovery UI from declarative specifications, so integrating new metadata only requires adding a few lines of specification code, with no change to the UI implementation.
- **Abstraction layer over metadata providers**: Humboldt serves as an interface between existing metadata providers and the data discovery UI, so providers can be added, changed, or removed without modifying the UI code.
- **Rich interaction features**: Humboldt supports multiple views, composable queries, and sorting algorithms, giving users a flexible data discovery experience.

Through these designs, Humboldt improves the efficiency of data discovery and makes the system more scalable and flexible, so it can better serve the diverse needs of different users and domains.
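To make the ideas under Solutions more concrete, the following is a minimal sketch of how a declarative specification can decouple metadata providers from search logic. It is not Humboldt's actual specification format or API: the `MetadataProvider` interface, the `SPEC` entries, the `resolve` and `search` helpers, and the sample datasets are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Any, Optional, Protocol


class MetadataProvider(Protocol):
    """Hypothetical provider interface: returns one metadata value per dataset."""

    def fetch(self, dataset_id: str) -> Any: ...


@dataclass
class DictProvider:
    """Toy in-memory provider standing in for a catalog, lineage, or usage service."""

    values: dict

    def fetch(self, dataset_id: str) -> Any:
        return self.values.get(dataset_id)


# Hypothetical declarative spec: one entry per metadata field, naming its
# provider and how a generated UI may use the field.
SPEC = [
    {"field": "created_by", "provider": "owners", "filterable": True},
    {"field": "endorsed_by", "provider": "endorsements", "filterable": True},
    {"field": "views", "provider": "usage", "sortable": True},
]

# Made-up sample data, for illustration only.
PROVIDERS = {
    "owners": DictProvider({"sales_q1": "Alex", "sales_q2": "Alex", "hr_roster": "Sam"}),
    "endorsements": DictProvider({"sales_q1": "Mike", "sales_q2": "Mike"}),
    "usage": DictProvider({"sales_q1": 12, "sales_q2": 40, "hr_roster": 7}),
}


def resolve(dataset_ids, spec, providers):
    """Build one row per dataset by asking each field's provider for its value."""
    rows = []
    for dataset_id in dataset_ids:
        row = {"id": dataset_id}
        for entry in spec:
            row[entry["field"]] = providers[entry["provider"]].fetch(dataset_id)
        rows.append(row)
    return rows


def search(rows, spec, filters, sort_by: Optional[str] = None):
    """Apply equality filters, then sort descending, using only what the spec allows."""
    filterable = {e["field"] for e in spec if e.get("filterable")}
    sortable = {e["field"] for e in spec if e.get("sortable")}
    unknown = set(filters) - filterable
    if unknown:
        raise ValueError(f"fields not filterable per spec: {sorted(unknown)}")
    hits = [r for r in rows if all(r[f] == v for f, v in filters.items())]
    if sort_by is not None:
        if sort_by not in sortable:
            raise ValueError(f"field not sortable per spec: {sort_by}")
        # Missing values sort as 0 so datasets without usage data come last.
        hits.sort(key=lambda r: r[sort_by] or 0, reverse=True)
    return hits


rows = resolve(["sales_q1", "sales_q2", "hr_roster"], SPEC, PROVIDERS)
hits = search(rows, SPEC, {"created_by": "Alex", "endorsed_by": "Mike"}, sort_by="views")
print([h["id"] for h in hits])
```

In this sketch, supporting a new metadata field means appending one entry to `SPEC` and registering a provider; `resolve` and `search` stay unchanged, which mirrors the decoupling described above in a much simplified form.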