DataChat: Prototyping a Conversational Agent for Dataset Search and Visualization

Lizhou Fan, Sara Lafia, Lingyao Li, Fangyuan Yang, Libby Hemphill
2023-05-27
Abstract: Data users need relevant context and research expertise to effectively search for and identify relevant datasets. Leading data providers, such as the Inter-university Consortium for Political and Social Research (ICPSR), offer standardized metadata and search tools to support data search. Metadata standards emphasize the machine-readability of data and its documentation. There are opportunities to enhance dataset search by improving users' ability to learn about, and make sense of, information about data. Prior research has shown that context and expertise are two main barriers users face in effectively searching for, evaluating, and deciding whether to reuse data. In this paper, we propose a novel chatbot-based search system, DataChat, that leverages a graph database and a large language model to provide novel ways for users to interact with and search for research data. DataChat complements data archives' and institutional repositories' ongoing efforts to curate, preserve, and share research data for reuse by making it easier for users to explore and learn about available research data.
Information Retrieval, Human-Computer Interaction
What problem does this paper attempt to address?
This paper asks how a chatbot-based system, DataChat, can improve users' ability to search for and understand research datasets by combining a Scholarly Knowledge Graph (SKG) with a Large Language Model (LLM). Specifically, the paper focuses on:

1. **Improving the connectivity of dataset search**: Constructing a knowledge graph makes the relationships between datasets explicit, helping users understand and use those relationships.
2. **Enhancing search efficiency**: With natural language processing, users can retrieve complex dataset information through simple natural language queries, which lowers learning costs and technical barriers.
3. **Increasing data visibility**: Presenting the relationships between datasets through a graphical interface helps users understand the structure and associations of datasets more intuitively.
4. **Strengthening interactivity**: An interactive user interface lets users explore and manipulate dataset information according to their own needs.

The paper notes that although existing data search tools provide standardized metadata and search functions, gaps in users' background knowledge and expertise remain the main obstacles to effectively searching for and evaluating datasets. DataChat aims to address this through the approaches above, making dataset search and reuse more efficient and user-friendly.
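
To make the graph-plus-natural-language idea concrete, here is a minimal Python sketch. It is not DataChat's implementation: a small in-memory class stands in for the graph database, simple keyword matching stands in for the large language model, and the dataset names, the `has_subject` relation, and every function name are hypothetical placeholders chosen for illustration.

```python
from collections import defaultdict


class TinyGraph:
    """In-memory stand-in for a graph database (hypothetical, for illustration)."""

    def __init__(self):
        # edges[node][relation] -> set of neighbouring nodes
        self.edges = defaultdict(lambda: defaultdict(set))

    def add_edge(self, source, relation, target):
        self.edges[source][relation].add(target)

    def neighbors(self, node, relation):
        # .get avoids creating empty entries as a side effect of a lookup.
        return sorted(self.edges.get(node, {}).get(relation, set()))

    def related_datasets(self, dataset, via):
        """Datasets that share at least one `via` neighbour with `dataset`."""
        shared = set(self.neighbors(dataset, via))
        return sorted(
            other for other in self.edges
            if other != dataset and shared & self.edges[other].get(via, set())
        )


def answer(graph, question):
    """Keyword-based stand-in for the LLM step: map a question to a graph lookup."""
    q = question.lower()
    # Find the first known dataset name mentioned in the question.
    dataset = next((name for name in graph.edges if name.lower() in q), None)
    if dataset is None:
        return []
    if "similar" in q or "related" in q:
        return graph.related_datasets(dataset, via="has_subject")
    if "about" in q or "subject" in q:
        return graph.neighbors(dataset, "has_subject")
    return []


# Hypothetical datasets and subject terms, invented for this example.
graph = TinyGraph()
graph.add_edge("Study A", "has_subject", "voting")
graph.add_edge("Study A", "has_subject", "elections")
graph.add_edge("Study B", "has_subject", "elections")
graph.add_edge("Study C", "has_subject", "health")

print(answer(graph, "Which datasets are related to Study A?"))  # ['Study B']
print(answer(graph, "What is Study C about?"))                  # ['health']
```

In this sketch, "related" simply means sharing a subject term. A real system would presumably translate the question into a graph query with the language model and could traverse richer relationships than the single one assumed here.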