NL2KQL: From Natural Language to Kusto Query

Amir H. Abdi, Xinye Tang, Jeremias Eichelbaum, Mahan Das, Alex Klein, Nihal Irmak Pakis, William Blum, Daniel L Mace, Tanvi Raja, Namrata Padmanabhan, Ye Xing
2024-04-16
Abstract: Data is growing rapidly in volume and complexity. Proficiency in database query languages is pivotal for crafting effective queries. As coding assistants become more prevalent, there is significant opportunity to enhance database query languages. The Kusto Query Language (KQL) is a widely used query language for large semi-structured data such as logs, telemetry, and time series on big data analytics platforms. This paper introduces NL2KQL, an innovative framework that uses large language models (LLMs) to convert natural language queries (NLQs) to KQL queries. The proposed NL2KQL framework includes several key components: the Schema Refiner, which narrows down the schema to its most pertinent elements; the Few-shot Selector, which dynamically selects relevant examples from a few-shot dataset; and the Query Refiner, which repairs syntactic and semantic errors in KQL queries. Additionally, this study outlines a method for generating large datasets of synthetic NLQ-KQL pairs that are valid within a specific database context. To validate NL2KQL's performance, we utilize an array of online (based on query execution) and offline (based on query parsing) metrics. Through ablation studies, the significance of each framework component is examined, and the datasets used for benchmarking are made publicly available. This work is the first of its kind and is compared with available baselines to demonstrate its effectiveness.
Databases, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
The paper addresses the problem of converting natural language queries (NLQs) into Kusto Query Language (KQL) queries. The research team proposes a framework called NL2KQL, which uses large language models (LLMs) to perform this conversion. The NL2KQL framework includes several key components:

1. **Schema Refiner**: Narrows the database schema down to its most relevant elements, reducing potential errors during generation.
2. **Few-shot Selector**: Dynamically selects a small number of examples relevant to the current query, which helps the model understand the query requirements in a specific context.
3. **Query Refiner**: Repairs syntactic and semantic errors in the generated KQL queries so that they are valid in the target Kusto database.

Additionally, the paper introduces a method for generating large numbers of synthetic NLQ-KQL pairs that are valid in a specific database context. To validate NL2KQL, the research team uses online metrics (based on query execution) and offline metrics (based on query parsing), and conducts ablation studies to assess the importance of each component. The team also releases the first benchmark for KQL generation, together with the related data catalog, for other researchers to use.

Overall, the goal of this research is to lower the technical barrier so that more users can interact with data through natural language, especially semi-structured big data such as logs, telemetry, and time series. In this way, NL2KQL is expected to make data analysis more efficient and accessible.
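To make the three components described above more concrete, the following is a minimal sketch of how such a pipeline could be wired together. It is not the paper's implementation. The function names (`refine_schema`, `select_few_shots`, `refine_query`), the toy schema, the example pairs, the token-overlap scoring, and the `difflib`-based table-name repair are all assumptions made for illustration, and the LLM call itself is left out.

```python
import difflib
import re

# Hypothetical schema and few-shot pairs, invented for illustration only.
SCHEMA = {
    "SigninLogs": ["TimeGenerated", "UserPrincipalName", "ResultType"],
    "DeviceEvents": ["Timestamp", "DeviceName", "ActionType"],
}
FEW_SHOTS = [
    ("count signin failures per user",
     'SigninLogs | where ResultType != "0" | summarize count() by UserPrincipalName'),
    ("list devices seen in the last day",
     "DeviceEvents | where Timestamp > ago(1d) | distinct DeviceName"),
]


def words(text: str) -> set[str]:
    """Split on camelCase boundaries and punctuation, then lowercase."""
    return {w.lower() for w in re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", text)}


def similarity(a: str, b: str) -> float:
    """Jaccard overlap of word sets; a crude stand-in for embedding similarity."""
    wa, wb = words(a), words(b)
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def refine_schema(nlq: str, schema: dict[str, list[str]], top_k: int = 1) -> dict[str, list[str]]:
    """Keep only the top_k tables whose names and columns best match the NLQ."""
    ranked = sorted(
        schema,
        key=lambda table: similarity(nlq, " ".join([table, *schema[table]])),
        reverse=True,
    )
    return {table: schema[table] for table in ranked[:top_k]}


def select_few_shots(nlq: str, examples: list[tuple[str, str]], k: int = 1) -> list[tuple[str, str]]:
    """Pick the k examples whose natural language side is closest to the NLQ."""
    return sorted(examples, key=lambda ex: similarity(nlq, ex[0]), reverse=True)[:k]


def build_prompt(nlq: str, schema: dict[str, list[str]], shots: list[tuple[str, str]]) -> str:
    """Assemble the refined schema, the selected examples, and the question."""
    schema_lines = [f"{t}({', '.join(cols)})" for t, cols in schema.items()]
    shot_lines = [f"Q: {q}\nKQL: {kql}" for q, kql in shots]
    return "\n".join(["Schema:", *schema_lines, "Examples:", *shot_lines, f"Q: {nlq}", "KQL:"])


def refine_query(query: str, schema: dict[str, list[str]]) -> str:
    """Toy repair step: replace an unknown leading table name with the closest known one."""
    table, pipe, rest = query.partition("|")
    name = table.strip()
    if name in schema:
        return query
    match = difflib.get_close_matches(name, list(schema), n=1, cutoff=0.6)
    if not match:
        return query  # nothing close enough; leave the query untouched
    return f"{match[0]} {pipe}{rest}".strip()


nlq = "How many signin failures per user are in the logs?"
small_schema = refine_schema(nlq, SCHEMA)
shots = select_few_shots(nlq, FEW_SHOTS)
prompt = build_prompt(nlq, small_schema, shots)  # would be sent to an LLM
repaired = refine_query('SignInLog | where ResultType != "0"', SCHEMA)
```

In this sketch the refined schema and the selected examples shrink the prompt before generation, while `refine_query` runs after generation. A real query refiner would need an actual KQL parser and the target database's schema to catch syntactic and semantic errors beyond a misspelled table name.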