Abstract: Querying structured databases with natural language (NL2SQL) has remained a difficult problem for years. Recently, advances in machine learning (ML), natural language processing (NLP), and large language models (LLMs) have led to significant improvements in performance, with the best model achieving ∼85% accuracy on the benchmark Spider dataset. However, there is a lack of systematic understanding of the types and causes of errors in erroneous queries, and of the effectiveness of error-handling mechanisms. To bridge the gap, a taxonomy of errors made by four representative NL2SQL models was built in this work, along with an in-depth analysis of the errors. Second, the causes of model errors were explored by analyzing the model-human attention alignment to the natural language query. Last, a within-subjects user study with 26 participants was conducted to investigate the effectiveness of three interactive error-handling mechanisms in NL2SQL. Findings from this paper shed light on the design of model structure and error discovery and repair strategies for natural language data query interfaces in the future.
What problem does this paper attempt to address?
### Problems the paper attempts to solve
This paper addresses errors in natural-language-to-SQL (NL2SQL) query generation. Although recent progress in machine learning, natural language processing, and large language models has significantly improved the performance of NL2SQL tasks, there is still a lack of systematic understanding of error types, error causes, and the effectiveness of error-handling mechanisms. Specifically, the paper focuses on the following points:
1. **Error classification**: A taxonomy of the errors made by four representative NL2SQL models was constructed, and these errors were analyzed in depth.
2. **Error causes**: By analyzing the alignment of model and human attention to natural language queries, the causes of model errors were explored.
3. **User error-handling strategies**: Through a within-subjects user study with 26 participants, the effectiveness of three interactive error-handling mechanisms was investigated.
### Background and motivation
Data querying is a crucial step in the data analysis, understanding, and decision-making processes. However, traditional data query interfaces require users to use formal languages (such as SQL), which poses a significant learning barrier for non-expert users. To solve this problem, natural language data query interfaces allow users to express data queries in natural language, thereby lowering the barrier to data querying and enabling users to explore data flexibly.
Although the development of deep learning and large language models in recent years has significantly improved the performance of NL2SQL tasks, model performance seems to have stagnated at around 85% accuracy, indicating that there are bottlenecks in relying solely on model improvements. Therefore, the paper focuses on the remaining 15% or so of queries that models get wrong, attempting to further improve the performance of NL2SQL systems by understanding error types, causes, and user error-handling strategies.
### Main contributions
1. **Error classification system**: Through iterative and axial coding procedures, a representative error classification system for the latest NL2SQL models was developed.
2. **Attention-alignment analysis**: A comprehensive analysis was conducted to compare model attention and human attention, and the results showed that NL2SQL errors are highly correlated with misaligned attention.
3. **User study**: Through a controlled user study, the effectiveness and efficiency of three representative NL2SQL error discovery and repair methods were investigated.
4. **Implications for future design**: The implications for the design of error-handling mechanisms in future natural language query interfaces were discussed.
### Methods and results
1. **Error classification**:
- Four representative NL2SQL models (DIN-SQL + GPT-4, SmBop + GraPPa, BRIDGE v2 + BERT, GAZP + BERT) were selected.
- All queries generated by each model on the Spider dataset whose execution results differed from the ground-truth results were collected.
- Four authors conducted multiple rounds of qualitative coding and refinement, and finally obtained the error classification system.
2. **Attention-alignment analysis**:
- Two SQL experts manually annotated important words in natural language queries (human attention).
- The weight of each word in the model prediction was calculated by the perturbation method (model attention).
- The alignment of attention was measured by calculating the overlap between human-attended words and model-attended words.
- The results showed that there were significant differences in attention alignment between error queries and correctly predicted queries, indicating that NL2SQL errors are highly correlated with misaligned attention.
3. **User study**:
- Three representative error-handling mechanisms were selected: the explanation-and-example-based method (DIY), the explanation-based SQL visualization method (SQLVis), and the dialogue-based method.
- A controlled user study with 26 participants was conducted to investigate the effectiveness of these mechanisms in improving the efficiency and accuracy of error discovery and repair.
- The research results showed that these error-handling mechanisms had limited effectiveness in improving the efficiency of error discovery and repair on complex datasets.
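The attention-alignment analysis described above (perturbation-based word weights compared against human-annotated words) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the paper's implementation: `toy_score` is a hypothetical stand-in for a model's confidence in its predicted SQL, perturbation is done by deleting one word at a time, the model-attended set is taken as the top-k weighted words, and alignment is computed as the fraction of human-attended words that the model also attends to. The paper may use a different perturbation scheme and overlap measure.

```python
from typing import Callable, List, Set


def perturbation_weights(words: List[str],
                         score: Callable[[List[str]], float]) -> List[float]:
    """Weight each word by how much the score changes when that word is deleted."""
    base = score(words)
    return [abs(base - score(words[:i] + words[i + 1:]))
            for i in range(len(words))]


def top_k_words(words: List[str], weights: List[float], k: int) -> Set[str]:
    """Return the k words with the largest perturbation weights."""
    ranked = sorted(range(len(words)), key=lambda i: weights[i], reverse=True)
    return {words[i] for i in ranked[:k]}


def alignment(human: Set[str], model: Set[str]) -> float:
    """Fraction of human-attended words that are also model-attended."""
    if not human:
        return 0.0
    return len(human & model) / len(human)


def toy_score(words: List[str]) -> float:
    """Hypothetical scorer: NOT a real NL2SQL model, only a stand-in."""
    importance = {"singers": 5, "older": 3, "30": 2}
    return float(sum(importance.get(w, 0) for w in words))


question = "show names of singers older than 30".split()
human_words = {"names", "singers", "older", "30"}  # hypothetical expert annotation

weights = perturbation_weights(question, toy_score)
model_words = top_k_words(question, weights, k=len(human_words))
score = alignment(human_words, model_words)
print(sorted(model_words), score)
```

In this toy setup the stand-in scorer ignores the word "names", which the hypothetical annotator marked as important, so the alignment falls below 1. A gap of that kind between human and model attention is the sort of misalignment the analysis associates with erroneous queries. Treating words as a set ignores repeated words in a question; a fuller version would track token positions.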
### Conclusion
This paper systematically analyzes and classifies the errors of NL2SQL models, reveals the high correlation between errors and misaligned attention, and evaluates the effectiveness of different error-handling mechanisms through user studies. These findings provide important implications for the design of more effective error-handling mechanisms in future natural language query interfaces.