Abstract:Online Judge Systems (OJSs) play a crucial role in evaluating SQL programming skills. However, OJSs may not accurately evaluate students’ queries due to the error-detection capabilities of test sets are insufficient, resulting in false positives that can mislead students and hinder their learning. This study analyzes a large-scale OJS’s evaluation dataset and identifies more than 110,000 (1.94%) false positive queries. It also validates existing SQL error categorization and reveals a new type of logical error called deceptive error, which occurs when students construct queries that pass specific test cases but fail to solve the actual problem. This type of error has been overlooked in previous research and can provide new insights into how to improve OJSs by enhancing test cases and feedback. This study contributes to the understanding of assessment and evaluation practices and processes in programming education, particularly the contribution that OJSs make to student learning and to course, staff and institutional development. It also suggests error prevention and detection techniques that can improve the effectiveness and fairness of OJSs in programming education and competitions.

What problem does this paper attempt to address?

This paper attempts to solve the misjudgment problems that occur when online judging systems (OJSs) evaluate SQL queries, especially false positives and deceptive errors. These problems may lead to the situation where the queries submitted by students pass the test cases but do not actually solve the problems correctly, thus misleading students and affecting their learning outcomes. Specifically, the paper focuses on the following two research questions: 1. **RQ1**: What kinds of errors exist in SQL queries that are wrongly considered correct by automatic evaluation systems (such as OJSs), and what are the characteristics of these kinds of errors? 2. **RQ2**: What kinds of errors exist in SQL queries that aim to mislead automatic evaluation systems (such as OJSs), and what are the characteristics of these kinds of errors? ### Background and Motivation - **The Importance of SQL Skills**: With the wide application of database systems, SQL skills have become one of the core competencies of computer science students and are highly valued by the software industry. - **Challenges of Manual Evaluation**: Manually evaluating SQL queries is not only time - consuming but also difficult to provide timely feedback. Therefore, automatic evaluation tools (such as OJSs) have important application value in higher education. - **Limitations of OJSs**: Although OJSs are widely used in programming education, they mainly rely on the comparison of execution results and do not evaluate the correctness of query logic, which leads to the occurrence of false positives and deceptive errors. ### Research Methods 1. **Data Collection**: - Use a public data set containing more than 5.8 million SQL queries submitted by students and 22 programming problems. - Each programming problem in the data set provides requirements and the corresponding table structure. - Students can choose different DBMSs (such as MySQL, MS SQL Server, Oracle) and submit queries. The OJS will execute the queries and return the execution results, expected results, and compilation messages. 2. **Detecting Erroneous Queries**: - Use the Python Faker library to generate stronger test cases to detect false positives. - Verify the error - detecting ability of test cases through the mutation testing method. - Execute all the queries marked as correct and re - mark the queries with incorrect execution results as false positives. 3. **Data Analysis**: - Adopt directed content analysis and general content analysis methods to classify and analyze the errors in false positives. - Directed content analysis is used to verify or extend the existing SQL error classification framework. - General content analysis is used to name and classify newly discovered errors. ### Results - **False Positive Proportion**: Among approximately 5.8 million queries, about 1.94% (112,200 queries) are marked as false positives. - **Logical Errors**: The queries in false positives pass some test cases but fail to solve the problems correctly, so they belong to logical errors. - **Deceptive Errors**: Although some queries obtain correct execution results on specific test cases, their semantics clearly do not meet the requirements of the topic. These queries are defined as deceptive errors. ### Classification of Logical Errors - **Boundary Value Errors**: - Range - Specific Boundary Error - **Constraint Formulation Deviations**: - Constraint Omission Error - Redundant Constraint Error - Misplaced Condition Error - **Redundancy - Sort Discrepancies**: - Unintended Record Deduplication - Redundant Record Inclusion - Column Ordering Deviation

False Positives and Deceptive Errors in SQL Assessment: A Large-scale Analysis of Online Judge Systems

Who Judges the Judge: an Empirical Study on Online Judge Tests

Detecting DBMS Bugs with Context-Sensitive Instantiation and Multi-Plan Execution

SQLRepair: Identifying and Repairing Mistakes in Student-Authored SQL Queries

How problem difficulty and order influence programming education outcomes in online judge systems

The Development of an Automatic Test Assembly System for a Formative Assessment in Mastery Learning Instruction: Case of the SQL Mastery Course

Assessing AI Detectors in Identifying AI-Generated Code: Implications for Education

Edit Based Grading of SQL Queries

Finding Logic Bugs in Spatial Database Engines via Affine Equivalent Inputs

A Robust Computerized Adaptive Testing Approach in Educational Question Retrieval

Student Mastery or AI Deception? Analyzing ChatGPT's Assessment Proficiency and Evaluating Detection Strategies

VisOJ: real-time visual learning analytics dashboard for online programming judge

Small Private Online Judge: A New Tool for Empirical Education Research

Programming Online Judge System

Understanding Help-Seeking Behavior of Students Using LLMs vs. Web Search for Writing SQL Queries

Automatically Learning Topics and Difficulty Levels of Problems in Online Judge Systems

Insights into Natural Language Database Query Errors: From Attention Misalignment to User Handling Strategies

Identifying Student Profiles Within Online Judge Systems Using Explainable Artificial Intelligence

Explaining Wrong Queries Using Small Examples

A Survey on Online Judge Systems and Their Applications

Question-Score Identity Detection (Q-SID): A Statistical Algorithm to Detect Collusion Groups with Error Quantification from Exam Question Scores