Assessing Confidence in AI-Assisted Grading of Physics Exams through Psychometrics: An Exploratory Study

Gerd Kortemeyer, Julian Nöhl
2024-10-25
Abstract: This study explores the use of artificial intelligence in grading high-stakes physics exams, emphasizing the application of psychometric methods, particularly Item Response Theory (IRT), to evaluate the reliability of AI-assisted grading. We examine how grading rubrics can be iteratively refined and how threshold parameters can determine when AI-generated grades are reliable versus when human intervention is necessary. By adjusting thresholds for correctness measures and uncertainty, AI can grade with high precision, significantly reducing grading workloads while maintaining accuracy. Our findings show that AI can achieve a coefficient of determination of $R^2\approx 0.91$ when handling half of the grading load, and $R^2 \approx 0.96$ for one-fifth of the load. These results demonstrate AI's potential to assist in grading large-scale assessments, reducing both human effort and associated costs. However, the study underscores the importance of human oversight in cases of uncertainty or complex problem-solving, ensuring the integrity of the grading process.
Physics Education
What problem does this paper attempt to address?
This paper addresses the reliability and validity of using artificial intelligence (AI) to assist in grading physics exams. Specifically, it applies psychometric methods, especially Item Response Theory (IRT), to evaluate the reliability of AI-assisted grading, and explores how adjusting grading criteria and uncertainty thresholds can determine when AI grades are reliable and when human intervention is required. The aim is to reduce the grading workload through AI while maintaining grading accuracy, with human oversight retained in cases of uncertainty or complex problem-solving to ensure the integrity of the grading process.

### Research Background and Motivation

With the rapid development of AI, and in particular the strong performance of large language models (such as GPT-4) on academic tasks, AI is increasingly applied in physics education: supporting education research, solving problems, constructing new problems, and making initial attempts at grading. However, the reliability of AI in grading complex problem-solving processes remains a concern, particularly because the European Union classifies "AI systems for evaluating learning outcomes" as "high-risk" systems that require human oversight.

### Research Questions and Hypotheses

1. **Rubric Optimization**:
   - The researchers use classical and item-response psychometric methods to evaluate the validity of the AI-based grading process, covering both the grading rubrics (prompts) and the grading judgments (reasoning and performance).
   - Hypothesis 1: Psychometric methods can be used to iteratively improve grading rubrics without reference to specific answers or ground-truth results.
2. **Grading Acceptance**:
   - The validity of the AI grade must be evaluated for each student on each item, followed by an automatic decision either to accept the AI grade or to submit the item for human evaluation.
   - Hypothesis 2: The predictive ability of the Bayesian model can be used to evaluate the validity of AI grading judgments for each student on each item, again without reference to specific answers or ground-truth results.

### Methodology

1. **Workflow**:
   - Handwritten exam papers are first scanned into PDF files and then converted into LaTeX using GPT-4o for OCR.
   - GPT-4-Turbo grades each problem part, and the grading prompts are adjusted iteratively according to statistical evaluation results (such as standard deviation, IRT, and correlation).
   - Finally, Python programs using the Scikit and Pandas libraries perform the IRT estimation and analyze the sparse data sets.
2. **Optical Character Recognition (OCR)**:
   - GPT-4o converts the handwritten images into LaTeX, preserving text and formulas accurately and describing graphical content in detail.
3. **Grading Rubrics**:
   - The rubrics are based on sample solutions provided by teaching assistants, translated into English and marked with unique identifiers.
   - The LLM is prompted to award partial credit from 0% to 100%, even though teaching assistants usually make binary decisions (full credit or none).
4. **Grading**:
   - GPT-4-Turbo performs the grading, following the rubric closely and paying particular attention to the presence and quality of the student's work.

### Conclusions

The results show that AI can achieve $R^2 \approx 0.91$ when handling half of the grading workload and $R^2 \approx 0.96$ when handling one-fifth of it. These results demonstrate the potential of AI in large-scale assessment, where it can significantly reduce human effort and cost, while underscoring the importance of human oversight in cases of uncertainty or complex problem-solving.
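
The acceptance decision described under Hypothesis 2 can be illustrated with a minimal Python sketch. It is not the authors' implementation. The summary does not say which IRT model, uncertainty measure, or threshold values were used, so the two-parameter logistic (2PL) model, the function names `p_correct_2pl`, `route_grade`, and `r_squared`, and the default thresholds `max_uncertainty` and `max_surprise` are all assumptions chosen for illustration. The idea is that an AI grade is accepted only if the AI's own uncertainty is low and the grade does not deviate strongly from what the psychometric model predicts for that student on that item.

```python
import math


def p_correct_2pl(theta, a, b):
    """Probability of solving an item under a 2PL IRT model (assumed here).

    theta: student ability, a: item discrimination, b: item difficulty.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def route_grade(ai_score, ai_uncertainty, theta, a, b,
                max_uncertainty=0.2, max_surprise=0.6):
    """Decide whether one AI grade is accepted or sent to a human grader.

    ai_score:       partial credit in [0, 1] awarded by the AI.
    ai_uncertainty: hypothetical spread of the AI's grade for this item,
                    for example across repeated grading runs.
    Both thresholds are illustrative defaults, not values from the paper.
    """
    expected = p_correct_2pl(theta, a, b)
    surprise = abs(ai_score - expected)  # disagreement with the IRT prediction
    if ai_uncertainty > max_uncertainty or surprise > max_surprise:
        return "human"
    return "accept"


def r_squared(reference, predicted):
    """Coefficient of determination of predicted grades against reference grades."""
    mean = sum(reference) / len(reference)
    ss_res = sum((r - p) ** 2 for r, p in zip(reference, predicted))
    ss_tot = sum((r - mean) ** 2 for r in reference)
    return 1.0 - ss_res / ss_tot
```

Tightening `max_uncertainty` and `max_surprise` sends more items to human graders and leaves a smaller but more trustworthy set of accepted AI grades. This is the trade-off the reported figures reflect. Applying `r_squared` to only the accepted grades, with human grades as the reference, gives one way to measure agreement at a given share of the grading load.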