Abstract: This study explores the use of artificial intelligence in grading high-stakes physics exams, emphasizing the application of psychometric methods, particularly Item Response Theory (IRT), to evaluate the reliability of AI-assisted grading. We examine how grading rubrics can be iteratively refined and how threshold parameters can determine when AI-generated grades are reliable versus when human intervention is necessary. By adjusting thresholds for correctness measures and uncertainty, AI can grade with high precision, significantly reducing grading workloads while maintaining accuracy. Our findings show that AI can achieve a coefficient of determination of $R^2\approx 0.91$ when handling half of the grading load, and $R^2 \approx 0.96$ for one-fifth of the load. These results demonstrate AI's potential to assist in grading large-scale assessments, reducing both human effort and associated costs. However, the study underscores the importance of human oversight in cases of uncertainty or complex problem-solving, ensuring the integrity of the grading process.
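The threshold idea in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the data are invented, the uncertainty measure is the spread across repeated AI grading runs (a stand-in, not the study's Bayesian measure), and the names `triage`, `r_squared`, and `max_std` are hypothetical. $R^2$ is computed here as $1 - SS_\mathrm{res}/SS_\mathrm{tot}$, treating the AI score as a prediction of the human score.

```python
import numpy as np


def r_squared(human, ai):
    """Coefficient of determination, treating AI scores as predictions of human scores."""
    human = np.asarray(human, dtype=float)
    ai = np.asarray(ai, dtype=float)
    ss_res = np.sum((human - ai) ** 2)
    ss_tot = np.sum((human - human.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def triage(ai_runs, max_std):
    """Return the mean AI score per item and a mask of items accepted without human review.

    ai_runs: array of shape (n_items, n_runs) with repeated AI scores in [0, 1].
    max_std: hypothetical uncertainty threshold; items whose scores vary more
             than this across runs are routed to a human grader.
    """
    ai_runs = np.asarray(ai_runs, dtype=float)
    mean = ai_runs.mean(axis=1)
    std = ai_runs.std(axis=1)  # population standard deviation across runs
    accepted = std <= max_std
    return mean, accepted


# Invented example data: six graded items, three AI runs each.
ai_runs = np.array([
    [1.0, 1.0, 1.0],
    [0.0, 0.0, 0.0],
    [0.5, 0.5, 0.5],
    [1.0, 0.0, 0.5],     # runs disagree: send to a human
    [0.0, 1.0, 0.0],     # runs disagree: send to a human
    [0.75, 0.75, 0.75],
])
human = np.array([1.0, 0.0, 0.5, 1.0, 0.0, 1.0])

mean, accepted = triage(ai_runs, max_std=0.1)
share = accepted.mean()                          # fraction of the load handled by AI
r2 = r_squared(human[accepted], mean[accepted])  # agreement on the accepted subset

print(f"AI handles {share:.0%} of items, R^2 on accepted items = {r2:.3f}")
```

Tightening `max_std` shrinks the share of items the AI handles and typically raises agreement on what remains, which is the kind of trade-off behind the two reported operating points.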

What problem does this paper attempt to address?

This paper addresses the reliability and validity of using artificial intelligence (AI) to assist in scoring physics exams. Specifically, the research applies psychometric methods, especially Item Response Theory (IRT), to evaluate the reliability of AI-assisted scoring, and explores how adjusting scoring criteria and uncertainty thresholds can determine when AI scoring is reliable and when human intervention is required. The research aims to reduce the scoring workload through AI while maintaining scoring accuracy, but emphasizes the importance of human supervision in cases of uncertainty or complex problem-solving to ensure the integrity of the scoring process.

### Research Background and Motivation

With the rapid development of artificial intelligence technology, especially the strong performance of large language models (such as GPT-4) in academic tasks, AI is increasingly applied in physics education, including assisting in teaching research, problem-solving, constructing new problems, and initial scoring attempts. However, the reliability of AI in scoring complex problem-solving processes remains a concern, particularly since the European Union regards "AI systems for evaluating learning outcomes" as "high-risk" systems and requires human supervision.

### Research Questions and Hypotheses

1. **Optimization of Scoring Rules**:
   - Researchers attempt to use classical and item-response psychometric methods to evaluate the scoring validity of the AI-based scoring process, including scoring rules (prompts) and scoring judgments (reasoning and performance).
   - Hypothesis 1: Without referring to specific answers or ground-truth results, psychometric methods can be used to iteratively improve scoring rules.
2. **Scoring Acceptance**:
   - Researchers need to evaluate the validity of AI scoring results for each student on each item and automatically decide whether to accept the AI scoring results or submit them for human evaluation.
   - Hypothesis 2: Without referring to specific answers or ground-truth results, the predictive ability of the Bayesian model can be used to evaluate the validity of AI scoring judgments for each student on each item.

### Methodology

1. **Workflow**:
   - Handwritten test papers are first scanned into PDF files and then converted into LaTeX format using GPT-4o for OCR.
   - GPT-4-Turbo scores each question part, and the scoring prompts are iteratively adjusted according to statistical evaluation results (such as standard deviation, IRT, and correlation).
   - Finally, Python programs using the Scikit and Pandas libraries perform IRT estimation and analyze the sparse data sets.
2. **Optical Character Recognition (OCR)**:
   - GPT-4o converts handwritten images into LaTeX format, preserving the accuracy of text and formulas and describing graphic content in detail.
3. **Scoring Rules**:
   - Scoring rules are based on sample solutions provided by teaching assistants, translated into English and marked with unique identifiers.
   - The LLM is prompted to award partial scores from 0% to 100%, even though teaching assistants usually make binary decisions (full marks or zero marks).
4. **Scoring**:
   - GPT-4-Turbo performs the scoring according to the scoring rules, with special attention to the existence and quality of the work shown.

### Conclusions

The research results show that AI can achieve $R^2\approx 0.91$ when handling half of the scoring workload and $R^2\approx 0.96$ when handling one-fifth of the scoring workload. These results demonstrate the potential of AI in large-scale assessment, where it can significantly reduce manpower and costs, but the study emphasizes the importance of human supervision in cases of uncertainty or complex problem-solving.
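For reference, the summary does not state which IRT model the study estimates. A common choice, shown here purely as an assumption, is the two-parameter logistic (2PL) model, which gives the probability that student $i$ receives credit on item $j$:

$$P(X_{ij}=1 \mid \theta_i) = \frac{1}{1+\exp\left[-a_j(\theta_i - b_j)\right]}$$

Here $\theta_i$ is the student's ability, $b_j$ the item's difficulty, and $a_j$ its discrimination (all three symbols are introduced for this illustration). Because the model is dichotomous, the partial scores from 0% to 100% described in the methodology would have to be thresholded first, or a graded-response variant used instead.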

Assessing Confidence in AI-Assisted Grading of Physics Exams through Psychometrics: An Exploratory Study

Can an AI-tool grade assignments in an introductory physics course?

Grading Assistance for a Handwritten Thermodynamics Exam using Artificial Intelligence: An Exploratory Study

Beyond human subjectivity and error: a novel AI grading system

Using AI Large Language Models for Grading in Education: A Hands-On Test for Physics

Grading the Graders: Comparing Generative AI and Human Assessment in Essay Evaluation

Applying IRT to Distinguish Between Human and Generative AI Responses to Multiple-Choice Assessments

Auto-assessment of assessment: A conceptual framework towards fulfilling the policy gaps in academic assessment practices

Examining the responsible use of zero-shot AI approaches to scoring essays

AI-assisted Automated Short Answer Grading of Handwritten University Level Mathematics Exams

Towards Trustworthy AutoGrading of Short, Multi-lingual, Multi-type Answers

The Impact of AI in Physics Education: A Comprehensive Review from GCSE to University Levels

Automated Assessment of Multimodal Answer Sheets in the STEM domain

The Accuracy of AI-Based Automatic Proctoring in Online Exams

An AI-Based System for Formative and Summative Assessment in Data Science Courses

Advantages and pitfalls in utilizing artificial intelligence for crafting medical examinations: a medical education pilot study with GPT-4

Evaluating GPT-4 at Grading Handwritten Solutions in Math Exams

Assessing Student Errors in Experimentation Using Artificial Intelligence and Large Language Models: A Comparative Study with Human Raters

Evaluating AI and Human Authorship Quality in Academic Writing through Physics Essays

Evaluating the ethics of machines assessing humans The case of AQA: An assessment organisation and exam board in England

Cheat sites and artificial intelligence usage in online introductory physics courses: what is the extent and what effect does it have on assessments?