Abstract: Text-based open-ended questions in academic formative and summative assessments help students become deep learners and prepare them to understand concepts for a subsequent conceptual assessment. However, grading text-based questions, especially in large (>50 enrolled students) courses, is tedious and time-consuming for instructors. Text processing models continue to progress with the rapid development of Artificial Intelligence (AI) tools and Natural Language Processing (NLP) algorithms. Particularly since the breakthroughs in Large Language Models (LLMs), there is immense potential to automate rapid assessment and feedback of text-based responses in education. This systematic review adopts a scientific and reproducible literature search strategy based on the PRISMA process, using explicit inclusion and exclusion criteria, to study text-based automatic assessment systems in post-secondary education, screening 838 papers and synthesizing 93 studies. To understand how text-based automatic assessment systems have been developed and applied in education in recent years, three research questions are considered: 1) What types of automated assessment systems can be identified using an input, output, and processing framework? 2) What are the educational focus and research motivations of studies with automated assessment systems? 3) What are the reported research outcomes in automated assessment systems and the next steps for educational applications? All included studies are summarized and categorized according to a proposed comprehensive framework, covering the input and output of the system, research motivation, and research outcomes, in order to answer the research questions. Additionally, the typical studies of automated assessment systems, research methods, and application domains in these studies are investigated and summarized.
This systematic review provides an overview of recent educational applications of text-based assessment systems, helping readers understand the latest AI/NLP developments that assist text-based assessment in higher education. The findings will particularly benefit researchers and educators incorporating LLMs such as ChatGPT into their educational activities.

Towards LLM-based Autograding for Short Textual Answers

Towards Trustworthy AutoGrading of Short, Multi-lingual, Multi-type Answers

LLM-based automatic short answer grading in undergraduate medical education

Can LLMs Grade Short-Answer Reading Comprehension Questions: An Empirical Study with a Novel Dataset

Using Large Language Models for Automated Grading of Student Writing about Science

A LLM-Powered Automatic Grading Framework with Human-Level Guidelines Optimization

Grade Like a Human: Rethinking Automated Assessment with Large Language Models

AI-assisted Automated Short Answer Grading of Handwritten University Level Mathematics Exams

Unveiling Scoring Processes: Dissecting the Differences between LLMs and Human Graders in Automatic Scoring

Human-AI Collaborative Essay Scoring: A Dual-Process Framework with LLMs

Large Language Models in Student Assessment: Comparing ChatGPT and Human Graders

Automatic assessment of text-based responses in post-secondary education: A systematic review

Performance of the pre-trained large language model GPT-4 on automated short answer grading

Short Answer Grading Using One-shot Prompting and Text Similarity Scoring Model

Beyond human subjectivity and error: a novel AI grading system

Automatic Short Answer Grading via Multiway Attention Networks

Are Large Language Models Good Essay Graders?

Large Language Models As MOOCs Graders

Automatic short answer grading and feedback using text mining methods

Evaluating Students' Open-ended Written Responses with LLMs: Using the RAG Framework for GPT-3.5, GPT-4, Claude-3, and Mistral-Large

LLM examiner: automating assessment in informal self-directed e-learning using ChatGPT