Evaluating AI-Generated Essays with GRE Analytical Writing Assessment

Yang Zhong, Jiangang Hao, Michael Fauss, Chen Li, Yuan Wang
2024-10-24
Abstract: The recent revolutionary advance in generative AI enables the generation of realistic and coherent texts by large language models (LLMs). Despite many existing evaluation metrics on the quality of the generated texts, there is still a lack of rigorous assessment of how well LLMs perform in complex and demanding writing assessments. This study examines essays generated by ten leading LLMs for the analytical writing assessment of the Graduate Record Exam (GRE). We assessed these essays using both human raters and the e-rater automated scoring engine as used in the GRE scoring pipeline. Notably, the top-performing Gemini and GPT-4o received an average score of 4.78 and 4.67, respectively, falling between "generally thoughtful, well-developed analysis of the issue and conveys meaning clearly" and "presents a competent analysis of the issue and conveys meaning with acceptable clarity" according to the GRE scoring guideline. We also evaluated the detection accuracy of these essays, with detectors trained on essays generated by the same and different LLMs.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper evaluates the quality of GRE Analytical Writing essays generated by large language models (LLMs) and explores whether these essays can be detected as AI-generated. Specifically, the researchers focus on the following aspects:

1. **Evaluating the quality of essays generated by LLMs**:
   - Score essays generated by ten leading LLMs with both human raters and ETS's automated scoring engine, e-rater®, to evaluate how these models perform on a complex and demanding writing task.
   - Compare essays from different LLMs in terms of adherence to the specified essay length, language features, text similarity, and perplexity.
2. **Detecting AI-generated essays**:
   - Investigate whether essays generated by different LLMs can be detected, in particular how well detectors trained on essays from the same LLM or from different LLMs identify AI-generated essays.
   - Detect AI-generated essays with machine learning classifiers (such as XGBoost) that use language features produced by e-rater® and perplexity-based features.

### Research Background

In recent years, revolutionary progress in generative AI has enabled LLMs to generate realistic and coherent texts. Although many metrics already exist for evaluating the quality of generated texts, there is still a lack of rigorous assessment of how LLMs perform in complex and demanding writing assessments. This study aims to fill that gap by using the GRE Analytical Writing test to evaluate essays generated by LLMs.

### Main Research Questions

1. **RQ1**: What is the score distribution of LLM-generated essays when they are evaluated with the GRE Analytical Writing scoring framework?
2. **RQ2**: How do essays generated by different LLMs differ in adherence to the specified essay length, language features, text similarity, and perplexity?
3. **RQ3**: How effective are detectors trained on language and perplexity features at identifying essays generated by the same LLM or by different LLMs?

### Methods

1. **Selection of LLMs**: The study covers the latest LLMs as of mid-2024, including GPT-4o, Gemini, Llama3-8b, etc., as well as older models from mid-2023, such as GPT-4, GPT-3.5 Turbo, Google's Bard, Llama, Vicuna, and Koala.
2. **Prompts and Essay Generation**: Two writing topics are randomly selected from the GRE writing test, and each LLM generates 100 essays per topic, for a total of 2,000 essays.
3. **Evaluation and Detection**:
   - **Human Scoring**: Expert raters from the GRE program score the essays according to the GRE scoring framework.
   - **e-rater® Scoring**: ETS's automated scoring engine, e-rater®, scores the essays.
   - **Detector Performance**: Machine learning classifiers (such as XGBoost) that use language features produced by e-rater® and perplexity-based features are used to detect AI-generated essays.

### Results

1. **Human Scoring**: Human raters generally gave higher scores to essays from proprietary LLMs (such as the GPT models and Gemini), with average scores between 4 and 5.
2. **e-rater® Scoring**: The e-rater® results also show proprietary LLMs outperforming open-source LLMs, though the scores are higher overall than those from human raters.
3. **Essay Length, Similarity, Language Features, and Perplexity**:
   - **Essay Length**: Newer models adhere more closely to the 500-word length requirement.
   - **Similarity**: Essays generated by different LLMs differ in both semantic and word-for-word similarity.
   - **Language Features**: LLMs perform well in grammar, mechanics, usage, and style, but still lag behind human writing in organization, development, and vocabulary complexity.
   - **Perplexity**: Human writing has higher perplexity than AI-generated essays, and essays from newer models have slightly higher perplexity than those from older models.

### Detection Results

- **Intra-model Detection**: Classifiers using perplexity features are slightly better than those using e-rater® language features, but in some models (such as GPT-4o and Gemini),