Abstract: The recent revolutionary advance in generative AI enables the generation of realistic and coherent texts by large language models (LLMs). Despite many existing evaluation metrics on the quality of the generated texts, there is still a lack of rigorous assessment of how well LLMs perform in complex and demanding writing assessments. This study examines essays generated by ten leading LLMs for the analytical writing assessment of the Graduate Record Exam (GRE). We assessed these essays using both human raters and the e-rater automated scoring engine as used in the GRE scoring pipeline. Notably, the top-performing Gemini and GPT-4o received an average score of 4.78 and 4.67, respectively, falling between "generally thoughtful, well-developed analysis of the issue and conveys meaning clearly" and "presents a competent analysis of the issue and conveys meaning with acceptable clarity" according to the GRE scoring guideline. We also evaluated the detection accuracy of these essays, with detectors trained on essays generated by the same and different LLMs.

What problem does this paper attempt to address?

This paper evaluates the quality of GRE Analytical Writing essays generated by large language models (LLMs) and explores whether these essays can be detected as AI-generated. Specifically, the researchers focus on the following aspects:

1. **Evaluating the quality of essays generated by LLMs**:
   - Use human raters and ETS's automated scoring engine e-rater® to score essays generated by ten leading LLMs, evaluating how these models perform on a complex and demanding writing task.
   - Compare the essays generated by different LLMs in terms of adherence to the length specified by the topic, language features, text similarity, and perplexity.
2. **Detecting AI-generated essays**:
   - Investigate whether essays generated by different LLMs can be detected, in particular how effective detectors are when trained on essays from the same LLM versus a different LLM.
   - Detect AI-generated essays with machine learning classifiers (such as XGBoost) based on language features produced by e-rater® and on perplexity-based features.

### Research Background

In recent years, the rapid progress of generative AI has enabled large language models to generate realistic and coherent texts. Although many metrics already exist for evaluating the quality of generated texts, there is a lack of rigorous evaluation of how LLMs perform in complex and demanding writing assessments. This study aims to fill this gap by using the GRE Analytical Writing test to evaluate essays generated by LLMs.

### Main Research Questions

1. **RQ1**: When essays generated by LLMs are evaluated with the GRE Analytical Writing scoring framework, what is the score distribution?
2. **RQ2**: How do essays generated by different LLMs differ in terms of adherence to the length specified by the topic, language features, text similarity, and perplexity?
3. **RQ3**: How effective are detectors trained with language and perplexity features in identifying essays generated by the same LLM or by different LLMs?

### Methods

1. **Selection of LLMs**: The study covers the latest LLMs as of mid-2024, including GPT-4o, Gemini, Llama3-8b, etc., as well as older models from mid-2023, such as GPT-4, GPT-3.5 Turbo, Google's Bard, Llama, Vicuna, and Koala.
2. **Prompts and Essay Generation**: Two writing topics are randomly selected from the GRE writing test, and each LLM generates 100 essays per topic, for a total of 2,000 essays.
3. **Evaluation and Detection**:
   - **Human Scoring**: Expert raters from the GRE program score the essays according to the GRE scoring framework.
   - **e-rater® Scoring**: ETS's automated scoring engine e-rater® is used to score the essays.
   - **Detector Performance**: Machine learning classifiers (such as XGBoost) based on language features produced by e-rater® and on perplexity-based features are used to detect AI-generated essays.

### Results

1. **Human Scoring**: Human ratings show that essays generated by proprietary LLMs (such as the GPT models and Gemini) generally score higher, with average scores of 4 to 5.
2. **e-rater® Scoring**: The e-rater® results show that proprietary LLMs still outperform open-source LLMs, though scores are higher overall than those given by human raters.
3. **Essay Length, Similarity, Language Features, and Perplexity**:
   - **Essay Length**: Newer models adhere more closely to the 500-word length requirement.
   - **Similarity**: Essays generated by different LLMs differ in semantic and word-for-word similarity.
   - **Language Features**: LLMs perform well in grammar, mechanics, usage, and style, but still lag behind human writing in organization, development, and vocabulary complexity.
   - **Perplexity**: The perplexity of human writing is higher than that of AI-generated essays, and the perplexity of newer models is slightly higher than that of older models.

### Detection Results

- **Intra-model Detection**: Classifiers using perplexity features are slightly better than those using e-rater® language features, but in some models (such as GPT-4o and Gemini),
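
The perplexity-based detection described in this summary can be illustrated with a minimal sketch. It is not the study's pipeline: the study reports using classifiers such as XGBoost on e-rater® and perplexity features, whereas this sketch swaps in a single-feature midpoint threshold as a stand-in. The function names (`perplexity`, `fit_threshold`, `detection_accuracy`) and all numeric values are hypothetical and chosen only for illustration. The sketch relies on the reported observation that human writing tends to have higher perplexity than AI-generated essays, and it shows why a detector fitted on one LLM's essays may do worse on another LLM's essays.

```python
import math
from statistics import mean
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity = exp(-mean natural-log probability per token)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def fit_threshold(human_ppl: Sequence[float], ai_ppl: Sequence[float]) -> float:
    """Toy 'training': place the cutoff midway between the two class means.

    Assumption: this stands in for a trained classifier (the study mentions
    XGBoost); it is not the method used in the paper.
    """
    return (mean(human_ppl) + mean(ai_ppl)) / 2.0


def detection_accuracy(
    human_ppl: Sequence[float], ai_ppl: Sequence[float], threshold: float
) -> float:
    """Fraction classified correctly when 'below threshold' means AI-generated."""
    correct = sum(p >= threshold for p in human_ppl)  # human: high perplexity
    correct += sum(p < threshold for p in ai_ppl)     # AI: low perplexity
    return correct / (len(human_ppl) + len(ai_ppl))


if __name__ == "__main__":
    # A model that assigns probability 0.25 to every token has perplexity ~4.
    print(perplexity([math.log(0.25)] * 4))

    # Made-up perplexity values, NOT results from the study.
    human = [30.0, 40.0]
    model_a = [10.0, 12.0]   # hypothetical older LLM
    model_b = [20.0, 28.0]   # hypothetical newer LLM, closer to human values

    cutoff = fit_threshold(human, model_a)
    print("intra-model accuracy:", detection_accuracy(human, model_a, cutoff))
    print("cross-model accuracy:", detection_accuracy(human, model_b, cutoff))
```

In practice the per-token log-probabilities would come from a scoring language model, and the threshold rule would be replaced by a classifier trained on many features. The sketch only makes the intra-model versus cross-model comparison concrete.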

Evaluating AI-Generated Essays with GRE Analytical Writing Assessment
