Large Language Models on Wikipedia-Style Survey Generation: an Evaluation in NLP Concepts

Fan Gao, Hang Jiang, Rui Yang, Qingcheng Zeng, Jinghui Lu, Moritz Blum, Dairui Liu, Tianwei She, Yuang Jiang, Irene Li

2024-05-23

Abstract: Educational materials such as survey articles in specialized fields like computer science traditionally require tremendous expert input and are therefore expensive to create and update. Recently, Large Language Models (LLMs) have achieved significant success across various general tasks. However, their effectiveness and limitations in the education domain are yet to be fully explored. In this work, we examine the proficiency of LLMs in generating succinct survey articles specific to the niche field of NLP in computer science, focusing on a curated list of 99 topics. Automated benchmarks reveal that GPT-4 surpasses its predecessors, including GPT-3.5, PaLM2, and LLaMa2, by margins ranging from 2% to 20% in comparison to the established ground truth. We compare both human and GPT-based evaluation scores and provide in-depth analysis. While our findings suggest that GPT-created surveys are more contemporary and accessible than human-authored ones, certain limitations were observed. Notably, GPT-4, despite often delivering outstanding content, occasionally exhibited lapses like missing details or factual errors. Finally, we compared the rating behavior between humans and GPT-4 and found systematic bias in using GPT evaluation.

Computation and Language

What problem does this paper attempt to address?

The paper explores how well large language models (LLMs) can generate survey articles on specific concepts within the field of computer science. Specifically, the research focuses on the following three research questions (RQs):

1. **How capable are LLMs in generating survey articles on NLP concepts?** The researchers evaluated different models (GPT-4, GPT-3.5, PaLM2, and LLaMa2) by comparing the surveys they generated on specific NLP topics. GPT-4 performed best in most cases, but it still showed shortcomings such as omitted information or factual errors.
2. **Can LLMs simulate human judgment given specific criteria?** Experiments indicated that, when provided with detailed guidelines, the LLM's judgments were highly consistent with those of human experts. However, certain biases were also found, particularly regarding completeness of detail and accuracy.
3. **Do LLMs introduce significant bias when evaluating machine-generated texts?** The results showed that GPT-4 exhibited preference bias when evaluating machine-generated articles, tending to give higher scores to content it generated itself. This suggests that fully replacing human experts with GPT-4 for evaluation remains problematic.

Overall, although LLMs can generate high-quality survey articles when following specific guidelines, further improvements are needed in certain areas. Verifying factual accuracy in particular still requires the involvement of human experts.
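One way to picture the comparison of human and GPT-4 rating behavior is a minimal sketch like the following. It is not the paper's evaluation code: the function names, the 1-to-5 scale, and the sample ratings are all hypothetical and chosen only for illustration. It computes the mean signed difference between paired model and human scores (a positive value would mean the model rates higher on average) and their Pearson correlation.

```python
from statistics import mean


def rating_bias(human_scores, model_scores):
    """Mean signed difference (model minus human) over paired ratings."""
    if len(human_scores) != len(model_scores) or not human_scores:
        raise ValueError("need two non-empty lists of equal length")
    return mean(m - h for h, m in zip(human_scores, model_scores))


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Made-up ratings on an assumed 1-5 scale; NOT data from the paper.
human = [3, 4, 2, 5, 3]
model = [4, 5, 3, 5, 4]

print("mean bias (model - human):", rating_bias(human, model))
print("Pearson correlation:", round(pearson(human, model), 3))
```

A high correlation together with a clearly non-zero mean difference would correspond to the situation the summary describes: the model ranks articles much as humans do while scoring them systematically higher or lower.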

A Large Language Model Approach to Educational Survey Feedback Analysis

Rethinking Generative Large Language Model Evaluation for Semantic Comprehension

A Systematic Evaluation of Large Language Models for Natural Language Generation Tasks

Large Language Models Meet NLP: A Survey

A Survey on Evaluation of Large Language Models

AutoSurvey: Large Language Models Can Automatically Write Surveys

A Survey of Large Language Models

Large Language Models: A Survey

Large Language Models for Data Annotation and Synthesis: A Survey

Evaluating Large Language Models: A Comprehensive Survey

Large Language Model as an Assignment Evaluator: Insights, Feedback, and Challenges in a 1000+ Student Course

Large Language Models for Education: A Survey and Outlook

An Evaluation of Large Language Models in Bioinformatics Research

LLM4DS: Evaluating Large Language Models for Data Science Code Generation

Survey of different Large Language Model Architectures: Trends, Benchmarks, and Challenges

Evaluating Large Language Models on Graphs: Performance Insights and Comparative Analysis

Comparison of Large Language Models in Generating Machine Learning Curricula in High Schools

A Closer Look into Using Large Language Models for Automatic Evaluation