Large Language Models on Wikipedia-Style Survey Generation: an Evaluation in NLP Concepts

Fan Gao, Hang Jiang, Rui Yang, Qingcheng Zeng, Jinghui Lu, Moritz Blum, Dairui Liu, Tianwei She, Yuang Jiang, Irene Li
2024-05-23
Abstract: Educational materials such as survey articles in specialized fields like computer science traditionally require tremendous expert input and are therefore expensive to create and update. Recently, Large Language Models (LLMs) have achieved significant success across various general tasks. However, their effectiveness and limitations in the education domain are yet to be fully explored. In this work, we examine the proficiency of LLMs in generating succinct survey articles specific to the niche field of NLP in computer science, focusing on a curated list of 99 topics. Automated benchmarks reveal that GPT-4 surpasses its predecessors, including GPT-3.5, PaLM2, and LLaMa2, by margins ranging from 2% to 20% in comparison to the established ground truth. We compare both human and GPT-based evaluation scores and provide in-depth analysis. While our findings suggest that GPT-created surveys are more contemporary and accessible than human-authored ones, certain limitations were observed. Notably, GPT-4, despite often delivering outstanding content, occasionally exhibited lapses like missing details or factual errors. Finally, we compared the rating behavior between humans and GPT-4 and found systematic bias in using GPT evaluation.
Computation and Language
What problem does this paper attempt to address?
The paper explores how well large language models (LLMs) can generate survey articles on specific concepts in computer science, using NLP topics as the test case. It is organized around three research questions (RQs):

1. **How capable are LLMs in generating survey articles on NLP concepts?** The researchers compared several models (GPT-4, GPT-3.5, PaLM2, and LLaMa2) on generating surveys for specific NLP topics. GPT-4 performed best in most cases, but its output still showed shortcomings such as omitted information and factual errors.
2. **Can LLMs simulate human judgment given specific criteria?** When given detailed guidelines, LLM evaluations agreed closely with human expert judgments. However, certain biases were also found, particularly when judging completeness of detail and accuracy.
3. **Do LLMs introduce significant bias when evaluating machine-generated texts?** GPT-4 exhibited a preference bias when evaluating machine-generated articles, tending to give higher scores to content it had generated itself. Fully replacing human experts with GPT-4 as an evaluator therefore remains problematic.

Overall, LLMs can generate high-quality survey articles when they follow specific guidelines, but they still need improvement in some areas. Verifying factual accuracy in particular still requires human experts.
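As a rough illustration of the third research question, the sketch below computes the average gap between model-assigned and human-assigned scores on the same set of articles. It is not the paper's evaluation procedure: the function name `mean_rating_gap`, the 1-to-5 scale, and the toy scores are all assumptions made for this example.

```python
from statistics import mean


def mean_rating_gap(human_scores, model_scores):
    """Return the average of (model score - human score) over paired ratings.

    A positive value means the model rated the articles higher than the
    human raters did, on average.
    """
    if len(human_scores) != len(model_scores) or not human_scores:
        raise ValueError("need two non-empty score lists of equal length")
    return mean(m - h for h, m in zip(human_scores, model_scores))


# Hypothetical 1-5 ratings for the same four generated surveys.
# These numbers are invented for illustration and are not from the paper.
human = [4, 3, 5, 3]
gpt4 = [5, 4, 5, 4]

print(mean_rating_gap(human, gpt4))
```

A gap that stays consistently positive across many articles would be one simple signal of the kind of systematic rating bias the summary describes. A fuller analysis would also look at per-criterion scores and at agreement between raters.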