Abstract: This study explores the effectiveness of Large Language Models (LLMs) for Automatic Question Generation in educational settings. Three LLMs are compared in their ability to create questions from university slide text without fine-tuning. Questions were obtained in a two-step pipeline: first, answer phrases were extracted from slides using Llama 2-Chat 13B; then, the three models generated questions for each answer. To analyze whether the questions would be suitable in educational applications for students, a survey was conducted with 46 students who evaluated a total of 246 questions across five metrics: clarity, relevance, difficulty, slide relation, and question-answer alignment. Results indicate that GPT-3.5 and Llama 2-Chat 13B outperform Flan T5 XXL by a small margin, particularly in terms of clarity and question-answer alignment. GPT-3.5 especially excels at tailoring questions to match the input answers. The contribution of this research is the analysis of the capacity of LLMs for Automatic Question Generation in education.

What problem does this paper attempt to address?

This paper evaluates and compares how effectively large language models (LLMs) generate contextually relevant questions in educational scenarios. Specifically, the researchers use the text of university course slides as context and generate questions suitable for students with different LLMs. The main questions addressed in the paper can be summarized as follows:

1. **How can LLMs be used to generate high-quality questions related to the context of educational materials?**
   - The researchers designed a two-step pipeline: first, extract answer phrases from the slides; then, use three different LLMs (GPT-3.5 Turbo, Flan T5 XXL, and Llama 2-Chat 13B) to generate a question for each answer.
2. **How do different LLMs differ in their performance at generating educational questions?**
   - To evaluate question quality, the researchers conducted a survey in which 46 students rated 246 generated questions. The evaluation metrics are clarity, relevance, difficulty, slide relation, and question-answer alignment.
3. **Can the questions generated by these LLMs meet the needs of educational applications?**
   - The results show that GPT-3.5 Turbo and Llama 2-Chat 13B outperform Flan T5 XXL by a small margin, particularly in clarity and question-answer alignment. However, all models can still generate high-quality questions without fine-tuning, which indicates that LLMs have great potential for application in education.

In general, this paper aims to explore and verify the ability of LLMs to automatically generate contextually relevant questions in educational scenarios, and it provides specific performance comparisons and application suggestions through empirical research.
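The two-step pipeline described above (extract answer phrases, then generate one question per answer with each model) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the prompt templates, the `Model` callable type, and the function names `extract_answers` and `generate_questions` are all assumptions made for this example, and the paper's actual prompts are not reproduced here.

```python
from typing import Callable, Dict, List

# A "model" here is any function that maps a prompt string to generated text.
# In practice it would wrap an API or local inference call (an assumption).
Model = Callable[[str], str]

# Hypothetical prompt templates, written only for this sketch.
EXTRACT_PROMPT = (
    "Extract short answer phrases from the following slide text. "
    "Return one phrase per line.\n\nSlide:\n{slide}"
)
QUESTION_PROMPT = (
    "Slide:\n{slide}\n\nAnswer: {answer}\n\n"
    "Write one question about the slide whose answer is the phrase above."
)


def extract_answers(slide: str, extractor: Model) -> List[str]:
    """Step 1: ask the extractor model for answer phrases, one per line."""
    raw = extractor(EXTRACT_PROMPT.format(slide=slide))
    # Strip list bullets and whitespace; drop lines that end up empty.
    phrases = (line.strip("-• \t") for line in raw.splitlines())
    return [phrase for phrase in phrases if phrase]


def generate_questions(
    slide: str, extractor: Model, generators: Dict[str, Model]
) -> List[dict]:
    """Step 2: for every extracted answer, ask each model for a question."""
    rows = []
    for answer in extract_answers(slide, extractor):
        for name, model in generators.items():
            prompt = QUESTION_PROMPT.format(slide=slide, answer=answer)
            rows.append(
                {"model": name, "answer": answer, "question": model(prompt).strip()}
            )
    return rows
```

Keeping the models behind a plain callable means the same loop can drive all three generators, so each answer phrase yields one question per model and the outputs can be compared side by side, as in the survey.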

Comparison of Large Language Models for Generating Contextually Relevant Questions

Supervised Knowledge Makes Large Language Models Better In-context Learners

Leveraging Large Language Models for Multiple Choice Question Answering

Research on the Application of Large Language Models in Automatic Question Generation: A Case Study of ChatGLM in the Context of High School Information Technology Curriculum

Spoken Language Intelligence of Large Language Models for Language Learning

Analyzing Large Language Models for Classroom Discussion Assessment

Leveraging Large Language Models to Generate Course-specific Semantically Annotated Learning Objects

A Large Language Model Approach to Educational Survey Feedback Analysis

Dr.Academy: A Benchmark for Evaluating Questioning Capability in Education for Large Language Models

Can Large Language Models Make the Grade? An Empirical Study Evaluating LLMs Ability to Mark Short Answer Questions in K-12 Education

A Systematic Evaluation of Large Language Models for Natural Language Generation Tasks

Survey of different Large Language Model Architectures: Trends, Benchmarks, and Challenges

Automated Educational Question Generation at Different Bloom's Skill Levels using Large Language Models: Strategies and Evaluation

Application of Large Language Models in Automated Question Generation: A Case Study on ChatGLM's Structured Questions for National Teacher Certification Exams

LLM4DS: Evaluating Large Language Models for Data Science Code Generation

Comparison of Large Language Models in Generating Machine Learning Curricula in High Schools

Investigating Answerability of LLMs for Long-Form Question Answering

Enhancing Large Language Model Performance To Answer Questions and Extract Information More Accurately

Evaluating Large Language Models in Analysing Classroom Dialogue

Examining Long-Context Large Language Models for Environmental Review Document Comprehension

Adapting Large Language Models for Education: Foundational Capabilities, Potentials, and Challenges