Evaluating the Impact of Advanced LLM Techniques on AI-Lecture Tutors for a Robotics Course

Sebastian Kahl, Felix Löffler, Martin Maciol, Fabian Ridder, Marius Schmitz, Jennifer Spanagel, Jens Wienkamp, Christopher Burgahn, Malte Schilling
2024-08-03
Abstract: This study evaluates the performance of Large Language Models (LLMs) as Artificial Intelligence-based tutors for a university course. In particular, it applies several advanced techniques: prompt engineering, Retrieval-Augmented Generation (RAG), and fine-tuning. We assessed the different models and applied techniques using common similarity metrics such as BLEU-4, ROUGE, and BERTScore, complemented by a small human evaluation of helpfulness and trustworthiness. Our findings indicate that RAG combined with prompt engineering significantly enhances model responses and produces better factual answers. In the context of education, RAG appears to be an ideal technique because it enriches the input of the model with additional information and material, which is usually already available for a university course. Fine-tuning, on the other hand, can produce quite small yet still strong expert models, but poses the danger of overfitting. Our study further asks how the performance of LLMs should be measured and how well current measurements represent correctness or relevance. We find high correlation among the similarity metrics and a bias of most of these metrics towards shorter responses. Overall, our research points to both the potential and the challenges of integrating LLMs in educational settings, suggesting a need for balanced training approaches and advanced evaluation frameworks.
Computation and Language, Artificial Intelligence, Computers and Society, Robotics
What problem does this paper attempt to address?
The paper addresses how to evaluate the performance of large language models (LLMs) in university courses, particularly their effectiveness as AI-assisted teaching tools. Specifically, the researchers explore the following questions:

1. **How to improve the quality of LLMs' responses in specific knowledge domains?** The researchers employed several advanced techniques, namely Prompt Engineering, Retrieval-Augmented Generation (RAG), and Fine-Tuning, to enhance the performance of LLMs for a specific course. The aim is for these techniques to reduce the models' hallucinations (generating incorrect or unrealistic answers) and to provide more accurate factual answers.
2. **How to evaluate the performance of LLMs?** Current methods for evaluating LLMs have known issues, such as inconsistent results and difficulty in measuring factual accuracy. The researchers therefore explored how to evaluate LLM performance more comprehensively using different metrics (such as BLEU-4, ROUGE, and BERTScore) and analyzed the relationships and potential biases among these metrics.
3. **How to safely use LLMs in educational environments?** Educational environments require a high degree of factual correctness and credibility. The researchers therefore paid special attention to preventing LLMs from generating incorrect information and to ensuring that their answers are consistent with the course content.

In summary, the main goal of this paper is to explore and evaluate the potential of different techniques for improving the effectiveness and reliability of LLMs as teaching aids, while also proposing suggestions for improving evaluation methods.
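To make the RAG idea concrete, the following is a minimal sketch of how a question might be enriched with course material before it is passed to a model. It is not the authors' pipeline: the word-overlap retriever, the prompt wording, the function names (`retrieve`, `build_prompt`), and the sample `COURSE_NOTES` are all assumptions made for illustration. A real system would more likely use embedding-based retrieval over the actual lecture material.

```python
import re
from collections import Counter

# Hypothetical course snippets; in practice these would come from lecture notes or slides.
COURSE_NOTES = [
    "Forward kinematics computes the end-effector pose from the joint angles.",
    "Inverse kinematics computes joint angles that reach a desired end-effector pose.",
    "A PID controller combines proportional, integral, and derivative terms.",
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_score(query, passage):
    """Count word tokens shared by query and passage (a stand-in for real retrieval scoring)."""
    shared = Counter(tokenize(query)) & Counter(tokenize(passage))
    return sum(shared.values())


def retrieve(query, passages, k=2):
    """Return the k passages with the highest word overlap with the query."""
    ranked = sorted(passages, key=lambda p: overlap_score(query, p), reverse=True)
    return ranked[:k]


def build_prompt(question, passages, k=2):
    """Assemble a prompt that places the retrieved course material before the question."""
    context = "\n".join(f"- {p}" for p in retrieve(question, passages, k))
    return (
        "You are a tutor for a robotics course. "
        "Answer using only the course material below.\n"
        f"Course material:\n{context}\n"
        f"Question: {question}\n"
        "Answer:"
    )


if __name__ == "__main__":
    print(build_prompt("What does inverse kinematics compute?", COURSE_NOTES, k=1))
```

The resulting string would then be sent to whichever LLM is used. Because the model only ever sees the enriched prompt, this step needs no change to the model weights, which is one reason the approach fits settings where course material already exists.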