Generative Language Models Potential for Requirement Engineering Applications: Insights into Current Strengths and Limitations

Summra Saleem, Muhammad Nabeel Asim, Ludger Van Elst, Andreas Dengel

2024-12-02

Abstract: Traditional language models have been extensively evaluated in the software engineering domain; however, the potential of ChatGPT and Gemini has not been fully explored. To fill this gap, this paper presents a comprehensive case study that investigates the potential of both language models for developing diverse types of requirements engineering applications. It explores in depth how prompts carrying varying levels of expert knowledge affect the prediction accuracy of both models. Across 4 public benchmark datasets of requirements engineering tasks, it compares the performance of both models with existing task-specific machine/deep learning predictors and traditional language models. Specifically, the paper uses 4 benchmark datasets: Pure (7,445 samples, requirements extraction), PROMISE (622 samples, requirements classification), REQuestA (300 question-answer (QA) pairs), and Aerospace (6,347 words, requirements NER tagging). Our experiments reveal that, compared to ChatGPT, Gemini requires more careful prompt engineering to provide accurate predictions. On the requirements extraction benchmark, the state-of-the-art F1-score is 0.86, while ChatGPT and Gemini achieve 0.76 and 0.77, respectively. On the requirements classification dataset, the state-of-the-art F1-score is 0.96, and both language models achieve 0.78. On the named entity recognition (NER) task, the state-of-the-art F1-score is 0.92, while ChatGPT achieves 0.36 and Gemini 0.25. On the question answering dataset, the state-of-the-art F1-score is 0.90, while ChatGPT and Gemini achieve 0.91 and 0.88, respectively. Except for question answering, both models underperform current state-of-the-art predictors.
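To make the reported comparison easier to read, the short sketch below tabulates how far each model's F1-score lies from the state of the art on each task. The scores are the ones quoted in the abstract; the names `reported_f1` and `gaps` are illustrative only and do not come from the paper.

```python
# F1-scores quoted in the abstract: (state of the art, ChatGPT, Gemini).
reported_f1 = {
    "requirements extraction (Pure)": (0.86, 0.76, 0.77),
    "requirements classification (PROMISE)": (0.96, 0.78, 0.78),
    "requirements NER (Aerospace)": (0.92, 0.36, 0.25),
    "question answering (REQuestA)": (0.90, 0.91, 0.88),
}


def gaps(scores):
    """Return each model's F1 difference to the state of the art, rounded to 2 places."""
    result = {}
    for task, (sota, chatgpt, gemini) in scores.items():
        result[task] = {
            "ChatGPT": round(chatgpt - sota, 2),
            "Gemini": round(gemini - sota, 2),
        }
    return result


f1_gaps = gaps(reported_f1)
for task, gap in f1_gaps.items():
    # A negative value means the model scores below the state of the art.
    print(f"{task:40s} ChatGPT {gap['ChatGPT']:+.2f}  Gemini {gap['Gemini']:+.2f}")
```

Read this way, the largest shortfall is on NER tagging, and question answering is the only task where either model (ChatGPT) matches or exceeds the state-of-the-art score.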

Software Engineering, Artificial Intelligence

What problem does this paper attempt to address?

The main purpose of this paper is to explore and evaluate the potential of the latest generative language models (such as ChatGPT and Gemini) in requirements engineering applications. Specifically, the paper addresses the following key problems:

1. **Potential of generative language models**: The paper explores whether ChatGPT and Gemini can effectively support various requirements engineering tasks, such as requirements extraction, requirements classification, named entity recognition (NER), and question answering.
2. **Impact of different prompt strategies**: It studies how prompts carrying different levels of domain knowledge affect the prediction accuracy of generative language models. It designs three types of prompts, each containing a different degree of domain knowledge, and analyzes how they affect model performance.
3. **Performance comparison**: On four public benchmark datasets, it compares the performance of ChatGPT and Gemini with existing task-specific machine learning / deep learning predictors and traditional language models. The four datasets are:
   - Pure dataset (7,445 samples, requirements extraction)
   - PROMISE dataset (622 samples, requirements classification)
   - REQuestA dataset (300 question-answer pairs)
   - Aerospace dataset (6,347 words, requirements NER tagging)
4. **Differences in model performance**: The experiments show that Gemini requires more careful prompt engineering than ChatGPT to provide accurate predictions. In requirements extraction, requirements classification, and named entity recognition, both models perform below the current state of the art, but they perform well on question answering.

### Summary

Through a series of experiments and comparative analyses, this paper reveals the potential and limitations of generative language models in requirements engineering applications and provides insights for future research.
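The idea of prompts with increasing amounts of domain knowledge can be sketched as follows. This is a minimal illustration only: the paper's actual prompt wording is not reproduced here, so the templates, the functional / non-functional label set, and the function name `build_prompt` are all assumptions made for the example.

```python
# Hypothetical prompt templates; the wording and labels are assumptions,
# not the prompts used in the paper.
TASK = "Classify the following software requirement as functional or non-functional."

DEFINITIONS = (
    "A functional requirement describes what the system must do. "
    "A non-functional requirement describes a quality or constraint, "
    "such as performance, security, or usability."
)

EXAMPLES = (
    'Example: "The system shall export reports as PDF." -> functional\n'
    'Example: "The system shall respond within 2 seconds." -> non-functional'
)


def build_prompt(requirement: str, level: int) -> str:
    """Build a prompt with 1 (task only), 2 (+ definitions) or 3 (+ examples) knowledge levels."""
    if level not in (1, 2, 3):
        raise ValueError("level must be 1, 2, or 3")
    parts = [TASK]
    if level >= 2:
        parts.append(DEFINITIONS)  # add domain definitions
    if level >= 3:
        parts.append(EXAMPLES)     # add labeled examples
    parts.append(f'Requirement: "{requirement}"\nAnswer:')
    return "\n\n".join(parts)


if __name__ == "__main__":
    print(build_prompt("The system shall encrypt all stored passwords.", level=3))
```

Each level reuses the previous one and adds one more block of knowledge, so any change in prediction accuracy between levels can be attributed to the added block rather than to a change in task wording.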

Investigating ChatGPT's Potential to Assist in Requirements Elicitation Processes

On the assessment of generative AI in modeling tasks: an experience report with ChatGPT and UML

The Battle of LLMs: A Comparative Study in Conversational QA Tasks

An In-depth Look at Gemini's Language Abilities

Analyzing Large language models chatbots: An experimental approach using a probability test

Gemini vs GPT-4V: A Preliminary Comparison and Combination of Vision-Language Models Through Qualitative Cases

Comparative Analysis of CHATGPT and the evolution of language models

Large Language Models Are State-of-the-Art Evaluators of Translation Quality

Beyond Code Generation: An Observational Study of ChatGPT Usage in Software Engineering Practice

Evaluation of the Programming Skills of Large Language Models

Large Language Models: Their Success and Impact

Empirical Evaluation of ChatGPT on Requirements Information Retrieval Under Zero-Shot Setting

ChatGPT Alternative Solutions: Large Language Models Survey

A Systematic Evaluation of Large Language Models for Natural Language Generation Tasks

LLM4DS: Evaluating Large Language Models for Data Science Code Generation

Is ChatGPT a General-Purpose Natural Language Processing Task Solver?

A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity

Demystifying ChatGPT: An In-depth Survey of OpenAI's Robust Large Language Models

Can OpenSource beat ChatGPT? -- A Comparative Study of Large Language Models for Text-to-Code Generation