ChatGPT vs LLaMA: Impact, Reliability, and Challenges in Stack Overflow Discussions

Leuson Da Silva, Jordan Samhi, Foutse Khomh
2024-02-14
Abstract: Since its release in November 2022, ChatGPT has shaken up Stack Overflow, the premier platform for developers' queries on programming and software development. Demonstrating an ability to generate instant, human-like responses to technical questions, ChatGPT has ignited debates within the developer community about the evolving role of human-driven platforms in the age of generative AI. Two months after ChatGPT's release, Meta released its answer with its own Large Language Model (LLM) called LLaMA: the race was on. We conducted an empirical study analyzing questions from Stack Overflow and using these LLMs to address them. This way, we aim to (i) measure how user engagement with Stack Overflow evolves over time; (ii) quantify the reliability of LLMs' answers and their potential to replace Stack Overflow in the long term; (iii) identify and understand why LLMs fail; and (iv) compare the LLMs with each other. Our empirical results are unequivocal: ChatGPT and LLaMA challenge human expertise, yet do not outperform it in some domains, while a significant decline in user posting activity has been observed. Furthermore, we discuss the impact of our findings on the usage and development of new LLMs.
Software Engineering, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses four main problems:

1. **Evaluating the impact of large language models (LLMs) on Stack Overflow user engagement**: The researchers analyze Stack Overflow activity data before and after ChatGPT's release, such as changes in the number of questions, answers, and comments, to assess how the release affected user engagement.
2. **Quantifying the reliability of LLM-generated answers and their potential to replace Stack Overflow**: The researchers select questions with accepted answers, generate answers with two LLMs, ChatGPT and LLaMA, and compare them with the original answers to evaluate how reliable the generated answers are and whether these models could replace Stack Overflow in the long term.
3. **Identifying the reasons for LLM failures**: The researchers analyze which types of questions are more difficult for LLMs and explore the factors that lead these models to generate unreliable or inaccurate answers.
4. **Comparing the performance of different LLMs**: The researchers compare ChatGPT and LLaMA on the same questions and analyze their respective strengths and weaknesses.

Through this research, the paper aims to evaluate how well LLMs work in software development, especially how they affect the behavior patterns of the developer community and the way it obtains technical support. This helps clarify the limitations of LLMs in practical use and provides a reference for future LLM development.
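The before-and-after engagement comparison described in point 1 could be sketched roughly as follows. This is a minimal illustration, not the authors' method: the function names, the toy data, and the exact cutoff date are assumptions (the source only says "November 2022"), and the sketch averages over months that contain at least one post.

```python
from collections import Counter
from datetime import date

# Assumed cutoff: the source only states "November 2022"; adjust as needed.
CHATGPT_RELEASE = date(2022, 11, 30)


def monthly_counts(post_dates):
    """Count posts per (year, month) bucket."""
    return Counter((d.year, d.month) for d in post_dates)


def mean_monthly_activity(post_dates, cutoff=CHATGPT_RELEASE):
    """Return mean posts per active month (before cutoff, from cutoff onward).

    Months with no posts are not counted, and the month containing the
    cutoff is split between the two periods.
    """
    post_dates = list(post_dates)
    before = monthly_counts(d for d in post_dates if d < cutoff)
    after = monthly_counts(d for d in post_dates if d >= cutoff)

    def mean(counts):
        return sum(counts.values()) / len(counts) if counts else 0.0

    return mean(before), mean(after)


def relative_change(before, after):
    """Relative change from `before` to `after` (negative means a decline)."""
    return (after - before) / before if before else float("nan")


# Hypothetical creation dates of questions, for illustration only.
toy_dates = [
    date(2022, 9, 1), date(2022, 9, 15),
    date(2022, 10, 3), date(2022, 10, 20),
    date(2022, 12, 5), date(2023, 1, 9),
]
before_mean, after_mean = mean_monthly_activity(toy_dates)
change = relative_change(before_mean, after_mean)  # -0.5 for the toy data
```

The same computation would apply separately to questions, answers, and comments; a real analysis would also need to account for seasonality and longer-term trends before attributing a decline to any single event.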