ChatGPT v Bard v Bing v Claude 2 v Aria v human-expert. How good are AI chatbots at scientific writing?

Edisa Lozić, Benjamin Štular
DOI: https://doi.org/10.3390/fi15100336
2023-10-16
Abstract: Historical emphasis on writing mastery has shifted with advances in generative AI, especially in scientific writing. This study analysed six AI chatbots for scholarly writing in humanities and archaeology. Using methods that assessed factual correctness and scientific contribution, ChatGPT-4 showed the highest quantitative accuracy, closely followed by ChatGPT-3.5, Bing, and Bard. However, Claude 2 and Aria scored considerably lower. Qualitatively, all AIs exhibited proficiency in merging existing knowledge, but none produced original scientific content. Interestingly, our findings suggest ChatGPT-4 might represent a plateau in large language model size. This research emphasizes the unique, intricate nature of human research, suggesting that AI's emulation of human originality in scientific writing is challenging. As of 2023, while AI has transformed content generation, it struggles with original contributions in humanities. This may change as AI chatbots continue to evolve into LLM-powered software.
Computation and Language, Artificial Intelligence, Computers and Society, Emerging Technologies, Human-Computer Interaction
What problem does this paper attempt to address?
This paper evaluates the scientific writing capabilities of AI chatbots in the humanities and archaeology. Specifically, the authors assess the chatbots on two aspects:
1. **The ability to generate correct answers**: whether the chatbots can answer complex scientific questions correctly.
2. **The ability to generate original scientific contributions**: whether the chatbots can produce original scientific content in humanities research.
To this end, the authors designed an interdisciplinary case study combining archaeology, history, linguistics, and genetic history. They wrote two one-time prompts posing complex scientific questions and submitted them to six AI chatbots (ChatGPT-3.5, ChatGPT-4, Bard, Bing Chatbot, Aria, and Claude 2) and to two ChatGPT-4 plugins (Bing and ScholarAI). The outputs were compared with one another and with human-generated content. With this test, the authors aim to provide a benchmark against which the rapid development of AI chatbots can be measured, and to discuss the impact and potential applications of these technologies in the future of the humanities.