Can Open-Source LLMs Compete with Commercial Models? Exploring the Few-Shot Performance of Current GPT Models in Biomedical Tasks

Samy Ateia,Udo Kruschwitz
2024-07-18
Abstract:Commercial large language models (LLMs), like OpenAI's GPT-4 powering ChatGPT and Anthropic's Claude 3 Opus, have dominated natural language processing (NLP) benchmarks across different domains. New competing Open-Source alternatives like Mixtral 8x7B or Llama 3 have emerged and seem to be closing the gap while often offering higher throughput and being less costly to use. Open-Source LLMs can also be self-hosted, which makes them interesting for enterprise and clinical use cases where sensitive data should not be processed by third parties. We participated in the 12th BioASQ challenge, which is a retrieval augmented generation (RAG) setting, and explored the performance of current GPT models Claude 3 Opus, GPT-3.5-turbo and Mixtral 8x7b with in-context learning (zero-shot, few-shot) and QLoRa fine-tuning. We also explored how additional relevant knowledge from Wikipedia added to the context-window of the LLM might improve their performance. Mixtral 8x7b was competitive in the 10-shot setting, both with and without fine-tuning, but failed to produce usable results in the zero-shot setting. QLoRa fine-tuning and Wikipedia context did not lead to measurable performance gains. Our results indicate that the performance gap between commercial and open-source models in RAG setups exists mainly in the zero-shot setting and can be closed by simply collecting few-shot examples for domain-specific use cases. The code needed to rerun these experiments is available through GitHub.
Computation and Language
What problem does this paper attempt to address?
The main problem this paper attempts to address is whether open-source large language models (LLMs) can perform on par with commercial models in biomedical tasks. Specifically, the researchers participated in the 12th BioASQ challenge, a competition focused on biomedical semantic question answering. Through this platform, they explored the performance of current GPT models (including Claude 3 Opus, GPT-3.5-turbo, and Mixtral 8x7b) under zero-shot and few-shot learning as well as QLoRa fine-tuning. Additionally, the study investigated whether adding relevant knowledge from Wikipedia to the LLM's context window could improve its performance. The study found that in the 10-shot setting, the Mixtral 8x7b model was quite competitive, regardless of whether it was fine-tuned; however, it performed poorly in the zero-shot setting. QLoRa fine-tuning and Wikipedia context did not bring significant performance improvements. The results indicate that the performance gap between commercial and open-source models is mainly evident in the zero-shot setting, and this gap can be narrowed by collecting a small number of domain-specific sample examples. These findings are significant for handling sensitive data in enterprise and clinical applications, as open-source models can be self-hosted, avoiding the risk of sending data to third parties.