WisPerMed at BioLaySumm: Adapting Autoregressive Large Language Models for Lay Summarization of Scientific Articles

Tabea M. G. Pakull, Hendrik Damm, Ahmad Idrissi-Yaghir, Henning Schäfer, Peter A. Horn, Christoph M. Friedrich
2024-09-23
Abstract: This paper details the efforts of the WisPerMed team in the BioLaySumm2024 Shared Task on automatic lay summarization in the biomedical domain, aimed at making scientific publications accessible to non-specialists. Large language models (LLMs), specifically the BioMistral and Llama3 models, were fine-tuned and employed to create lay summaries from complex scientific texts. The summarization performance was enhanced through various approaches, including instruction tuning, few-shot learning, and prompt variations tailored to incorporate specific context information. The experiments demonstrated that fine-tuning generally led to the best performance across most evaluated metrics. Few-shot learning notably improved the models' ability to generate relevant and factually accurate texts, particularly when using a well-crafted prompt. Additionally, a Dynamic Expert Selection (DES) mechanism was developed to optimize the selection of text outputs based on readability and factuality metrics. Out of 54 participants, the WisPerMed team placed 4th, as measured by readability, factuality, and relevance. In terms of the overall score, our approach improved upon the baseline by approximately 5.5 percentage points and was only approximately 1.5 percentage points behind first place.
Computation and Language, Machine Learning
What problem does this paper attempt to address?
The paper addresses the problem that scientific publications in the biomedical field are difficult for non-specialist readers to understand. The WisPerMed team participated in the BioLaySumm2024 Shared Task, whose goal is to generate lay summaries of complex scientific articles. To this end, the researchers used large language models (LLMs), specifically the BioMistral and Llama3 models, and improved summarization performance through several approaches: instruction tuning, few-shot learning, and prompt variations tailored to incorporate specific context information. The experiments indicate that fine-tuning generally gave the best performance on most evaluated metrics, and that few-shot learning notably improved the models' ability to generate relevant and factually accurate text, particularly with a well-crafted prompt. The team also developed a Dynamic Expert Selection (DES) mechanism that optimizes the selection of text outputs based on readability and factuality metrics. Out of 54 participants, the WisPerMed team placed 4th, as measured by readability, factuality, and relevance; on the overall score, their approach improved upon the baseline by approximately 5.5 percentage points and trailed first place by only about 1.5 percentage points.
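The abstract does not spell out how DES scores or combines its metrics, so the following is only a minimal sketch of the general idea: several models ("experts") each produce a candidate summary, each candidate receives a readability score and a factuality score, and the candidate with the best combined score is kept. Everything here is an assumption made for illustration, not the authors' implementation: the function names, the equal weighting, the min-max normalization, and especially the two scoring functions, which are crude stand-ins for real readability and factuality metrics.

```python
import re
from typing import Dict, List


def readability_proxy(text: str) -> float:
    """Crude stand-in for a readability metric (assumption, higher = easier).

    Shorter sentences and shorter words give a higher score.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    words_per_sentence = len(words) / len(sentences)
    chars_per_word = sum(len(w) for w in words) / len(words)
    return 1.0 / (words_per_sentence * chars_per_word)


def factuality_proxy(summary: str, source: str) -> float:
    """Crude stand-in for a factuality metric (assumption, higher = better).

    Fraction of summary words that also occur in the source article.
    """
    source_words = {w.lower() for w in re.findall(r"[A-Za-z']+", source)}
    summary_words = [w.lower() for w in re.findall(r"[A-Za-z']+", summary)]
    if not summary_words:
        return 0.0
    return sum(w in source_words for w in summary_words) / len(summary_words)


def min_max(values: List[float]) -> List[float]:
    """Rescale scores to [0, 1] so the two metrics are comparable."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def select_output(candidates: Dict[str, str], source: str,
                  weight: float = 0.5) -> str:
    """Return the name of the expert whose summary has the best combined score.

    `weight` is a hypothetical trade-off between readability and factuality.
    """
    names = list(candidates)
    read = min_max([readability_proxy(candidates[n]) for n in names])
    fact = min_max([factuality_proxy(candidates[n], source) for n in names])
    combined = [weight * r + (1.0 - weight) * f for r, f in zip(read, fact)]
    best_index = max(range(len(names)), key=combined.__getitem__)
    return names[best_index]


if __name__ == "__main__":
    article = ("The protein binds to the receptor and blocks the signal. "
               "This reduces inflammation in mice.")
    outputs = {
        "model_a": "The protein binds to the receptor. "
                   "This reduces inflammation in mice.",
        "model_b": "Pharmacodynamic immunomodulation characteristically "
                   "attenuates proinflammatory cascades notwithstanding "
                   "heterogeneous experimental parameters.",
    }
    print(select_output(outputs, article))  # prints: model_a
```

In a realistic setting the two proxy functions would be replaced by established readability and factuality metrics, and the weighting would be tuned on validation data; the selection step itself stays the same.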