Prompting LLMs to Compose Meta-Review Drafts from Peer-Review Narratives of Scholarly Manuscripts

Shubhra Kanti Karmaker Santu, Sanjeev Kumar Sinha, Naman Bansal, Alex Knipper, Souvika Sarkar, John Salvador, Yash Mahajan, Sri Guttikonda, Mousumi Akter, Matthew Freestone, Matthew C. Williams Jr
2024-02-24
Abstract: One of the most important yet onerous tasks in the academic peer-reviewing process is composing meta-reviews, which involves understanding the core contributions, strengths, and weaknesses of a scholarly manuscript based on peer-review narratives from multiple experts and then summarizing those multiple experts' perspectives into a concise holistic overview. Given the latest major developments in generative AI, especially Large Language Models (LLMs), it is very compelling to rigorously study the utility of LLMs in generating such meta-reviews in an academic peer-review setting. In this paper, we perform a case study with three popular LLMs, i.e., GPT-3.5, LLaMA2, and PaLM2, to automatically generate meta-reviews by prompting them with different types/levels of prompts based on the recently proposed TELeR taxonomy. Finally, we perform a detailed qualitative study of the meta-reviews generated by the LLMs and summarize our findings and recommendations for prompting LLMs for this complex task.
Machine Learning, Neural and Evolutionary Computing, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
This paper addresses the use of large language models (LLMs) to automatically generate draft meta-reviews of academic papers. In academic peer review, writing a meta-review is an important but time-consuming task: it requires understanding the opinions of multiple experts, extracting the key contributions, strengths, and weaknesses of the paper, and forming a comprehensive summary. Given recent advances in generative AI, and in LLMs in particular, researchers are interested in how well these models can automate meta-review creation.

The paper presents a case study in which three popular LLMs (GPT-3.5, LLaMA2, and PaLM2) generate meta-reviews from prompts of different types and levels of detail, defined according to the TELeR taxonomy. The authors then conduct a thorough qualitative analysis of the generated meta-reviews and compare them with meta-reviews written by experienced researchers.

The study finds that GPT-3.5 and PaLM2 perform similarly in the overall quality of the generated meta-reviews, and both outperform LLaMA2. PaLM2 excels in recall, while GPT-3.5 performs better in precision. In the macro-level evaluation, however, GPT-3.5 scores lower on understanding and following complex task requirements even though it performs well on micro-tasks, which indicates a need for further research. Performance generally improves as prompts move from simple to detailed, but in some cases there is no significant improvement from level 2 to level 3 and level 4 prompts. Because LLMs are sensitive to prompt wording, designing appropriate prompts is crucial for achieving optimal performance. Overall, the paper aims to address the challenges of using LLMs effectively for automated meta-review writing and to improve the quality and consistency of generated meta-reviews through prompt optimization.
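The progression from simple to detailed prompts described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function name `build_prompt`, the `SUBTASKS` list, and the prompt wording are hypothetical and are not the prompts used in the paper. The level definitions are a loose reading of the TELeR idea that each level adds more detail to the directive (one sentence, then a paragraph, then a bulleted list of sub-tasks, then a note on how the output will be judged).

```python
from typing import List

# Hypothetical sub-tasks, based on the meta-review duties named in the text.
SUBTASKS = [
    "Identify the core contributions of the manuscript.",
    "List the strengths noted by the reviewers.",
    "List the weaknesses noted by the reviewers.",
    "Summarize the reviewers' perspectives into a concise overall assessment.",
]


def build_prompt(reviews: List[str], level: int) -> str:
    """Assemble a meta-review prompt at an illustrative detail level (1-4).

    Each level keeps the text of the previous one and adds more guidance.
    The wording is a placeholder, not the prompts from the paper.
    """
    if level not in (1, 2, 3, 4):
        raise ValueError("level must be 1, 2, 3, or 4")

    # Level 1: a single-sentence directive.
    parts = ["Write a meta-review of the manuscript based on the reviews below."]

    # Level 2: a paragraph-style directive that spells out the goal.
    if level >= 2:
        parts.append(
            "The meta-review should cover the manuscript's core contributions, "
            "strengths, and weaknesses, and should combine the reviewers' "
            "perspectives into a concise holistic overview."
        )

    # Level 3: an explicit bulleted list of sub-tasks.
    if level >= 3:
        bullets = "\n".join(f"- {task}" for task in SUBTASKS)
        parts.append("Address each of the following sub-tasks:\n" + bullets)

    # Level 4: a note on how the output will be evaluated.
    if level >= 4:
        parts.append(
            "Your output will be judged on whether it covers the points raised "
            "by the reviewers and avoids claims that no reviewer made."
        )

    # Append the numbered peer-review narratives after the directive.
    review_block = "\n\n".join(
        f"Review {i}:\n{text.strip()}" for i, text in enumerate(reviews, start=1)
    )
    return "\n\n".join(parts) + "\n\n" + review_block
```

Calling `build_prompt(reviews, level)` for each level from 1 to 4 with the same list of review texts yields four prompts of increasing detail, which could then be sent to each model so that the outputs are compared level by level.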