OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs

Akari Asai, Jacqueline He, Rulin Shao, Weijia Shi, Amanpreet Singh, Joseph Chee Chang, Kyle Lo, Luca Soldaini, Sergey Feldman, Mike D'arcy, David Wadden, Matt Latzke, Minyang Tian, Pan Ji, Shengyan Liu, Hao Tong, Bohao Wu, Yanyu Xiong, Luke Zettlemoyer, Graham Neubig, Dan Weld, Doug Downey, Wen-tau Yih, Pang Wei Koh, Hannaneh Hajishirzi
2024-11-21
Abstract: Scientific progress depends on researchers' ability to synthesize the growing body of literature. Can large language models (LMs) assist scientists in this task? We introduce OpenScholar, a specialized retrieval-augmented LM that answers scientific queries by identifying relevant passages from 45 million open-access papers and synthesizing citation-backed responses. To evaluate OpenScholar, we develop ScholarQABench, the first large-scale multi-domain benchmark for literature search, comprising 2,967 expert-written queries and 208 long-form answers across computer science, physics, neuroscience, and biomedicine. On ScholarQABench, OpenScholar-8B outperforms GPT-4o by 5% and PaperQA2 by 7% in correctness, despite being a smaller, open model. While GPT-4o hallucinates citations 78 to 90% of the time, OpenScholar achieves citation accuracy on par with human experts. OpenScholar's datastore, retriever, and self-feedback inference loop also improve off-the-shelf LMs: for instance, OpenScholar-GPT4o improves GPT-4o's correctness by 12%. In human evaluations, experts preferred OpenScholar-8B and OpenScholar-GPT4o responses over expert-written ones 51% and 70% of the time, respectively, compared to GPT-4o's 32%. We open-source all of our code, models, datastore, data and a public demo.
Computation and Language, Artificial Intelligence, Information Retrieval, Machine Learning
What problem does this paper attempt to address?
This paper addresses how large language models (LLMs) can help researchers synthesize an ever-growing body of scientific literature. Specifically, it aims to:
1. **Improve the quality and accuracy of literature synthesis**: The authors develop OpenScholar, a specialized retrieval-augmented language model that identifies relevant passages in the literature and generates citation-backed answers, helping researchers understand existing work, discover new research directions, refine methodologies, and make evidence-based decisions.
2. **Address the limitations of existing large language models**: Existing LLMs struggle with scientific literature synthesis because they hallucinate content, rely on outdated pre-training data, and lack a transparent citation mechanism. These limitations make the literature reviews they generate insufficiently accurate and reliable.
3. **Provide a large-scale multidisciplinary evaluation benchmark**: To evaluate literature synthesis models more comprehensively, the authors construct ScholarQABench, a benchmark containing 2,967 expert-written queries and 208 long-form answers across computer science, physics, neuroscience, and biomedicine. It allows the literature retrieval and synthesis capabilities of different models to be evaluated systematically.
4. **Promote open science**: By open-sourcing all code, models, the datastore, and data, and by providing a public demo, the authors aim to make the work open and reproducible, so that other researchers can use and improve OpenScholar.

Overall, the paper's goal is to address current challenges in scientific literature synthesis by developing an accurate, open tool for it, thereby improving the efficiency and quality of scientific research.
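
To make the notion of a hallucinated citation more concrete, the following is a minimal sketch of one way such a rate could be measured. It is not the paper's metric or code: the bracketed citation format, the function name `hallucinated_citation_rate`, and the example paper IDs are all hypothetical, and the check only tests whether a cited ID exists in a known set, not whether the cited paper actually supports the claim.

```python
import re


def hallucinated_citation_rate(response: str, known_ids: set[str]) -> float:
    """Return the fraction of citation markers in `response` whose ID is unknown.

    Assumption (hypothetical format): citations appear as [paper_id] in the text.
    """
    cited = re.findall(r"\[([^\[\]]+)\]", response)
    if not cited:
        # No citations at all: nothing can be hallucinated under this definition.
        return 0.0
    missing = [c for c in cited if c not in known_ids]
    return len(missing) / len(cited)


# Hypothetical datastore IDs and model answer, for illustration only.
datastore_ids = {"smith2020", "lee2021"}
answer = "Retrieval helps factuality [smith2020]. Scaling also matters [doe2019]."
rate = hallucinated_citation_rate(answer, datastore_ids)
print(rate)
```

In this toy example one of the two cited IDs is absent from the set, so half of the citations count as hallucinated. A realistic evaluation would also need to verify that each cited passage supports the sentence it is attached to.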