Abstract: Scientific progress depends on researchers' ability to synthesize the growing body of literature. Can large language models (LMs) assist scientists in this task? We introduce OpenScholar, a specialized retrieval-augmented LM that answers scientific queries by identifying relevant passages from 45 million open-access papers and synthesizing citation-backed responses. To evaluate OpenScholar, we develop ScholarQABench, the first large-scale multi-domain benchmark for literature search, comprising 2,967 expert-written queries and 208 long-form answers across computer science, physics, neuroscience, and biomedicine. On ScholarQABench, OpenScholar-8B outperforms GPT-4o by 5% and PaperQA2 by 7% in correctness, despite being a smaller, open model. While GPT-4o hallucinates citations 78 to 90% of the time, OpenScholar achieves citation accuracy on par with human experts. OpenScholar's datastore, retriever, and self-feedback inference loop also improve off-the-shelf LMs: for instance, OpenScholar-GPT4o improves GPT-4o's correctness by 12%. In human evaluations, experts preferred OpenScholar-8B and OpenScholar-GPT4o responses over expert-written ones 51% and 70% of the time, respectively, compared to GPT-4o's 32%. We open-source all of our code, models, datastore, and data, along with a public demo.

What problem does this paper attempt to address?

This paper addresses how large language models (LLMs) can help researchers synthesize the ever-increasing body of scientific literature. Specifically, it aims to:

1. **Improve the quality and accuracy of literature synthesis**: The authors develop OpenScholar, a specialized retrieval-augmented language model that identifies relevant passages in the literature and generates citation-backed answers. This helps researchers understand existing work, discover new research directions, improve methodologies, and support evidence-based decision-making.
2. **Address the limitations of existing large language models**: Existing LLMs struggle with scientific literature synthesis: they generate false information (hallucinations), rely on outdated pre-training data, and lack a transparent citation mechanism. As a result, the literature reviews they generate are not sufficiently accurate or reliable.
3. **Provide a large-scale multidisciplinary evaluation benchmark**: To evaluate scientific literature synthesis models more comprehensively, the authors construct a large-scale multidisciplinary benchmark named ScholarQABench. It contains 2,967 expert-written queries and 208 long-form answers covering computer science, physics, neuroscience, and biomedicine, and it allows the literature retrieval and synthesis capabilities of different models to be evaluated systematically.
4. **Promote open science**: By open-sourcing all code, models, the datastore, and data, and by providing a public demo, the authors hope to promote openness and reproducibility in scientific research, enabling more researchers to use and improve OpenScholar.

Overall, the paper aims to address current challenges in working with scientific literature by developing an efficient, accurate, and open literature synthesis tool, thereby improving the efficiency and quality of scientific research.

OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs
