A Comprehensive Survey of Scientific Large Language Models and Their Applications in Scientific Discovery

Yu Zhang, Xiusi Chen, Bowen Jin, Sheng Wang, Shuiwang Ji, Wei Wang, Jiawei Han
2024-09-29
Abstract: In many scientific fields, large language models (LLMs) have revolutionized the way text and other modalities of data (e.g., molecules and proteins) are handled, achieving superior performance in various applications and augmenting the scientific discovery process. Nevertheless, previous surveys on scientific LLMs often concentrate on one or two fields or a single modality. In this paper, we aim to provide a more holistic view of the research landscape by unveiling cross-field and cross-modal connections between scientific LLMs regarding their architectures and pre-training techniques. To this end, we comprehensively survey over 260 scientific LLMs, discuss their commonalities and differences, as well as summarize pre-training datasets and evaluation tasks for each field and modality. Moreover, we investigate how LLMs have been deployed to benefit scientific discovery. Resources related to this survey are available at https://github.com/yuzhimanhua/Awesome-Scientific-Language-Models.
Computation and Language
What problem does this paper attempt to address?
This paper aims to give a comprehensive view of large language models (LLMs) in scientific fields and of their applications in scientific research. Specifically, it sets out to:
1. **Reveal cross-domain and cross-modal connections**: Existing surveys of scientific LLMs usually focus on one or two domains or a single modality. This paper instead offers a broader view of the research landscape by showing how architectures and pre-training techniques connect across domains and modalities.
2. **Summarize and analyze scientific LLMs**: The paper surveys more than 260 scientific LLMs, discusses their commonalities and differences, and summarizes the pre-training datasets and evaluation tasks for each domain and modality.
3. **Explore the applications of LLMs in scientific discovery**: The paper examines how LLMs have been deployed to support scientific discovery, including hypothesis generation, theorem proving, experimental design, drug discovery, and weather forecasting.
4. **Provide resources and support**: Resources related to the survey are collected in the project's GitHub repository (https://github.com/yuzhimanhua/Awesome-Scientific-Language-Models), so that readers can explore and use these models further.
Through these goals, the paper seeks to depict the connections between different scientific LLMs more accurately, highlight what they have in common, and possibly guide future model design and development.