Abstract: The understanding of large-scale scientific software poses significant challenges due to its diverse codebase, extensive code length, and target computing architectures. The emergence of generative AI, specifically large language models (LLMs), provides novel pathways for understanding such complex scientific codes. This paper presents S3LLM, an LLM-based framework designed to enable the examination of source code, code metadata, and summarized information in conjunction with textual technical reports in an interactive, conversational manner through a user-friendly interface. S3LLM leverages open-source LLaMA-2 models to enhance code analysis through the automatic transformation of natural language queries into domain-specific language (DSL) queries. Specifically, it translates these queries into Feature Query Language (FQL), enabling efficient scanning and parsing of entire code repositories. In addition, S3LLM is equipped to handle diverse metadata types, including DOT, SQL, and customized formats. Furthermore, S3LLM incorporates retrieval augmented generation (RAG) and LangChain technologies to directly query extensive documents. S3LLM demonstrates the potential of using locally deployed open-source LLMs for the rapid understanding of large-scale scientific computing software, eliminating the need for extensive coding expertise, and thereby making the process more efficient and effective. S3LLM is available at

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper addresses the challenges of understanding large-scale scientific computing software. Specifically:

1. **Diverse Codebases**: Large-scale scientific computing software often mixes multiple programming languages (such as Fortran, Pascal, etc.), which makes the code difficult for modern programmers to understand.
2. **Massive Code Volume**: These software systems may contain millions of lines of code, making it very challenging to fully comprehend each segment of the code.
3. **Suboptimal Documentation**: The documentation for these software systems is sometimes not detailed enough, which makes a thorough understanding of the software even harder.

Existing tools can analyze code and generate documentation, but they focus mainly on static code analysis and cannot handle dynamic queries. Users also need a certain level of programming knowledge to use these tools effectively.

### S3LLM's Solution

S3LLM is a framework based on large language models (LLMs) designed to address the problems above in the following ways:

1. **Natural Language Queries**: S3LLM allows users to make queries in natural language without requiring deep programming knowledge.
2. **Multi-Angle Analysis**: S3LLM can handle source code, code metadata, and textual technical reports, providing a comprehensive understanding of the software.
3. **Efficient Querying**: By automatically converting natural language queries into domain-specific query languages (such as FQL), S3LLM can efficiently scan and parse the entire code repository.
4. **Flexible Model Selection**: S3LLM offers the LLaMA-2 model at different parameter scales (7B, 13B, 70B), allowing users to choose the appropriate model based on their needs.
5. **Open Source Tool**: S3LLM is an open-source tool, ensuring broad accessibility and practicality in various scientific computing applications.

### Main Contributions

- **Design and Implementation of S3LLM**: Proposes a new framework that leverages LLMs to enhance the understanding of large-scale scientific software.
- **User-Friendly Interface**: Utilizes natural language processing technology, enabling users to easily query and understand scientific software even without programming knowledge.
- **Flexible Model Selection**: Provides different scales of the LLaMA-2 model to meet the computational needs of different users.
- **Experimental Validation**: Demonstrates the effectiveness of S3LLM in analyzing source code, metadata, and textual documents through experiments with the large-scale Energy Exascale Earth System Model (E3SM).
- **Open Source Contribution**: Releases S3LLM as an open-source tool, promoting the development of the scientific computing community.

In summary, S3LLM combines natural language processing and large language models to provide a user-friendly interface, enabling scientists and developers to more efficiently and intuitively understand and use complex scientific computing software.
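To make the document-querying idea concrete, the sketch below shows the retrieval step of retrieval augmented generation in plain Python. It is not S3LLM's implementation: the abstract says S3LLM uses LangChain with LLaMA-2, whereas this toy version swaps in a bag-of-words cosine similarity so that it runs with the standard library alone. The function names (`tokenize`, `cosine`, `retrieve`, `build_prompt`), the prompt wording, and the sample chunks are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, chunks, k=2):
    """Return the k chunks most similar to the question (toy retriever)."""
    q = Counter(tokenize(question))
    ranked = sorted(
        chunks,
        key=lambda c: cosine(q, Counter(tokenize(c))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, chunks, k=2):
    """Assemble an LLM prompt from the retrieved chunks and the question."""
    context = "\n".join(f"- {c}" for c in retrieve(question, chunks, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Hypothetical document chunks, invented for illustration only.
chunks = [
    "The land model computes surface energy fluxes at each time step.",
    "Build instructions: configure the compiler and MPI library first.",
    "The ocean component uses an unstructured mesh.",
]
question = "How does the land model compute energy fluxes?"
prompt = build_prompt(question, chunks, k=1)
print(prompt)
```

In a real deployment the resulting prompt would be passed to a locally hosted model, and the bag-of-words scoring would typically be replaced by embedding-based similarity over a vector store.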

S3LLM: Large-Scale Scientific Software Understanding with LLMs using Source, Metadata, and Document
