S3LLM: Large-Scale Scientific Software Understanding with LLMs using Source, Metadata, and Document

Kareem Shaik, Dali Wang, Weijian Zheng, Qinglei Cao, Heng Fan, Peter Schwartz, Yunhe Feng
2024-03-16
Abstract: The understanding of large-scale scientific software poses significant challenges due to its diverse codebase, extensive code length, and target computing architectures. The emergence of generative AI, specifically large language models (LLMs), provides novel pathways for understanding such complex scientific codes. This paper presents S3LLM, an LLM-based framework designed to enable the examination of source code, code metadata, and summarized information in conjunction with textual technical reports in an interactive, conversational manner through a user-friendly interface. S3LLM leverages open-source LLaMA-2 models to enhance code analysis through the automatic transformation of natural language queries into domain-specific language (DSL) queries. Specifically, it translates these queries into Feature Query Language (FQL), enabling efficient scanning and parsing of entire code repositories. In addition, S3LLM is equipped to handle diverse metadata types, including DOT, SQL, and customized formats. Furthermore, S3LLM incorporates retrieval augmented generation (RAG) and LangChain technologies to directly query extensive documents. S3LLM demonstrates the potential of using locally deployed open-source LLMs for the rapid understanding of large-scale scientific computing software, eliminating the need for extensive coding expertise, and thereby making the process more efficient and effective. S3LLM is available at
Software Engineering, Artificial Intelligence
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper aims to address the challenges in understanding large-scale scientific computing software. Specifically:

1. **Diverse Codebases**: Large-scale scientific computing software often mixes multiple programming languages, including legacy ones such as Fortran and Pascal, which makes the code difficult for modern programmers to understand.
2. **Massive Code Volume**: These software systems may contain millions of lines of code, making it very challenging to fully comprehend each segment of the code.
3. **Suboptimal Documentation**: The documentation for these software systems is sometimes not detailed enough, further increasing the difficulty of thoroughly understanding the software.

Existing tools can analyze code and generate documentation, but they focus mainly on static code analysis and cannot handle dynamic queries. Users also need a certain level of programming knowledge to use these tools effectively.

### S3LLM's Solution

S3LLM is a framework based on large language models (LLMs) designed to address the aforementioned problems in the following ways:

1. **Natural Language Queries**: S3LLM allows users to make queries using natural language without requiring deep programming knowledge.
2. **Multi-Angle Analysis**: S3LLM can handle source code, code metadata, and textual technical reports, providing a comprehensive understanding of the software.
3. **Efficient Querying**: By automatically converting natural language queries into domain-specific query languages (such as FQL), S3LLM can efficiently scan and parse the entire code repository.
4. **Flexible Model Selection**: S3LLM offers the LLaMA-2 model at different parameter scales (7B, 13B, 70B), allowing users to choose the appropriate model based on their needs.
5. **Open Source Tool**: S3LLM is an open-source tool, ensuring broad accessibility and practicality in various scientific computing applications.

### Main Contributions

- **Design and Implementation of S3LLM**: Proposes a new framework that leverages LLMs to enhance the understanding of large-scale scientific software.
- **User-Friendly Interface**: Utilizes natural language processing technology, enabling users to easily query and understand scientific software even without programming knowledge.
- **Flexible Model Selection**: Provides different scales of the LLaMA-2 model to meet the computational needs of different users.
- **Experimental Validation**: Demonstrates the effectiveness of S3LLM in analyzing source code, metadata, and textual documents through experiments with the large-scale Energy Exascale Earth System Model (E3SM).
- **Open Source Contribution**: Releases S3LLM as an open-source tool, promoting the development of the scientific computing community.

In summary, S3LLM combines natural language processing and large language models to provide a user-friendly interface, enabling scientists and developers to more efficiently and intuitively understand and use complex scientific computing software.
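To make the document-querying idea concrete, here is a minimal sketch of the retrieval step behind retrieval augmented generation: rank document chunks against a question, then place the best matches in the prompt sent to the model. It is not S3LLM's implementation. It swaps in plain term-count cosine similarity for the embedding-based retrieval a LangChain pipeline would normally use, and the names `tokenize`, `cosine`, `retrieve`, `build_prompt`, and the sample `CHUNKS` are all hypothetical, chosen only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, chunks, k=2):
    """Return the k chunks most similar to the question."""
    q = Counter(tokenize(question))
    ranked = sorted(
        chunks,
        key=lambda c: cosine(q, Counter(tokenize(c))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, chunks, k=2):
    """Assemble an LLM prompt from the top-k retrieved chunks."""
    context = "\n".join(f"- {c}" for c in retrieve(question, chunks, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Hypothetical document chunks, invented for this example only.
CHUNKS = [
    "The land model computes soil temperature on a vertical grid.",
    "Build instructions: configure the compiler and MPI library.",
    "Photosynthesis is computed per plant functional type in the land model.",
]

if __name__ == "__main__":
    print(build_prompt("Which compiler and MPI library are needed?", CHUNKS, k=1))
```

The resulting prompt string would then be passed to a locally deployed model; that call is omitted here because it depends on how the model is served.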