Abstract: Data plays an increasingly important role in the exploration of cutting-edge materials, and a number of datasets have been generated either by hand or by automated approaches. However, the materials science field struggles to use this abundance of data effectively, especially in applied disciplines where materials are evaluated based on device performance rather than their properties. This article presents a new natural language processing (NLP) task called structured information inference (SII) to address the complexities of information extraction at the device level in materials science. We accomplished this task by fine-tuning GPT-3 on an existing perovskite solar cell FAIR (Findable, Accessible, Interoperable, Reusable) dataset, achieving a 91.8% F1-score, and extended the dataset with data published since its release. The produced data is formatted and normalized, so it can be used directly as input to subsequent data analysis. This feature empowers materials scientists to develop models by selecting high-quality review articles within their domain. Additionally, we designed experiments that use large language models (LLMs) to predict the electrical performance of solar cells and to design materials or devices with targeted parameters. Our results demonstrate performance comparable to traditional machine learning methods without feature selection, highlighting the potential of LLMs to acquire scientific knowledge and design new materials akin to materials scientists.
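The abstract reports a 91.8% F1-score but does not say here how the score was computed. As a rough illustration only, the sketch below shows one common choice for structured extraction: micro-averaged F1 over (field, value) pairs. The function name `field_level_f1`, the field names, and the example values are assumptions made for this sketch, not the paper's evaluation protocol.

```python
from typing import Dict, List, Tuple


def field_level_f1(
    predicted: List[Dict[str, str]],
    reference: List[Dict[str, str]],
) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over (field, value) pairs.

    Records are compared position by position; a pair counts as correct
    only when both the field name and its value match exactly.
    """
    true_pos = false_pos = false_neg = 0
    for pred, ref in zip(predicted, reference):
        pred_items = set(pred.items())
        ref_items = set(ref.items())
        true_pos += len(pred_items & ref_items)
        false_pos += len(pred_items - ref_items)
        false_neg += len(ref_items - pred_items)

    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical device record: one of three extracted fields is wrong.
reference = [{"absorber": "MAPbI3", "etl": "TiO2", "htl": "Spiro-OMeTAD"}]
predicted = [{"absorber": "MAPbI3", "etl": "SnO2", "htl": "Spiro-OMeTAD"}]

precision, recall, f1 = field_level_f1(predicted, reference)
print(f"{precision:.3f} {recall:.3f} {f1:.3f}")  # 0.667 0.667 0.667
```

Exact matching is the strictest option; an evaluation that normalizes material names before comparing would score the same predictions higher.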

What problem does this paper attempt to address?

This paper attempts to address the issue of underutilization of data in materials science research, particularly in application fields where materials are often evaluated based on device performance rather than their intrinsic properties. Specifically, the paper proposes a new natural language processing (NLP) task, Structured Information Inference (SII), aimed at efficiently extracting complex information related to device performance from scientific literature.

### Main Issues

1. **Underutilization of Data**: Despite the generation of a large amount of data in the field of materials science, effective utilization of this data remains challenging. In application disciplines, materials are primarily evaluated based on device performance rather than a comprehensive understanding of their properties and behavior.
2. **Information Extraction Challenges**: Efficiently extracting relevant information from a vast amount of unstructured scientific literature is a significant challenge. This not only hinders a comprehensive understanding of material candidates and their properties but also limits the identification of future applications.
3. **Limitations of Existing Methods**: Existing natural language processing techniques, such as Named Entity Recognition (NER), although excellent at extracting entities, still fall short in extracting information at the device level due to the complex relationships between each material and entity within a device.

### Solutions

1. **Introduction of the SII Task**: The paper proposes a new NLP task, Structured Information Inference (SII), which encompasses mainstream NLP tasks such as Named Entity Recognition (NER), Entity Resolution (ER), Relation Extraction (RE), and Information Inference (II). By fine-tuning the GPT-3 model, the goal of efficiently extracting complex information at the device level is achieved.
2. **Dataset Expansion**: The research team expanded the existing FAIR dataset of perovskite solar cells by adding new data since the dataset's release and formatting and standardizing it for subsequent data analysis.
3. **Experimental Validation**: Through designed experiments, the paper demonstrates how large language models (LLMs) can be used to predict the electrical performance of solar cells and design materials or devices with target parameters. The results show that LLMs, without feature selection, perform comparably to traditional machine learning methods, showcasing the potential of LLMs in acquiring scientific knowledge and designing new materials.

### Significance

- **Improving Data Utilization**: Through the SII task, materials scientists can more effectively utilize data from a large amount of scientific literature, accelerating the process of material discovery and design.
- **Automated Data Processing**: LLMs can automatically extract high-quality data from review articles, reducing the time and cost of manual annotation.
- **Advancing Materials Science**: This method provides new tools and approaches for data-driven research in the field of materials science, helping to drive innovation and development in the field.
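To make the idea of device-level structured output more concrete, here is a minimal sketch of how a passage and its normalized device record might be paired as a prompt/completion example for fine-tuning, and how a completion could be parsed back into a record. Everything specific in it is an assumption for illustration: the field names in `FIELDS`, the `"unknown"` placeholder, the `###` separator, and the sample passage are hypothetical and are not taken from the paper or the perovskite dataset schema.

```python
import json
from typing import Dict

# Hypothetical device-level fields; the real dataset schema may differ.
FIELDS = ("substrate", "etl", "absorber", "htl", "back_contact")

# Arbitrary marker separating the passage from the expected output.
SEPARATOR = "\n\n###\n\n"


def normalize_record(raw: Dict[str, str]) -> Dict[str, str]:
    """Keep only known fields, trim whitespace, and mark missing ones."""
    return {name: raw.get(name, "").strip() or "unknown" for name in FIELDS}


def to_training_example(passage: str, record: Dict[str, str]) -> Dict[str, str]:
    """Pair a source passage with its structured record serialized as JSON."""
    completion = json.dumps(normalize_record(record), sort_keys=True)
    return {"prompt": passage.strip() + SEPARATOR, "completion": " " + completion}


def parse_completion(completion: str) -> Dict[str, str]:
    """Turn a model completion back into a normalized record."""
    return normalize_record(json.loads(completion))


# Invented passage and annotation, for illustration only.
passage = "Devices were built on FTO with a TiO2 layer, MAPbI3, and Spiro-OMeTAD."
annotation = {
    "substrate": "FTO ",
    "etl": "TiO2",
    "absorber": "MAPbI3",
    "htl": "Spiro-OMeTAD",
    "comment": "not part of the schema",
}

example = to_training_example(passage, annotation)
record = parse_completion(example["completion"])
print(record)
```

Serializing with a fixed field set and sorted keys keeps every completion in the same shape, which is what lets the extracted records feed directly into later analysis without manual cleanup.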

Large Language Models as Master Key: Unlocking the Secrets of Materials Science with GPT

Mining experimental data from Materials Science literature with Large Language Models: an evaluation study

Creation of a structured solar cell material dataset and performance prediction using large language models

Exploring the Potential of Large Language Models in Molecular Tasks: An Insightful Evaluation with GPT‐4

Polymetis: Large Language Modeling for Multiple Material Domains

Materials science in the era of large language models: a perspective

From Text to Insight: Large Language Models for Materials Science Data Extraction

A Prompt-Engineered Large Language Model, Deep Learning Workflow for Materials Classification

14 Examples of How LLMs Can Transform Materials Science and Chemistry: A Reflection on a Large Language Model Hackathon

The Future of Molecular Studies Through the Lens of Large Language Models

Evaluation of Open-Source Large Language Models for Metal-Organic Frameworks Research

Comparative Study of Large Language Model Architectures on Frontier

From Tokens to Materials: Leveraging Language Models for Scientific Discovery

Flexible, Model-Agnostic Method for Materials Data Extraction from Text Using General Purpose Language Models

Exploring large language models for microstructure evolution in materials

MaScQA: Investigating Materials Science Knowledge of Large Language Models

Leveraging large language models for predictive chemistry

NLP meets Materials Science: Quantifying the presentation of materials data in scientific literature

Towards Development of Automated Knowledge Maps and Databases for Materials Engineering using Large Language Models

Advancing materials science through next-generation machine learning