Abstract: For over a decade, machine learning (ML) models have been making strides in computer vision and natural language processing (NLP), demonstrating high proficiency in specialized tasks. The emergence of large-scale language and generative image models, such as ChatGPT and Stable Diffusion, has significantly broadened the accessibility and application scope of these technologies. Traditional predictive models are typically constrained to mapping input data to numerical values or predefined categories, limiting their usefulness beyond their designated tasks. In contrast, contemporary models employ representation learning and generative modeling, enabling them to extract and encode key insights from a wide variety of data sources and decode them to create novel responses for desired goals. They can interpret queries phrased in natural language to deduce the intended output. In parallel, the application of ML techniques in materials science has advanced considerably, particularly in areas like inverse design, material prediction, and atomic modeling. Despite these advancements, the current models are overly specialized, hindering their potential to supplant established industrial processes. Materials science, therefore, necessitates the creation of a comprehensive, versatile model capable of interpreting human-readable inputs, intuiting a wide range of possible search directions, and delivering precise solutions. To realize such a model, the field must adopt cutting-edge representation, generative, and foundation model techniques tailored to materials science. A pivotal component in this endeavor is the establishment of an extensive, centralized dataset encompassing a broad spectrum of research topics. This dataset could be assembled by crowdsourcing global research contributions and developing models to extract data from existing literature and represent them in a homogeneous format.
A massive dataset can be used to train a central model that learns the underlying physics of the target areas, which can then be connected to a variety of specialized downstream tasks. Ultimately, the envisioned model would empower users to intuitively pose queries for a wide array of desired outcomes. It would facilitate the search for existing data that closely matches the sought-after solutions and leverage its understanding of physics and material-behavior relationships to innovate new solutions when pre-existing ones fall short.

NLP meets Materials Science: Quantifying the presentation of materials data in scientific literature

Materials science in the era of large language models: a perspective

From Text to Insight: Large Language Models for Materials Science Data Extraction

Are LLMs Ready for Real-World Materials Discovery?

Mining experimental data from Materials Science literature with Large Language Models: an evaluation study

Natural Language Processing Techniques for Advancing Materials Discovery: A Short Review

Large Language Models as Master Key: Unlocking the Secrets of Materials Science with GPT

Advancing materials science through next-generation machine learning

Looking through glass: Knowledge discovery from materials science literature using natural language processing

Quantitative estimation of diphtheria and tetanus toxoids. 3. Comparative assays in mice and in guinea-pigs of two tetanus toxoid preparations.

Exploring large language models for microstructure evolution in materials

A Prompt-Engineered Large Language Model, Deep Learning Workflow for Materials Classification

Analyzing Research Trends in Inorganic Materials Literature Using NLP

Polymetis: Large Language Modeling for Multiple Material Domains

Flexible, Model-Agnostic Method for Materials Data Extraction from Text Using General Purpose Language Models

Materials Data toward Machine Learning: Advances and Challenges

Harnessing the Materials Project for machine-learning and accelerated discovery

Audacity of huge: overcoming challenges of data scarcity and data quality for machine learning in computational materials discovery

Functional Material Systems Enabled by Automated Data Extraction and Machine Learning