Annotating Materials Science Text: A Semi-Automated Approach for Crafting Outputs with Gemini Pro

Hasan M Sayeed, Trupti Mohanty, Taylor Sparks
DOI: https://doi.org/10.26434/chemrxiv-2024-173dp
2024-02-28
Abstract: Recent advancements in large language models (LLMs) have paved the way for automated information extraction in the materials science domain. However, fine-tuning these models, crucial for effective machine learning pipelines in materials science, is hindered by a lack of pre-annotated data. Manual annotation, a laborious process, exacerbates the challenge. To address this, we introduce a tailored semi-automated annotation process, using Google's Gemini Pro language model. Our approach focuses on two key tasks: extracting information in structured JSON format and generating abstractive summaries from materials science texts. The collaborative process, a symbiotic effort between human annotators and the LLM, driven by structured prompts and user-guided examples, enhances the annotation quality and augments the LLM's capacity to comprehend materials science intricacies. Importantly, it streamlines human annotation efforts by leveraging the LLM's proficient starting point.
Chemistry
What problem does this paper attempt to address?
This paper addresses a bottleneck in automated information extraction for materials science: the lack of pre-annotated data needed to fine-tune large language models (LLMs). Manual annotation is time-consuming and labor-intensive, which limits how effectively these models can be applied. The paper proposes a semi-automated annotation method in which Google's Gemini Pro language model, guided by structured prompts and user-supplied examples, produces a first-pass annotation that human annotators then refine. The method covers two tasks: extracting information as structured JSON and generating abstractive summaries from materials science texts. The stated aims are to speed up data extraction, reduce human annotation effort, and support the construction of materials science databases. The paper also discusses the challenges of evaluating such outputs and demonstrates the method's potential to improve efficiency and accuracy.
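
The prompt-and-review loop described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the prompt wording, and the JSON fields ("material", "property", "value") are all hypothetical, and the model call itself is left out so the example stays self-contained. It shows one plausible way to assemble a structured prompt from user-guided examples and to check that a model reply is valid JSON before handing it to a human annotator.

```python
import json

FENCE = "`" * 3  # markdown code-fence marker that models often wrap JSON in


def build_prompt(instruction, examples, text):
    """Assemble a few-shot prompt: instruction, worked examples, then the new text.

    `examples` is a list of (source_text, annotated_dict) pairs supplied by
    the human annotator. The layout here is an assumption, not the paper's.
    """
    parts = [instruction.strip(), ""]
    for i, (ex_text, ex_json) in enumerate(examples, start=1):
        parts.append(f"Example {i} text:\n{ex_text}")
        parts.append(f"Example {i} JSON:\n{json.dumps(ex_json, indent=2)}")
        parts.append("")
    parts.append(f"Text:\n{text}")
    parts.append("JSON:")
    return "\n".join(parts)


def parse_draft(raw):
    """Return the model's draft annotation as a dict, or None if it is not valid JSON.

    A None result signals that the annotator must write this record by hand.
    """
    cleaned = raw.strip()
    if cleaned.startswith(FENCE):
        lines = cleaned.splitlines()[1:]  # drop the opening fence line
        if lines and lines[-1].strip().startswith(FENCE):
            lines = lines[:-1]  # drop the closing fence line
        cleaned = "\n".join(lines)
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        return None


# Hypothetical user-guided example and schema, for illustration only.
examples = [
    (
        "The band gap of anatase TiO2 is about 3.2 eV.",
        {"material": "TiO2", "property": "band gap", "value": "3.2 eV"},
    )
]
prompt = build_prompt(
    "Extract the material, property, and value as JSON.",
    examples,
    "Silicon has a band gap of roughly 1.1 eV.",
)

# Stand-in for a model reply; a real pipeline would send `prompt` to the LLM.
fake_reply = FENCE + 'json\n{"material": "Si", "property": "band gap", "value": "1.1 eV"}\n' + FENCE
draft = parse_draft(fake_reply)
```

In a workflow like this, a parsed `draft` would serve only as the starting point: the annotator corrects it, and corrected records can be fed back in as further examples.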