Generative artificial intelligence GPT-4 accelerates knowledge mining and machine learning for synthetic biology

Zhengyang Xiao, Wenyu Li, Hannah Moon, Yixin Chen, Garrett W Roell, Yinjie J Tang
DOI: https://doi.org/10.1101/2023.06.14.544984
2023-06-15
bioRxiv
Abstract: Knowledge mining from synthetic biology journal articles for machine learning (ML) applications is a labor-intensive process. The development of natural language processing (NLP) tools, such as GPT-4, can greatly accelerate data extraction for machine learning to predict microbial performance under complex strain engineering and bioreactor conditions. As a proof of concept, GPT-4 was used to extract knowledge from 176 publications, resulting in 2037 data instances uploaded to a crowdsourcing online database. The centralized datasets and feature selection enabled a random forest model to predict fermentation titers of an industrially important yeast (Yarrowia lipolytica) with high accuracy (R² of 0.86 for unseen test data). Via transfer learning, the trained model could assess the production capability of nonconventional yeasts (e.g., Rhodosporidium toruloides). This work showed the potential of generative AI to automate information extraction from research articles and of advanced AI applications to facilitate design-build-test-learn (DBTL) cycles for biomanufacturing as well as biotech commercial decisions.
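For readers unfamiliar with the accuracy figure quoted in the abstract, the coefficient of determination on held-out data is conventionally computed as one minus the ratio of the residual sum of squares to the total sum of squares. The following is a minimal sketch of that standard definition only. The abstract does not describe the authors' evaluation code, so the function name `r_squared` and the example titer values are assumptions made purely for illustration.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error of the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all true values are equal")
    return 1.0 - ss_res / ss_tot


# Hypothetical measured titers and model predictions (not from the paper).
measured = [1.0, 2.0, 3.0]
predicted = [1.0, 2.0, 4.0]
print(r_squared(measured, predicted))
```

A value of 1 means the predictions match the measurements exactly, while a value of 0 means the model does no better than predicting the mean of the test set.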