UniTox: Leveraging LLMs to Curate a Unified Dataset of Drug-Induced Toxicity from FDA Labels

Jake Silberg, Kyle Swanson, Elana Simon, Angela Zhang, Zaniar Ghazizadeh, Scott Ogden, Hisham Hamadeh, James Zou
DOI: https://doi.org/10.1101/2024.06.21.24309315
2024-06-22
Abstract: Drug-induced toxicity is one of the leading reasons new drugs fail clinical trials. Machine learning models that predict drug toxicity from molecular structure could help researchers prioritize less toxic drug candidates. However, current toxicity datasets are typically small and limited to a single organ system (e.g., cardio, renal, or liver). Creating these datasets often involved time-intensive expert curation by parsing drug label documents that can exceed 100 pages per drug. Here, we introduce UniTox, a unified dataset of 2,418 FDA-approved drugs with drug-induced toxicity summaries and ratings created by using GPT-4o to process FDA drug labels. UniTox spans eight types of toxicity: cardiotoxicity, liver toxicity, renal toxicity, pulmonary toxicity, hematological toxicity, dermatological toxicity, ototoxicity, and infertility. This is, to the best of our knowledge, the largest such systematic human in vivo database by number of drugs and toxicities, and the first covering nearly all FDA-approved medications for several of these toxicities. We recruited clinicians to validate a random sample of our GPT-4o annotated toxicities, and UniTox toxicity ratings concord with clinician labelers 87–96% of the time. Finally, we benchmark a graph neural network trained on UniTox to demonstrate the utility of this dataset for building molecular toxicity prediction models.
Health Informatics
What problem does this paper attempt to address?
The paper addresses the lack of large, multi-organ datasets of drug-induced toxicity. Machine-learning models that predict toxicity from molecular structure could help researchers prioritize less toxic drug candidates, but current toxicity datasets are usually small and limited to a single organ system (such as the heart, kidney or liver). Building them is slow, because experts must manually parse label documents that can exceed 100 pages per drug. To close this gap, the authors introduce **UniTox**, a unified dataset of 2,418 FDA-approved drugs with drug-induced toxicity summaries and ratings, generated by processing FDA drug labels with GPT-4o. UniTox covers eight types of toxicity: cardiotoxicity, liver toxicity, renal toxicity, pulmonary toxicity, hematological toxicity, dermatological toxicity, ototoxicity and infertility.

### Main contributions:

1. **Constructing the UniTox dataset**: Large language models (LLMs) were used to classify drug toxicity from FDA labels quickly, producing a large human in vivo dataset of 2,418 FDA-approved drugs across multiple toxicity types.
2. **Verifying accuracy**: UniTox ratings were compared against existing expert-curated datasets.
3. **Clinical verification**: Clinicians were recruited to validate a random sample of the ratings, and UniTox concords with the clinician labelers 87-96% of the time.
4. **Model performance evaluation**: A graph neural network (GNN) was trained on UniTox, demonstrating the utility of the dataset for building molecular toxicity prediction models.

### Method overview:

1. **Data collection and pre-processing**: 2,418 drugs and their labels were selected from the FDALabel database, and drugs with topical, irrigation and intradermal routes of administration were removed.
2. **Generating toxicity ratings**: Toxicity ratings were generated with GPT-4o using chain-of-thought prompting in a two-layer prompt system. The first-layer prompt asks the model to summarize the information in the drug label about a specific type of toxicity. The second-layer prompt asks the model to produce a ternary (none / less / most) or binary (none / yes) toxicity rating based on that summary.
3. **External dataset verification**: UniTox ratings were compared with three FDA-developed datasets, DICTrank, DILIrank and DIRIL, to evaluate their accuracy.
4. **Clinician verification**: For the five toxicity types without an existing reference dataset, clinicians manually reviewed 100 randomly sampled drugs.

### Results:

1. **UniTox dataset**: It covers eight toxicity types for 2,418 drugs. Each drug has a GPT-4o-generated toxicity summary, ternary and binary toxicity ratings, and the SPL ID of the label used to generate the data.
2. **Verification results**: In comparisons with DICTrank, DILIrank and DIRIL, UniTox is reported to be more accurate than existing methods, especially on high-confidence predictions.
3. **Clinician verification**: Clinicians judged 87-96% of the drug ratings to be accurate. The review also surfaced some edge cases and possible directions for improving the model.
4. **GNN model performance**: The Chemprop-RDKit model trained on UniTox performs well in a multi-task setting and achieves relatively high ROC-AUC values across the toxicity types.

### Conclusion:

UniTox is a large-scale, multi-toxicity dataset. The combination of LLM-based curation and clinician validation supports its utility and reliability for drug toxicity prediction. It provides a valuable resource for future drug research and development and may help improve the success rate and safety of clinical trials.
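
The two-layer prompting scheme and the clinician concordance check described in the method overview can be illustrated with a minimal sketch. Everything named here is an assumption for illustration: the prompt wording, the `rate_toxicity`, `parse_rating` and `concordance` helpers, and the `llm` callable (a stand-in for whatever model client is used) are hypothetical and do not reproduce the authors' actual prompts or code. Only the rating scales (none / less / most and none / yes) come from the summary above.

```python
from __future__ import annotations

from typing import Callable, Sequence

# Hypothetical prompt wording; the paper's real prompts are not reproduced here.
SUMMARY_PROMPT = (
    "Summarize all information about {toxicity} in the drug label below.\n\n{label}"
)
RATING_PROMPT = (
    "Based on the summary below, rate the drug's {toxicity} risk. "
    "Reason step by step, then end with 'Rating: <one of {options}>'.\n\n{summary}"
)

# Rating scales as described in the summary above.
TERNARY = ("none", "less", "most")
BINARY = ("none", "yes")


def parse_rating(reply: str, options: Sequence[str]) -> str:
    """Extract the text after the last 'Rating:' marker and validate it."""
    parts = reply.rsplit("Rating:", 1)
    if len(parts) != 2:
        raise ValueError("no 'Rating:' marker in reply")
    lines = parts[1].strip().splitlines()
    candidate = lines[0].strip().rstrip(".").lower() if lines else ""
    if candidate not in options:
        raise ValueError(f"unexpected rating: {candidate!r}")
    return candidate


def rate_toxicity(
    label_text: str,
    toxicity: str,
    llm: Callable[[str], str],  # assumed interface: prompt in, reply text out
    options: Sequence[str] = TERNARY,
) -> tuple[str, str]:
    """Layer 1 summarizes the label; layer 2 rates based on that summary."""
    summary = llm(SUMMARY_PROMPT.format(toxicity=toxicity, label=label_text))
    reply = llm(
        RATING_PROMPT.format(
            toxicity=toxicity, options="/".join(options), summary=summary
        )
    )
    return summary, parse_rating(reply, options)


def concordance(model: Sequence[str], clinician: Sequence[str]) -> float:
    """Fraction of drugs on which the model and clinician ratings agree."""
    if not model or len(model) != len(clinician):
        raise ValueError("rating lists must be non-empty and of equal length")
    return sum(m == c for m, c in zip(model, clinician)) / len(model)


def fake_llm(prompt: str) -> str:
    """Canned replies so the sketch runs without any model access."""
    if prompt.startswith("Summarize"):
        return "The label carries a boxed warning for hepatotoxicity."
    return "A boxed warning indicates serious risk.\nRating: most"


summary, rating = rate_toxicity("(label text)", "liver toxicity", fake_llm)
agreement = concordance(
    ["most", "none", "less", "none"], ["most", "none", "none", "none"]
)
```

Passing the model in as a plain callable keeps the sketch independent of any particular client library, and separating the summary step from the rating step mirrors the way the second-layer prompt only sees the first-layer summary rather than the full label.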