Language Interaction Network for Clinical Trial Approval Estimation

Chufan Gao, Tianfan Fu, Jimeng Sun
2024-04-26
Abstract: Clinical trial outcome prediction seeks to estimate the likelihood that a clinical trial will successfully reach its intended endpoint. This process predominantly involves the development of machine learning models that utilize a variety of data sources such as descriptions of the clinical trials, characteristics of the drug molecules, and specific disease conditions being targeted. Accurate predictions of trial outcomes are crucial for optimizing trial planning and prioritizing investments in a drug portfolio. While previous research has largely concentrated on small-molecule drugs, there is a growing need to focus on biologics, a rapidly expanding category of therapeutic agents that often lack the well-defined molecular properties associated with traditional drugs. Additionally, applying conventional methods like graph neural networks to biologics data proves challenging due to their complex nature. To address these challenges, we introduce the Language Interaction Network (LINT), a novel approach that predicts trial outcomes using only the free-text descriptions of the trials. We have rigorously tested the effectiveness of LINT across three phases of clinical trials, where it achieved ROC-AUC scores of 0.770, 0.740, and 0.748 for phases I, II, and III, respectively, specifically concerning trials involving biologic interventions.
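For readers unfamiliar with the metric, ROC-AUC can be read as the probability that a randomly chosen successful trial receives a higher predicted score than a randomly chosen failed one. The following is a minimal sketch of that definition in plain Python. The labels and scores are made-up toy values for illustration only; they are not data or results from the paper, and `roc_auc` is a hypothetical helper name.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (illustrative values only): 1 = trial succeeded, 0 = trial failed.
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.2, 0.7]
auc = roc_auc(labels, scores)
print(f"ROC-AUC on toy data: {auc:.3f}")
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives a scale for reading the per-phase scores reported above.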
Biomolecules, Computation and Language, Machine Learning
What problem does this paper attempt to address?
This paper addresses clinical trial outcome prediction: estimating whether a trial will successfully reach its intended endpoint. Such estimates matter in drug development because they help optimize trial planning and guide investment decisions. Existing machine learning models typically draw on trial descriptions, drug molecular properties, and disease conditions, but they struggle with biologics, a rapidly growing class of therapeutic agents whose molecular properties are less well defined than those of traditional small-molecule drugs.
The paper proposes the Language Interaction Network (LINT), which predicts trial outcomes solely from the free-text descriptions of clinical trials. LINT builds on pre-trained language models such as BERT and combines drug information with relevant medical codes (ICD codes) to predict the outcomes of Phase I, II, and III trials. On trials involving biologics, LINT achieves ROC-AUC scores of 0.770, 0.740, and 0.748 for Phases I, II, and III, respectively, which the summary describes as better than traditional models. LINT is also interpretable: it uses Shapley values to highlight the parts of the input text that most influence a prediction.
Compared to previous work, LINT uses a larger dataset that covers both small-molecule drugs and biologics, and it can handle complex text and tabular data. The paper also discusses remaining challenges, such as limited training data and the diversity of trial types, and notes that LINT can serve as an open-source framework adaptable to new pre-trained language models. Possible future directions include unsupervised learning strategies to expand annotated datasets, improving label quality, and building more interpretable models to optimize clinical trial design and increase success rates.
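To make the Shapley-value idea concrete, here is a minimal sketch that computes exact Shapley values for the tokens of a short text by averaging each token's marginal contribution over all orderings. This is not LINT's implementation: `toy_score`, its weights, and the example tokens are hypothetical stand-ins for a real model's predicted success probability, and exact enumeration is only feasible for a handful of tokens (practical tools approximate it).

```python
from itertools import permutations


def shapley_values(tokens, score):
    """Exact Shapley value of each token for a scoring function `score`.

    `score` takes the list of tokens currently present (in original order)
    and returns a number. Cost grows factorially with len(tokens).
    """
    n = len(tokens)
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        present = set()
        prev = score([])  # score with no tokens present
        for i in order:
            present.add(i)
            cur = score([tokens[j] for j in sorted(present)])
            totals[i] += cur - prev  # marginal contribution of token i
            prev = cur
    return [t / len(orderings) for t in totals]


# Hypothetical scorer standing in for a trained model (assumption, not LINT).
WEIGHTS = {"randomized": 0.2, "placebo": 0.1, "terminated": -0.3}


def toy_score(present_tokens):
    s = 0.5 + sum(WEIGHTS.get(t, 0.0) for t in present_tokens)
    # Small interaction term: the two tokens together add a bonus.
    if "randomized" in present_tokens and "placebo" in present_tokens:
        s += 0.1
    return s


tokens = ["randomized", "placebo", "terminated"]
phi = shapley_values(tokens, toy_score)
for tok, value in zip(tokens, phi):
    print(f"{tok:>12}: {value:+.3f}")
```

The attributions sum to the difference between the score on the full text and the score on the empty text, and the interaction bonus is split evenly between the two tokens that produce it. That additivity is what lets per-token attributions be read as a decomposition of a single prediction.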