Using Large Language Models for Expert Prior Elicitation in Predictive Modelling

Alexander Capstick, Rahul G. Krishnan, Payam Barnaghi
2024-11-26
Abstract: Large language models (LLMs), trained on diverse data, effectively acquire a breadth of information across various domains. However, their computational complexity, cost, and lack of transparency hinder their direct application for specialised tasks. In fields such as clinical research, acquiring expert annotations or prior knowledge about predictive models is often costly and time-consuming. This study proposes using LLMs to elicit expert prior distributions for predictive models. This approach also provides an alternative to in-context learning, where language models are tasked with making predictions directly. We compare LLM-elicited and uninformative priors, evaluate whether LLMs truthfully generate parameter distributions, and propose a model selection strategy for in-context learning and prior elicitation. Our findings show that LLM-elicited prior parameter distributions significantly reduce predictive error compared to uninformative priors in low-data settings. Applied to clinical problems, this translates to fewer required biological samples, lowering cost and resources. Prior elicitation also consistently outperforms and proves more reliable than in-context learning at a lower cost, making it a preferred alternative in our setting. We demonstrate the utility of this method across various use cases, including clinical applications. For infection prediction, using LLM-elicited priors reduced the number of required labels to achieve the same accuracy as an uninformative prior by 55%, at 200 days earlier in the study.
Machine Learning
### What problem does this paper attempt to solve?

This paper addresses the difficulty of obtaining expert prior knowledge for predictive modelling, and in particular how large language models (LLMs) can be used to improve model performance when data is scarce. The research focuses on four aspects:

1. **Reducing data requirements and costs**: In fields such as clinical research, labelled data is usually expensive and time-consuming to obtain. Using LLMs to elicit expert prior distributions significantly reduces prediction error when only a small amount of data is available, which lowers the number of biological samples required and, with it, cost and resource consumption.
2. **Providing an alternative to in-context learning**: In-context learning asks the language model to make predictions directly, which is computationally complex, costly, and lacks transparency. The paper instead elicits prior distributions from the language model and compares this approach with in-context learning in terms of reliability and effectiveness.
3. **Verifying whether language models truthfully generate parameter distributions**: The research examines whether language models can faithfully generate the parameter distributions of predictive models, and proposes a model selection strategy for choosing between in-context learning and prior elicitation.
4. **Exploring the Bayesian inference ability of language models**: By providing training examples and extracting the posterior distributions that language models use internally for prediction, the research team tests whether language models perform Bayesian inference and how consistently they do so across tasks.
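The first aspect above, that an informative prior can reduce error when labels are scarce, can be illustrated with a textbook conjugate update. This is only a sketch: it uses a single Gaussian mean with known noise variance rather than the paper's predictive models, and the function name `normal_posterior`, the observations, and both priors are made-up values chosen for illustration. The benefit shown here also depends on the informative prior being reasonably well centred.

```python
def normal_posterior(prior_mean, prior_var, observations, noise_var):
    """Conjugate update for a Gaussian mean with known noise variance."""
    n = len(observations)
    prior_precision = 1.0 / prior_var
    data_precision = n / noise_var
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (prior_mean * prior_precision
                            + sum(observations) / noise_var)
    return post_mean, post_var

# Hypothetical setup: the true effect and the three noisy labels are invented.
true_effect = 2.0
observations = [3.1, 0.6, 2.9]

# An informative prior (standing in for an elicited one) vs. a very wide prior.
informative = normal_posterior(1.8, 0.25, observations, noise_var=1.0)
uninformative = normal_posterior(0.0, 100.0, observations, noise_var=1.0)

print("informative   mean, var:", informative)
print("uninformative mean, var:", uninformative)
```

With only three labels, the posterior under the wide prior is essentially the sample mean, while the informative prior pulls the estimate toward its own centre and tightens the posterior variance.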
### Main contributions of the paper

- **Proposing a method of using large language models to generate expert prior distributions**: This method can significantly improve the performance of predictive models, especially when data is scarce.
- **Comparing LLM-elicited priors with uninformative priors**: The results show that LLM-elicited prior distributions significantly reduce prediction error and require less labelled data to reach the same accuracy.
- **Proposing methods for extracting in-context prior and posterior distributions from language models**: This gives researchers a closer view of how language models behave on different tasks and whether they truly perform Bayesian inference.
- **Applying Bayes factors for model selection**: Comparing prior elicitation with in-context learning, the research finds that prior elicitation is the better choice in all tested tasks, especially given its consistency and lower cost.

### Formula summary

- **Prior distribution**:
  \[ p(\theta \mid M, T) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\theta \mid \mu_k, \sigma_k^2) \]
  where \((\mu_k, \sigma_k) \sim p_{M,T}(\mu, \sigma \mid I_k)\) and \(\pi_k \sim \text{Dir}(1)\).
- **Posterior predictive distribution**:
  \[ p(y \mid \tilde{x}, D) = \int_{\Theta} \sum_{k=1}^{K} p(y \mid \tilde{x}, \theta) \, p(\theta \mid D, I_k) \, p(I_k) \, d\theta \]
- **Bayes factor**:
  \[ BF(\alpha_0, \alpha_1; D) = \frac{p(D \mid \alpha_0)}{p(D \mid \alpha_1)} = \frac{\int_{\Theta} p(\theta_0 \mid \alpha_0) \, p(D \mid \theta_0, \alpha_0) \, d\theta_0}{\int_{\Theta} p(\theta_1 \mid \alpha_1) \, p(D \mid \theta_1, \alpha_1) \, d\theta_1} \]

Through these methods and formulas, the paper shows how large language models can be used to improve predictive modelling, especially in data-scarce, cost-sensitive applications.
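The formula summary above can be made concrete with a small numerical sketch. It draws parameters from a Gaussian mixture prior with Dirichlet weights and estimates a log Bayes factor by naive Monte Carlo over prior samples. Several things here are assumptions rather than the paper's setup: the logistic-regression likelihood, the toy data, the `elicited_mus` and `elicited_sigmas` values (which stand in for what a language model might return for each task description), the wide Gaussian used as the comparison prior, and the helper names. It also compares two priors, whereas the paper uses the Bayes factor to choose between prior elicitation and in-context learning.

```python
import numpy as np

def sample_mixture_prior(mus, sigmas, n_samples, rng):
    """Draw theta from sum_k pi_k N(theta | mu_k, sigma_k^2), with pi ~ Dir(1)."""
    mus = np.asarray(mus, dtype=float)        # shape (K, d)
    sigmas = np.asarray(sigmas, dtype=float)  # shape (K, d)
    pi = rng.dirichlet(np.ones(len(mus)))     # mixture weights
    ks = rng.choice(len(mus), size=n_samples, p=pi)  # component per draw
    return rng.normal(mus[ks], sigmas[ks])    # shape (n_samples, d)

def log_marginal_likelihood(thetas, X, y):
    """Monte Carlo estimate of log p(D | alpha) from prior samples of theta.

    Assumes a Bernoulli likelihood with a logistic link (illustrative choice).
    """
    logits = X @ thetas.T                                    # (n, S)
    log_lik = (y[:, None] * logits
               - np.logaddexp(0.0, logits)).sum(axis=0)      # (S,)
    m = log_lik.max()                                        # log-mean-exp
    return m + np.log(np.mean(np.exp(log_lik - m)))

rng = np.random.default_rng(0)

# Toy data (made up): an intercept column plus one feature.
X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.0], [1.0, -0.3]])
y = np.array([1.0, 0.0, 1.0, 0.0])

# Hypothetical elicited components, one (mu_k, sigma_k) per task description.
elicited_mus = [[0.0, 1.0], [0.2, 1.5], [-0.1, 0.8]]
elicited_sigmas = [[0.5, 0.5], [0.5, 0.7], [0.4, 0.6]]

elicited = sample_mixture_prior(elicited_mus, elicited_sigmas, 5000, rng)
wide = rng.normal(0.0, 10.0, size=(5000, 2))  # stand-in for a vague prior

log_bf = (log_marginal_likelihood(elicited, X, y)
          - log_marginal_likelihood(wide, X, y))
print("estimated log Bayes factor (elicited vs. wide):", log_bf)
```

A positive log Bayes factor favours the first prior given the data. Averaging the likelihood over prior samples is the simplest estimator of the marginal likelihood and can have high variance, so a more careful estimator would be needed beyond a toy example like this one.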