"Oh LLM, I'm Asking Thee, Please Give Me a Decision Tree": Zero-Shot Decision Tree Induction and Embedding with Large Language Models

Ricardo Knauer, Mario Koddenbrock, Raphael Wallsberger, Nicholas M. Brisson, Georg N. Duda, Deborah Falla, David W. Evans, Erik Rodner
2024-09-27
Abstract: Large language models (LLMs) provide powerful means to leverage prior knowledge for predictive modeling when data is limited. In this work, we demonstrate how LLMs can use their compressed world knowledge to generate intrinsically interpretable machine learning models, i.e., decision trees, without any training data. We find that these zero-shot decision trees can surpass data-driven trees on some small-sized tabular datasets and that embeddings derived from these trees perform on par with data-driven tree-based embeddings on average. Our knowledge-driven decision tree induction and embedding approaches therefore serve as strong new baselines for data-driven machine learning methods in the low-data regime.
Artificial Intelligence, Computation and Language, Machine Learning
What problem does this paper attempt to address?
This paper addresses how large language models (LLMs) can be used to generate interpretable machine-learning models, such as decision trees, when data is scarce. Specifically, the authors show how to generate zero-shot decision trees from the world knowledge encoded in an LLM, without any training data, and how to use these trees as feature representations (embeddings) for downstream models. Because no training data has to be shared with the LLM, the approach helps protect data privacy, and it also outperforms data-driven decision-tree models on some small tabular datasets.

### Main contributions of the paper:

1. **Zero-shot decision-tree generation**: The authors show how to use state-of-the-art LLMs to generate decision trees without accessing model weights or any training data. This zero-shot setting naturally protects data privacy, which broadens the applicability of LLMs across industries.
2. **Zero-shot decision-tree embedding**: The authors also show that these zero-shot decision trees can be used as feature representations for downstream models.
3. **Systematic comparison**: The authors systematically compare their decision-tree generation and embedding methods with existing state-of-the-art machine-learning methods. In the low-data regime, their knowledge-driven decision trees outperform data-driven decision trees on 27% of the datasets, and their zero-shot representations are not significantly different from data-driven tree-based embeddings.

### Method overview:

- **Zero-shot decision-tree generation**: Using purpose-built prompt templates, the authors have LLMs generate decision trees from the given features. These prompt templates include background information, a task description, the prediction target \( p \), and the maximum depth \( d \). The LLMs use the world knowledge accumulated during pre-training to turn feature names into decision rules.
- **Zero-shot decision-tree embedding**: The generated decision trees are used as feature representations: the truth values of a tree's internal nodes are converted into a binary vector through the mapping \( \chi_n \). For a forest composed of multiple decision trees, the final embedding is the concatenation of these binary vectors.

### Experimental results:

- On public datasets, the performance of zero-shot decision trees is comparable to that of data-driven decision trees, and even better on some datasets.
- On private datasets, zero-shot decision trees also perform better than data-driven baseline models on ACL injury and post-traumatic pain data.

In conclusion, this paper proposes a novel method for using the knowledge of LLMs to generate interpretable decision trees and shows that it is effective in the low-data regime. This opens a new direction for future research, especially in fields such as medicine, where data is scarce but interpretability is essential.
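
To make the two steps of the method overview more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the prompt wording, the names `build_prompt`, `InternalNode`, `embed_tree`, and `embed_forest`, the example features, and the threshold-style conditions are all hypothetical. The sketch also assumes that every internal node's condition is evaluated on the sample independently of the path taken through the tree, which is one plausible reading of the mapping \( \chi_n \).

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence

Sample = Dict[str, float]


def build_prompt(features: Sequence[str], target: str, max_depth: int,
                 background: str = "") -> str:
    """Fill a hypothetical prompt template with the ingredients named in the
    summary: background, task description, prediction target p, max depth d."""
    return (
        f"{background}\n"
        "Task: induce a decision tree without any training data.\n"
        f"Features: {', '.join(features)}\n"
        f"Prediction target: {target}\n"
        f"Maximum depth: {max_depth}\n"
    )


@dataclass
class InternalNode:
    """One internal node of a tree, assumed here to test `feature <= threshold`."""
    feature: str
    threshold: float

    def is_true(self, x: Sample) -> bool:
        return x[self.feature] <= self.threshold


def embed_tree(nodes: Sequence[InternalNode], x: Sample) -> List[int]:
    """Map the truth value of each internal node to 0 or 1 (assumed chi_n)."""
    return [1 if node.is_true(x) else 0 for node in nodes]


def embed_forest(forest: Sequence[Sequence[InternalNode]], x: Sample) -> List[int]:
    """Concatenate the per-tree binary vectors into one embedding."""
    embedding: List[int] = []
    for tree in forest:
        embedding.extend(embed_tree(tree, x))
    return embedding


# Hypothetical example: two small trees, as an LLM might return them.
prompt = build_prompt(["age", "pain_score", "bmi"], "persistent pain", max_depth=2)
tree_a = [InternalNode("age", 50.0), InternalNode("pain_score", 3.0)]
tree_b = [InternalNode("bmi", 25.0)]
sample = {"age": 42.0, "pain_score": 6.0, "bmi": 23.5}
embedding = embed_forest([tree_a, tree_b], sample)
print(embedding)
```

The resulting binary vector could then be passed to any downstream model in place of, or alongside, the raw features.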