Treeffuser: Probabilistic Predictions via Conditional Diffusions with Gradient-Boosted Trees

Nicolas Beltran-Velez, Alessandro Antonio Grande, Achille Nazaret, Alp Kucukelbir, David Blei
2024-10-22
Abstract: Probabilistic prediction aims to compute predictive distributions rather than single point predictions. These distributions enable practitioners to quantify uncertainty, compute risk, and detect outliers. However, most probabilistic methods assume parametric responses, such as Gaussian or Poisson distributions. When these assumptions fail, such models lead to bad predictions and poorly calibrated uncertainty. In this paper, we propose Treeffuser, an easy-to-use method for probabilistic prediction on tabular data. The idea is to learn a conditional diffusion model where the score function is estimated using gradient-boosted trees. The conditional diffusion model makes Treeffuser flexible and non-parametric, while the gradient-boosted trees make it robust and easy to train on CPUs. Treeffuser learns well-calibrated predictive distributions and can handle a wide range of regression tasks -- including those with multivariate, multimodal, and skewed responses. We study Treeffuser on synthetic and real data and show that it outperforms existing methods, providing better calibrated probabilistic predictions. We further demonstrate its versatility with an application to inventory allocation under uncertainty using sales data from Walmart. We implement Treeffuser in https://github.com/blei-lab/treeffuser.
Machine Learning
What problem does this paper attempt to address?
This paper addresses **probabilistic prediction on tabular data**. The authors propose a method named **Treeffuser** that produces a predictive distribution rather than a single point prediction, which lets practitioners quantify uncertainty, compute risk, and detect outliers.

#### Core problems:

1. **Limitations of existing methods**: Most existing probabilistic prediction methods assume a parametric response distribution (such as Gaussian or Poisson). When these assumptions do not hold, the model produces poor predictions and poorly calibrated uncertainty.
2. **The need to handle complex distributions**: Real-world data often has complex distributional characteristics, such as multimodal, inflated, multivariate, and skewed responses. Traditional probabilistic prediction methods struggle to handle these cases effectively.

#### Goals of Treeffuser:

- **Flexibility and non-parametric modeling**: Treeffuser uses a conditional diffusion model, so it can adapt to a wide range of complex conditional distributions without strong assumptions about the form of the response distribution.
- **Efficiency and robustness**: The score function of the diffusion model is estimated with gradient-boosted trees (GBTs), which makes Treeffuser robust and easy to train on CPUs.
- **Accuracy**: Treeffuser outperforms existing methods on multiple benchmark datasets, providing better-calibrated probabilistic predictions, including more accurate quantile estimates and accurate mean predictions.

#### Application scenarios:

- **Industrial process optimization**: For example, a manufacturing plant adjusting its operations based on raw-material properties, process settings, and temperature in order to reduce emissions and maximize profit.
- **Inventory management**: An inventory-allocation task under uncertainty, built on Walmart sales data, demonstrates Treeffuser's potential in practical applications.

In conclusion, Treeffuser aims to provide a powerful and flexible tool for probabilistic prediction on tabular data, especially for tasks that involve complex response distributions.
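To make the inventory use case concrete, here is a minimal sketch of how samples from a predictive distribution can drive a stocking decision. It is not the paper's formulation: it assumes a textbook single-period newsvendor setting, and the demand samples are simulated with Python's `random` module as a stand-in for samples drawn from a fitted probabilistic model. The function names, the price and cost values, and the two-mode demand mixture are all hypothetical.

```python
import math
import random


def expected_profit(order_qty, demand_samples, price, cost):
    """Monte Carlo estimate of expected profit for one order quantity.

    Units sold are capped by both demand and stock on hand; every
    ordered unit costs `cost`, whether or not it sells.
    """
    total = 0.0
    for demand in demand_samples:
        total += price * min(order_qty, demand) - cost * order_qty
    return total / len(demand_samples)


def best_order_quantity(demand_samples, price, cost, candidates):
    """Return the candidate order quantity with the highest estimated profit."""
    return max(
        candidates,
        key=lambda q: expected_profit(q, demand_samples, price, cost),
    )


def empirical_quantile(samples, level):
    """Smallest sample value whose empirical CDF is at least `level`."""
    ordered = sorted(samples)
    index = max(0, math.ceil(level * len(ordered)) - 1)
    return ordered[index]


# ASSUMPTION: simulated stand-in for predictive samples of weekly demand.
# Roughly 30% of weeks are "promotion" weeks with much higher demand,
# which gives a bimodal distribution that a single Gaussian fits poorly.
rng = random.Random(0)
demand_samples = [
    max(0.0, rng.gauss(300, 30) if rng.random() < 0.3 else rng.gauss(100, 15))
    for _ in range(5000)
]

price, cost = 10.0, 6.0  # hypothetical unit price and unit cost

# Option 1: brute-force search over a grid of order quantities.
q_star = best_order_quantity(demand_samples, price, cost, range(0, 501, 5))

# Option 2: the newsvendor rule, i.e. order the demand quantile at the
# critical ratio (price - cost) / price.
critical_ratio = (price - cost) / price
q_quantile = empirical_quantile(demand_samples, critical_ratio)

# Baseline: order the mean demand, ignoring the shape of the distribution.
mean_demand = sum(demand_samples) / len(demand_samples)

print(f"grid search order quantity: {q_star}")
print(f"quantile rule order quantity: {q_quantile:.1f}")
print(f"profit at grid optimum: {expected_profit(q_star, demand_samples, price, cost):.1f}")
print(f"profit when ordering the mean: {expected_profit(mean_demand, demand_samples, price, cost):.1f}")
```

Both options need only samples, so any model that can sample from the conditional distribution of demand could supply `demand_samples`. With a multimodal demand distribution, the mean falls between the two modes and is a poor order quantity, which is why a well-calibrated quantile matters more here than an accurate point prediction.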