Abstract: Novel machine learning methods for tabular data generation are often developed on small datasets which do not match the scale required for scientific applications. We investigate a recent proposal to use XGBoost as the function approximator in diffusion and flow-matching models on tabular data, which proved to be extremely memory intensive, even on tiny datasets. In this work, we conduct a critical analysis of the existing implementation from an engineering perspective, and show that these limitations are not fundamental to the method; with a better implementation it can be scaled to datasets 370x larger than previously used. Our efficient implementation also unlocks scaling models to much larger sizes, which we show directly leads to improved performance on benchmark tasks. We also propose algorithmic improvements that can further benefit resource usage and model performance, including multi-output trees, which are well-suited to generative modeling. Finally, we present results on large-scale scientific datasets derived from experimental particle physics as part of the Fast Calorimeter Simulation Challenge. Code is available at https://github.com/layer6ai-labs/calo-forest.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper addresses the computational resource limitations encountered when XGBoost is used as the function approximator for tabular data generation in diffusion and flow-matching models. Specifically, the authors focus on:

1. **Excessive memory consumption of the existing implementation**: The original method is very memory-intensive even on small datasets, which makes it difficult to scale to the larger datasets required for scientific applications.
2. **Low algorithmic efficiency**: The original method does not take full advantage of XGBoost, so training time and memory usage grow significantly, especially on high-dimensional data.
3. **Limited model performance**: Because of these resource limits, the model cannot be scaled up fully, which hurts its performance on benchmark tasks.

To solve these problems, the authors made the following improvements:

- **Optimized implementation**: By redesigning the implementation, the memory requirement is reduced from growing approximately quadratically with dataset size to growing linearly, which significantly reduces memory overhead.
- **Algorithmic improvements**: Multi-output trees are introduced to represent high-dimensional joint distributions more effectively, and early stopping is used to improve model performance and prevent overfitting.
- **Extended application**: The feasibility of these improvements is verified on large-scale scientific datasets, in particular those of the Fast Calorimeter Simulation Challenge in experimental particle physics.

### Summary

The core objective of the paper is to optimize XGBoost-based diffusion and flow-matching models through engineering and algorithmic improvements, enabling them to handle larger tabular datasets and to perform better in practical applications.
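To make the flow-matching setup concrete, here is a minimal sketch of how regression pairs for a tree-based model could be built one time level at a time. It is an illustration under stated assumptions, not the paper's implementation: the function name `flow_matching_pairs`, the convention that noise sits at t=0 and data at t=1, and the single noise draw per row are all choices made for this example.

```python
import numpy as np

def flow_matching_pairs(x1, t, rng):
    """Build (input, target) regression pairs for one time level t.

    x1  : (n, d) array of data rows
    t   : float in [0, 1]
    rng : numpy random Generator
    Assumed convention: x0 ~ N(0, I) is noise at t=0, data sits at t=1.
    """
    x0 = rng.standard_normal(x1.shape)   # one noise draw per data row
    x_t = (1.0 - t) * x0 + t * x1        # point on the straight path from noise to data
    target = x1 - x0                     # velocity along that path, all d outputs at once
    return x_t, target

rng = np.random.default_rng(0)
data = rng.standard_normal((1000, 5))    # stand-in for a tabular dataset

for t in np.linspace(0.0, 1.0, 5):
    x_t, v = flow_matching_pairs(data, float(t), rng)
    # A regressor for this time level would be fit on (x_t, v) here.
    # x_t and v can then be discarded before the next level is built,
    # so only one noised copy of the data is held at a time.
```

Because the target `v` has one column per feature, a single multi-output regressor can be fit per time level instead of one model per feature. XGBoost 2.0 and later expose this through the `multi_strategy="multi_output_tree"` parameter with the `hist` tree method; whether the paper's code uses exactly that setting is not stated in the text above.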

Scaling Up Diffusion and Flow-based XGBoost Models

Generating and Imputing Tabular Data via Diffusion and Flow-based Gradient-Boosted Trees

BUFF: Boosted Decision Tree based Ultra-Fast Flow matching

XGBoost: A Scalable Tree Boosting System

A Simple and Fast Baseline for Tuning Large XGBoost Models

XFlow: Benchmarking Flow Behaviors over Graphs

Learning to Scale Logits for Temperature-Conditional GFlowNets

Generating Tabular Data Using Heterogeneous Sequential Feature Forest Flow Matching

Flow Matching for Scalable Simulation-Based Inference

Cross-Domain Graph Data Scaling: A Showcase with Diffusion Models

Semi-surrogate modelling of droplets evaporation process via XGBoost integrated CFD simulations

Survival regression with accelerated failure time model in XGBoost

DP-XGBoost: Private Machine Learning at Scale

XGBoost: Scalable GPU Accelerated Learning

Bigflow: A General Optimization Layer for Distributed Computing Frameworks

Distributing the Stochastic Gradient Sampler for Large-Scale LDA.

Scaling New Frontiers: Insights into Large Recommendation Models

Treeffuser: Probabilistic Predictions via Conditional Diffusions with Gradient-Boosted Trees

Gradient Boosting: A Computationally Efficient Alternative to Markov Chain Monte Carlo Sampling for Fitting Large Bayesian Spatio-Temporal Binomial Regression Models

Scalable Probabilistic Forecasting in Retail with Gradient Boosted Trees: A Practitioner's Approach

Histogram-Based Federated XGBoost using Minimal Variance Sampling for Federated Tabular Data