Scaling Up Diffusion and Flow-based XGBoost Models

Jesse C. Cresswell, Taewoo Kim
2024-08-29
Abstract: Novel machine learning methods for tabular data generation are often developed on small datasets which do not match the scale required for scientific applications. We investigate a recent proposal to use XGBoost as the function approximator in diffusion and flow-matching models on tabular data, which proved to be extremely memory intensive, even on tiny datasets. In this work, we conduct a critical analysis of the existing implementation from an engineering perspective, and show that these limitations are not fundamental to the method; with better implementation it can be scaled to datasets 370x larger than previously used. Our efficient implementation also unlocks scaling models to much larger sizes which we show directly leads to improved performance on benchmark tasks. We also propose algorithmic improvements that can further benefit resource usage and model performance, including multi-output trees which are well-suited to generative modeling. Finally, we present results on large-scale scientific datasets derived from experimental particle physics as part of the Fast Calorimeter Simulation Challenge. Code is available at https://github.com/layer6ai-labs/calo-forest.
Machine Learning
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper addresses the computational resource limits encountered when XGBoost is used as the function approximator in diffusion and flow-matching models for tabular data generation. Specifically, the authors focus on:

1. **Excessive memory consumption of the existing implementation**: The original method is very memory-intensive even on small datasets, which makes it difficult to scale to the larger datasets required for scientific applications.
2. **Low algorithmic efficiency**: The original method does not take full advantage of XGBoost, so training time and memory usage grow significantly, especially on high-dimensional data.
3. **Limited model performance**: Because of these resource limits, the model cannot be scaled up fully, which hurts its performance on benchmark tasks.

To solve these problems, the authors make the following improvements:

- **Optimized implementation**: By redesigning the implementation, the memory requirement is reduced from roughly quadratic in the dataset size to linear, and the overall memory overhead drops significantly.
- **Algorithmic improvements**: Multi-output trees are introduced to represent high-dimensional joint distributions more effectively, and early stopping is used to improve model performance and prevent overfitting.
- **Extended application**: The improvements are validated on large-scale scientific datasets, in particular those of the Fast Calorimeter Simulation Challenge in experimental particle physics.

### Summary

The core objective of the paper is to optimize XGBoost-based diffusion and flow-matching models through engineering improvements, so that they can handle much larger tabular datasets and perform better in practical applications.
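
To make the multi-output trees and early stopping mentioned above more concrete, here is a minimal Python sketch of fitting one XGBoost regressor to a flow-matching target at a single time level. It is an illustration under stated assumptions, not the authors' implementation: the helper names `flow_matching_pairs` and `fit_velocity_model` are hypothetical, the linear interpolation between Gaussian noise and data is the common flow-matching convention and may differ from the paper's, and the hyperparameter values are arbitrary. The `multi_strategy="multi_output_tree"` option requires XGBoost 2.0 or later with `tree_method="hist"`.

```python
import numpy as np


def flow_matching_pairs(x_data, t, rng):
    """Build regression inputs and targets for one time level t in [0, 1].

    Assumes a linear path from Gaussian noise (t = 0) to data (t = 1);
    this convention is an assumption, not taken from the paper.
    """
    noise = rng.standard_normal(x_data.shape)
    x_t = (1.0 - t) * noise + t * x_data  # noisy input at time t
    velocity = x_data - noise             # target vector field along the path
    return x_t, velocity


def fit_velocity_model(x_train, x_val, t, seed=0):
    """Fit one multi-output XGBoost regressor for time level t."""
    import xgboost as xgb  # lazy import: the helper above only needs NumPy

    rng = np.random.default_rng(seed)
    inputs, targets = flow_matching_pairs(x_train, t, rng)
    val_inputs, val_targets = flow_matching_pairs(x_val, t, rng)

    model = xgb.XGBRegressor(
        n_estimators=200,                    # upper bound; early stopping may use fewer
        learning_rate=0.3,
        tree_method="hist",
        multi_strategy="multi_output_tree",  # one tree predicts all features jointly
        early_stopping_rounds=10,            # stop when validation loss stalls
    )
    model.fit(inputs, targets, eval_set=[(val_inputs, val_targets)], verbose=False)
    return model
```

In this sketch a single model predicts the whole velocity vector at once, instead of one ensemble per feature, and the training pairs for each time level are generated on demand rather than stored for all time levels at the same time. Calling `fit_velocity_model(x_train, x_val, t=0.5)` on two NumPy arrays with the same number of columns would return a fitted regressor whose `predict` output has one column per feature.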