pystacked: Stacking generalization and machine learning in Stata

Achim Ahrens,Christian B. Hansen,Mark E. Schaffer
DOI: https://doi.org/10.48550/arXiv.2208.10896
2023-03-07
Abstract:pystacked implements stacked generalization (Wolpert, 1992) for regression and binary classification via Python's scikit-learn. Stacking combines multiple supervised machine learners -- the "base" or "level-0" learners -- into a single learner. The currently supported base learners include regularized regression, random forest, gradient boosted trees, support vector machines, and feed-forward neural nets (multi-layer perceptron). pystacked can also be used with as a `regular' machine learning program to fit a single base learner and, thus, provides an easy-to-use API for scikit-learn's machine learning algorithms.
Econometrics,Machine Learning
What problem does this paper attempt to address?
The problem that this paper attempts to solve is to implement stacked generalization of machine - learning algorithms in Stata, especially for regression and binary - classification tasks. Specifically, the paper introduces a new program named `pystacked`, which allows users to fit multiple machine - learning algorithms through Python's scikit - learn library and combine these algorithms into a final prediction as a weighted average of these individual predictions. The paper points out that when facing new prediction or classification tasks, it is usually very difficult to determine in advance which machine - learning algorithm is the most suitable. Therefore, as a model - averaging method, stacked generalization can provide better performance than any single learner by combining the predictions of multiple learners. `pystacked` not only supports stacked generalization but can also be used as a regular machine - learning program to fit a single base learner, thus providing an easy - to - use API for scikit - learn machine - learning algorithms. The paper emphasizes the uniqueness of `pystacked` in Stata, that is, it is the first tool to introduce stacked generalization into Stata. In addition, `pystacked` also supports multiple base learners, including regularized regression, random forest, gradient - boosted trees, support vector machines, and feed - forward neural networks (multilayer perceptrons). By using cross - validation to avoid overfitting, `pystacked` can effectively combine different types of pattern - recognition capabilities and improve prediction accuracy.