A step towards the integration of machine learning and small area estimation
Tomasz Żądło,Adam Chwila
2024-02-12
Abstract:The use of machine-learning techniques has grown in numerous research areas.
Currently, it is also widely used in statistics, including the official
statistics for data collection (e.g. satellite imagery, web scraping and text
mining, data cleaning, integration and imputation) but also for data analysis.
However, the usage of these methods in survey sampling including small area
estimation is still very limited. Therefore, we propose a predictor supported
by these algorithms which can be used to predict any population or
subpopulation characteristics based on cross-sectional and longitudinal data.
Machine learning methods have already been shown to be very powerful in
identifying and modelling complex and nonlinear relationships between the
variables, which means that they have very good properties in case of strong
departures from the classic assumptions. Therefore, we analyse the performance
of our proposal under a different set-up, in our opinion of greater importance
in real-life surveys. We study only small departures from the assumed model, to
show that our proposal is a good alternative in this case as well, even in
comparison with optimal methods under the model. What is more, we propose the
method of the accuracy estimation of machine learning predictors, giving the
possibility of the accuracy comparison with classic methods, where the accuracy
is measured as in survey sampling practice. The solution of this problem is
indicated in the literature as one of the key issues in integration of these
approaches. The simulation studies are based on a real, longitudinal dataset,
freely available from the Polish Local Data Bank, where the prediction problem
of subpopulation characteristics in the last period, with "borrowing strength"
from other subpopulations and time periods, is considered.
Machine Learning,Econometrics,Methodology