A Bayesian algorithm for sample selection bias correction

Valerio Astuti
DOI: https://doi.org/10.48550/arXiv.2212.09813
2022-12-20
Abstract: In this paper we present a technique to couple non-traditional data with statistics based on survey data, in order to partially correct for the bias produced by non-random sample selections. All major social media platforms represent huge samples of the general population, generated by a self-selection process. This implies that they are not representative of the larger public, and there are problems in extrapolating conclusions drawn from these samples to the whole population. We present an algorithm to integrate these massive data with ones coming from traditional sources, with the properties of being less extensive but more reliable. This integration allows to exploit the best of both worlds and reach the detail of typical "big data" sources and the representativeness of a carefully designed sample survey.
Methodology
What problem does this paper attempt to address?
The main problem this paper addresses is the correction of sample selection bias in non-random samples. Specifically, the author proposes a Bayesian algorithm that integrates traditional survey data with non-traditional big data (such as social media data) to partially correct the bias produced by self-selection. This involves several key aspects:

1. **Sample Bias Problem**: Social media users end up in the sample through a self-selection process, so the sample does not represent the general population, and conclusions drawn from it are difficult to generalize to the whole population. The paper points out that although social media data provide a large amount of detailed information at low cost, they are not representative.
2. **Integrating Different Data Sources**: The paper proposes a method that exploits the advantages of both data sources by combining large-scale but potentially unrepresentative data (such as social media data) with small-scale but more reliable data (such as traditional survey data). The method aims to achieve the level of detail of a "big data" source together with the representativeness of a well-designed sample survey.
3. **Application of the Bayesian Algorithm**: The author introduces an algorithm based on the Bayesian framework for estimating the population distribution. The algorithm uses the known sample selection probabilities and sample distributions, together with some prior knowledge of the population distribution (such as the mean or other statistical characteristics), to infer the most likely population distribution.
4. **Theoretical Framework and Application**: The paper details the theoretical framework, including how to construct the prior distribution through the maximum entropy principle and how to use the Lagrange multiplier method to solve the resulting optimization problem.

The author also verifies the effectiveness of the method on simulated data sets and real Twitter data sets, showing how the algorithm performs in different situations.

In summary, the main objective of this paper is to provide an effective way to reduce or correct the bias caused by a non-random sample selection process, thereby improving the accuracy and reliability of information extracted from new data sources such as social media.
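
The two ingredients named above (known selection probabilities, and a known population mean imposed through a Lagrange multiplier) can be illustrated with a small sketch. This is not the paper's algorithm, only two standard building blocks under simplifying assumptions: a discrete variable, a selection probability known for every value, and a single mean constraint. The function names `reweight_by_selection` and `tilt_to_mean` are hypothetical. The second function returns the distribution closest in relative entropy to a base distribution subject to the mean constraint, which has the form p(x) ∝ base(x)·exp(λx), with the multiplier λ found by bisection.

```python
import numpy as np


def reweight_by_selection(sample_dist, selection_prob):
    """Undo a known selection mechanism: p(x) is proportional to q(x) / s(x)."""
    weights = np.asarray(sample_dist, float) / np.asarray(selection_prob, float)
    return weights / weights.sum()


def tilt_to_mean(values, base_dist, target_mean, n_iter=200):
    """Minimum relative entropy distribution w.r.t. base_dist with a fixed mean.

    The solution is p(x) proportional to base(x) * exp(lam * x), where lam is
    the Lagrange multiplier of the mean constraint. The tilted mean increases
    monotonically in lam, so lam can be found by bisection.
    """
    values = np.asarray(values, float)
    base = np.asarray(base_dist, float)
    support = values[base > 0]
    if not support.min() < target_mean < support.max():
        raise ValueError("target_mean must lie strictly inside the support")

    def tilted(lam):
        log_w = lam * values
        log_w -= log_w.max()          # keep exp() from overflowing
        w = base * np.exp(log_w)
        return w / w.sum()

    def mean_at(lam):
        return float(tilted(lam) @ values)

    # Expand the bracket until it contains the multiplier we are looking for.
    lo, hi = -1.0, 1.0
    while mean_at(lo) > target_mean:
        lo *= 2.0
    while mean_at(hi) < target_mean:
        hi *= 2.0

    # Plain bisection on the monotone function mean_at(lam) - target_mean.
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if mean_at(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    return tilted(0.5 * (lo + hi))


if __name__ == "__main__":
    x = np.array([0.0, 1.0, 2.0, 3.0])
    q = np.array([0.1, 0.2, 0.3, 0.4])   # distribution seen in the big sample
    s = np.array([0.1, 0.2, 0.3, 0.4])   # assumed selection probabilities
    p = reweight_by_selection(q, s)
    print("reweighted:", p)
    print("tilted to survey mean 1.0:", tilt_to_mean(x, p, 1.0))
```

In practice the selection probabilities are rarely known exactly, which is where a reliable survey statistic such as the mean becomes useful: the tilting step pulls the big-data distribution toward a quantity that the survey measures well.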