Abstract: In this paper we present a technique to couple non-traditional data with statistics based on survey data, in order to partially correct for the bias produced by non-random sample selection. All major social media platforms represent huge samples of the general population, generated by a self-selection process. This implies that they are not representative of the larger public, and there are problems in extrapolating conclusions drawn from these samples to the whole population. We present an algorithm to integrate these massive data with data from traditional sources, which are less extensive but more reliable. This integration makes it possible to exploit the best of both worlds, reaching both the detail of typical "big data" sources and the representativeness of a carefully designed sample survey.

What problem does this paper attempt to address?

The main problem this paper addresses is the correction of selection bias in non-random samples. Specifically, the authors propose a Bayesian algorithm that integrates traditional survey data with non-traditional big data (such as social media data) to partially correct the bias produced by self-selection. This involves several key aspects:

1. **Sample Bias Problem**: Social media users enter the sample through a self-selection process, so they do not represent the general population, and conclusions drawn from these samples are difficult to generalize to the whole population. The paper points out that although social media data provide a large amount of detailed information at low cost, they are not representative.
2. **Integrating Different Data Sources**: The paper proposes a method that exploits the advantages of both data sources by combining large-scale but potentially unrepresentative data (such as social media data) with small-scale but more reliable data (such as traditional survey data). The method aims to achieve the level of detail of a "big data" source together with the representativeness of a well-designed sample survey.
3. **Application of the Bayesian Algorithm**: The authors introduce an algorithm within a Bayesian framework for estimating the population distribution. The algorithm uses the known sample selection probabilities and the sample distribution, together with some prior knowledge of the population distribution (such as its mean or other summary statistics), to infer the most likely population distribution.
4. **Theoretical Framework and Application**: The paper details the theoretical framework, including how to construct the prior distribution through the maximum entropy principle and how to solve the resulting optimization problem with the method of Lagrange multipliers.

In addition, the authors verify the effectiveness of the method on simulated data sets and real Twitter data sets, showing how the algorithm performs in different situations. In summary, the main objective of the paper is to provide a technique that reduces or corrects the bias caused by non-random sample selection, thereby improving the accuracy and reliability of information extracted from new data sources such as social media.
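
Two ingredients named in the summary above, reweighting a sample distribution by known selection probabilities and building a maximum-entropy prior with a Lagrange multiplier, can be illustrated for a discrete variable. The following is a minimal sketch under simplifying assumptions, not the paper's algorithm: the function names, the three-category example, and the bisection bounds are all hypothetical and chosen only for illustration.

```python
import math


def correct_selection_bias(sample_dist, selection_prob):
    """Reweight an observed distribution by inverse selection probability.

    Assumes category i of the population ends up in the sample with a
    known probability selection_prob[i] > 0, so the population share is
    proportional to sample_dist[i] / selection_prob[i].
    """
    weights = [q / s for q, s in zip(sample_dist, selection_prob)]
    total = sum(weights)
    return [w / total for w in weights]


def _exponential_family(values, lam):
    """Return p_i proportional to exp(lam * x_i), computed stably."""
    exponents = [lam * x for x in values]
    shift = max(exponents)  # avoids overflow in exp()
    w = [math.exp(e - shift) for e in exponents]
    z = sum(w)
    return [wi / z for wi in w]


def max_entropy_with_mean(values, target_mean, lo=-50.0, hi=50.0, iters=200):
    """Maximum-entropy distribution on `values` with a prescribed mean.

    Maximizing entropy subject to a mean constraint gives, via a Lagrange
    multiplier lam, the form p_i proportional to exp(lam * x_i). The mean
    is non-decreasing in lam, so lam can be found by bisection.
    Assumes min(values) < target_mean < max(values) and that the solution
    lies inside the (hypothetical) bracket [lo, hi].
    """
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        p = _exponential_family(values, mid)
        mean = sum(x * pi for x, pi in zip(values, p))
        if mean < target_mean:
            lo = mid
        else:
            hi = mid
    return _exponential_family(values, (lo + hi) / 2.0)


# Illustrative data (made up): a population with shares 0.5 / 0.3 / 0.2,
# observed through a platform that over-selects the third category.
true_population = [0.5, 0.3, 0.2]
selection_prob = [0.1, 0.4, 0.8]
observed = [p * s for p, s in zip(true_population, selection_prob)]
observed = [o / sum(observed) for o in observed]

recovered = correct_selection_bias(observed, selection_prob)
prior = max_entropy_with_mean([0, 1, 2], target_mean=0.5)
print("observed :", observed)
print("recovered:", recovered)
print("max-entropy prior with mean 0.5:", prior)
```

In this toy setting the reweighting step recovers the population shares exactly because the selection probabilities are assumed to be known without error; with estimated selection probabilities or additional survey constraints, a fuller Bayesian treatment such as the one the paper describes would be needed.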

A Bayesian algorithm for sample selection bias correction

Correcting Classifiers for Sample Selection Bias in Two-Phase Case-Control Studies

Integrating representative and non-representative survey data for efficient inference

Sampling techniques for big data analysis in finite population inference

Bayesian Nonparametric Weighted Sampling Inference

Correcting Sociodemographic Selection Biases for Population Prediction from Social Media

Bayesian Uncertainty Estimation Under Complex Sampling

Multiple bias-calibration for adjusting selection bias of non-probability samples using data integration

Towards Bayesian Data Selection

Addressing sample selection bias for machine learning methods

Sample Selection Bias in Machine Learning for Healthcare

Estimation of Population Size from Biased Samples Using Non-Parametric Binary Regression

Data Integration by combining big data and survey sample data for finite population inference

On Using Bayesian Methods to Address Small Sample Problems

A Bayesian approach on sample size calculation for comparing means

Bias correction models for electronic health records data in the presence of non-random sampling

Fully Bayesian estimation under informative sampling

Efficient estimation and correction of selection-induced bias with order statistics

Bayesian Safety Surveillance with Adaptive Bias Correction

Propensity score adjustment using machine learning classification algorithms to control selection bias in online surveys