dsld: A Socially Relevant Tool for Teaching Statistics

Taha Abdullah, Arjun Ashok, Brandon Estrada, Norman Matloff, Aditya Mittal
2024-11-07
Abstract: The growing power of data science can play a crucial role in addressing social discrimination, necessitating nuanced understanding and effective mitigation strategies of potential biases. Data Science Looks At Discrimination (dsld) is an R and Python package designed to provide users with a comprehensive toolkit of statistical and graphical methods for assessing possible discrimination related to protected groups, such as race, gender, and age. Our software offers techniques for discrimination analysis by identifying and mitigating confounding variables, along with methods for reducing bias in predictive models. In educational settings, dsld offers instructors powerful tools to teach important statistical principles through motivating real world examples of discrimination analysis. The inclusion of an 80-page Quarto book further supports users, from statistics educators to legal professionals, in effectively applying these analytical tools to real world scenarios.
Methodology, Information Retrieval, Machine Learning, Applications
What problem does this paper attempt to address?
The paper asks how data science can be used to identify and mitigate discrimination against protected groups, defined by attributes such as race, gender, and age. It introduces an R and Python software package, "Data Science Looks At Discrimination" (dsld), which gives users a comprehensive set of statistical and graphical methods for assessing potential discrimination, along with methods for reducing bias in predictive models.

### Main Objectives:

1. **Detect Discrimination**: Identify and adjust for confounding variables. For example, does a gender wage gap remain after accounting for factors such as age, occupation, and number of weeks worked?
2. **Reduce Bias in Prediction**: Reduce the influence of sensitive variables on predictive algorithms. For example, in a loan application evaluation tool, the applicant's race may enter as a predictor either explicitly or through proxy variables; the goal is to mitigate its impact.

### Educational Significance:

Beyond analyzing discrimination, the dsld package gives instructors a way to teach important statistical principles, using real-world discrimination analysis cases to motivate students. The accompanying 80-page Quarto book further supports users, from statistics educators to legal professionals, in applying these tools to practical scenarios.

### Technical Implementation:

- **Statistical and Graphical Methods**: Including linear models, logistic regression, and random forests.
- **Non-parametric Regression Models**: Used to handle complex interaction effects and non-linear relationships.
- **Visualization Tools**: Density plots, conditional difference plots, three-dimensional scatter plots, and parallel coordinate plots, which help users understand the analysis results intuitively.

### Key Formulas:

Suppose we have a response variable \(Y\), a covariate vector \(X\), and a sensitive variable \(S\), which can be continuous, binary, or categorical. The predicted value for a new case is denoted \(\hat{Y}\). A key modeling choice is whether to include interaction terms between the covariates \(X\) and the sensitive variable \(S\). In a linear model without interaction terms, the difference between groups is the same at every value of \(X\), so a single mean difference can be reported. With interaction terms, the difference varies with \(X\) and must be estimated at user-specified points of interest.

\[ \text{Model without interaction term: } Y = \beta_0 + \beta_1 X + \beta_2 S + \epsilon \]

\[ \text{Model with interaction term: } Y = \beta_0 + \beta_1 X + \beta_2 S + \beta_3 (X \times S) + \epsilon \]

With this choice available, the dsld package can analyze the relationships among the variables in finer detail and so identify and mitigate potential discrimination more accurately.
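To make the role of the interaction term concrete, the sketch below fits both linear models by ordinary least squares on a small synthetic data set. It does not use the dsld API. The data, the coefficient values, and the helper names (`solve`, `ols`, `gap_plain`, `gap_inter`) are all assumptions made up for illustration, with a single scalar covariate and a binary \(S\). For binary \(S\), the estimated group difference is \(\beta_2\) in the model without interaction and \(\beta_2 + \beta_3 x\) at covariate value \(x\) in the model with interaction.

```python
# Illustrative sketch only: plain least squares in pure Python, not the dsld API.

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)

# Synthetic, noise-free data (hypothetical values):
# Y = 1 + 2*X + 3*S + 0.5*X*S, with S in {0, 1}.
xs = [float(x) for x in range(20, 61, 5)]
data = [(x, s) for x in xs for s in (0, 1)]
y = [1 + 2 * x + 3 * s + 0.5 * x * s for x, s in data]

# Model without interaction: design columns are 1, X, S.
b_plain = ols([[1.0, x, s] for x, s in data], y)
# Model with interaction: design columns are 1, X, S, X*S.
b_inter = ols([[1.0, x, s, x * s] for x, s in data], y)

def gap_plain(x):
    """Estimated S=1 vs S=0 difference; does not depend on x without interaction."""
    return b_plain[2]

def gap_inter(x):
    """Estimated S=1 vs S=0 difference at covariate value x."""
    return b_inter[2] + b_inter[3] * x

for x0 in (30.0, 50.0):
    print(f"x={x0}: no-interaction gap={gap_plain(x0):.2f}, "
          f"interaction gap={gap_inter(x0):.2f}")
```

The no-interaction fit reports one gap for every `x0`, while the interaction fit reports a different gap at each point. This is why the difference has to be evaluated at user-specified points of interest once interaction terms are included.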