Ten simple rules for building and maintaining a responsible data science workflow
Sara Stoudt, Yacine Jernite, Brandeis Marshall, Ben Marwick, Malvika Sharan, Kirstie Whitaker, Valentin Danchev
DOI: https://doi.org/10.1371/journal.pcbi.1012232
2024-07-19
PLoS Computational Biology
Abstract: The beginning of a research project is often full of energy and promise. At this stage, it can be hard to properly assess the ethical implications of a research project before a team has collaboratively set the overarching goals and decided on its next steps. Issues could start to appear, such as input data that are less representative than the team initially thought, or findings that are overgeneralized into inadvisable recommendations for a vulnerable population. Therefore, it is important to embed checkpoints in the early planning stage so that research teams can seriously reflect on the unintended consequences of their work. Early reflection can happen while the research team conducts a literature review as part of their preliminary work to learn about the current state of the art and consider how to position their new idea. As the team reads about other projects that have approached a problem similar to the one they are interested in solving, they could be prompted to categorize past projects by the types of negative impacts they have the potential to impose. For example, do privacy concerns arise from an effort to make input data openly available, or does a predictive algorithm applied to human decisions show performance bias that could lead to unfair outcomes for different people? Beyond the academic literature, what disaster stories has the team heard related to the type of data or approach it is considering, perhaps in the news or collected in books like Algorithms of Oppression: How Search Engines Reinforce Racism [29], Race After Technology [30], and Weapons of Math Destruction [4]? Could the described incidents reappear in the proposed project? Research teams can even learn from the entertainment that their members consume. What dystopian future could result from the work?
Experts in data-related and technology fields have even started to bridge the gap between traditional dystopian worlds and specialized scenarios that are informed by the work they do (e.g., [30–32]). As Skirpan and Yeh warn: "with the blinding light of promise glistening, we must be careful not to miss that there are consequences and dangers" [33]. They advocate for a speculative analysis of the field, mixing ideas from formal risk analysis with those of speculative fiction. Similarly, Gaskins advocates for taking inspiration from Afrofuturism creatives and speculative designers to question algorithms [34]. If the algorithm is designed for use by an "average" user, how do atypical users fare? Are predictive algorithms equally accurate for data points from every demographic group? This idea of constant questioning, even from the beginning, is emphasized in Marshall's book, Data Conscience: Algorithmic Siege on our Humanity, which connects the principles of data, technology, and human ethics and outlines key motivating questions to consider [2].
Disasters aren't the only thing to think about; seemingly innocuous decisions can have biases baked in and lead to unintended consequences. For example, suppose you are in charge of collecting data to inform a policy change about the maximum building height allowed in a neighborhood. You may look at the heights of buildings that are listed in prior permits over time, keep track of how limits in the legislation have changed, and send a survey about preferences to people who live in the neighborhood. So far, this scenario might seem pretty straightforward and low risk for ethical complications. However, let's dig in a bit more. What about the people who cannot afford to live in the neighborhood but commute there for work? The commute may take up considerable free time, so they like to take advantage of the green space near their office building to eat their lunch and get some fresh air. Higher buildings might block the sun and make that space inhospitable for plants, wildlife, and lunch eaters alike. You won't know about these people's preferences, though, because you only surveyed people who live in the area. Let's also consider who the policy makers have been in this area. Are their demographics and stances reflective of the population? Who has been pushed out of this neighborhood by previous changes in policy, and how might that affect what you see in the building height data? By making your decision solely based on information that you have access to in the historical record, you may be perpetuating historical biases.
Going through an expansive process of reading, reflection, and questioning, in scenarios big and small, not only helps avoid unintended consequences in the future but can also make the intended audience or user base that the team is responsible to more concrete early on.
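One of the questions raised above, whether a predictive algorithm is just as accurate for data points representing all demographics, can be turned into a quick early checkpoint. The following is a minimal sketch in Python, not a method prescribed by the authors: the function names `accuracy_by_group` and `accuracy_gap`, the group labels, and the toy data are all hypothetical, and accuracy is only one of many possible performance measures a team might compare.

```python
from collections import defaultdict


def accuracy_by_group(y_true, y_pred, groups):
    """Return {group: fraction of correct predictions} for each group label.

    Hypothetical helper for illustration; inputs are equal-length sequences.
    """
    if not (len(y_true) == len(y_pred) == len(groups)):
        raise ValueError("y_true, y_pred, and groups must have the same length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    return {group: correct[group] / total[group] for group in total}


def accuracy_gap(per_group):
    """Difference between the best-served and worst-served group."""
    return max(per_group.values()) - min(per_group.values())


# Toy data (made up for this sketch): two groups of four records each.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
per_group = accuracy_by_group(y_true, y_pred, groups)

print(f"overall accuracy: {overall:.2f}")
for group, acc in sorted(per_group.items()):
    print(f"group {group}: accuracy {acc:.2f}")
print(f"gap between groups: {accuracy_gap(per_group):.2f}")
```

In this toy example the overall accuracy looks reasonable while one group is served much worse than the other, which is the kind of pattern an aggregate metric can hide. A check like this does not answer who is missing from the data altogether, such as the commuters in the building height scenario, so it complements rather than replaces the broader reflection described in this section.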
biochemical research methods, mathematical & computational biology