24 Prediction and Ethics of Data Science
Summary
Data that allows one or more variables to predict another variable does not show that the predictor variables cause the response variable to change.
This is summarized in the maxim: “Correlation is not causation”.
One reason is that a confounding variable affecting both the predictors and the response can produce a correlation between them without any causal link.
Knowing that this happens is important for those trying to use data in life-affecting decisions such as approving bank loans or deciding parole for inmates.
When making decisions, it is important to be aware of the data surrounding that decision. Before starting a new business at a particular location, it would be wise to know the cost of building and the foot traffic at the location.
Still, it is also important to be aware of the limitations of data.
24.1 Correlation is not causation
Given a dataset, it is easy to determine whether two numerical variables are correlated. For instance, in the mtcars dataset, each observation is a particular model of car. The mpg variable is the miles per gallon achieved by the car, and gear is the number of forward gears in the transmission. The correlation between these two variables can be found with the cor function.
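The call itself is not shown here. A minimal sketch in base R that produces output of this shape, assuming the column name r_mpg_gear seen in the output (the variable name result is only illustrative):

```r
# mtcars ships with base R (package datasets), so no extra packages are needed.
# Wrapping the value in a data frame gives it the column label r_mpg_gear.
result <- data.frame(r_mpg_gear = cor(mtcars$mpg, mtcars$gear))
result
```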
## r_mpg_gear
## 1 0.4802848
This correlation is positive, indicating that gear would be useful for predicting the mpg value. But correlation is symmetric.
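One way to check the symmetry, sketched in base R, is to compute the correlation with the arguments in both orders. The column names are copied from the output that follows; the variable name both is only illustrative.

```r
# Swapping the two arguments of cor() leaves the value unchanged.
both <- data.frame(
  r_mpg_gear = cor(mtcars$mpg, mtcars$gear),
  r_gear_mpg = cor(mtcars$gear, mtcars$mpg)
)
both
```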
## r_mpg_gear r_gear_mpg
## 1 0.4802848 0.4802848
So it could also be said that mpg could be used to predict the gear of the car. This is also correct.
Correlations near 1 or near -1 indicate that one variable can be used to help predict the value of another variable.
What is not true is that one of these variables must be causing the value of the other. Suppose that the gear value were in some way causing the value of the mpg variable. Then it is not true that the mpg variable is causing the gear value. But the correlations are the same!
Correlations near 1 or near -1 do not show that one variable value is causing another variable value to be high or low.
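On the other hand, a lack of correlation is evidence against a straight-line relationship between the two variables, but it does not by itself show that one value has no effect on the other.
Correlations near zero indicate that there is no linear association between two variables; they do not rule out a nonlinear effect of one on the other.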
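A caveat worth keeping in mind is that cor measures only straight-line association. The following sketch uses made-up values (x and y are illustrative, not taken from any dataset in the text) in which y is completely determined by x, yet the correlation is essentially zero.

```r
# x takes values symmetric about zero, and y depends only on x.
x <- seq(-3, 3, by = 0.5)
y <- x^2

# The Pearson correlation is zero up to rounding error,
# even though changing x changes y.
r_x_y <- cor(x, y)
r_x_y
```

So a correlation near zero is best read as "no linear trend" rather than as proof of "no effect".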
24.2 Confounding variables
Suppose a study is run in which elementary school children have two measurements taken. First, the size of their feet is recorded, and second, they take a spelling test.
Then there will be a positive correlation between the spelling ability of the children and the size of their feet. However, this is due to the existence of a confounding variable, age, which affects both the size of their feet and their spelling ability.
A confounding variable is one that affects both of two correlated variables, producing an association between them even when neither causes the other.
Unfortunately, it is not always easy to see what the confounding variable is for a particular dataset. So anyone hypothesizing that one variable affects another has to be aware of possible confounding variables and try to measure them whenever possible.
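The foot size and spelling example can be imitated with simulated data. The sketch below is only an illustration: the sample size, the age range, and every coefficient are invented, and the variable names (age, foot_cm, spelling) are hypothetical. Foot size and spelling score are each generated from age plus independent noise, so neither affects the other.

```r
set.seed(24)  # arbitrary seed so the sketch is reproducible

n <- 500
# Hypothetical elementary school children aged 5 to 11.
age <- sample(5:11, n, replace = TRUE)

# Both measurements depend on age plus independent noise;
# neither depends on the other. All coefficients are invented.
foot_cm  <- 14 + 0.9 * age + rnorm(n, sd = 1)
spelling <- 10 + 8 * age + rnorm(n, sd = 10)

# Raw correlation: clearly positive, because age drives both variables.
r_raw <- cor(foot_cm, spelling)

# Correlation after removing the linear effect of age from each variable
# (a partial correlation): close to zero.
r_given_age <- cor(resid(lm(foot_cm ~ age)), resid(lm(spelling ~ age)))

c(r_raw = r_raw, r_given_age = r_given_age)
```

Once the confounder has been measured, adjusting for it in this way makes the apparent link between foot size and spelling largely disappear.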
24.3 Looking for causation
So is it ever possible to show that one variable causes another to change? The gold standard for this type of evidence is the randomized experiment.
In a randomized experiment, individuals are assigned at random to subgroups that receive different treatments.
Because the choice of treatment does not depend on other factors, the results of the treatment should on average not depend on those other factors. Only the change resulting from the treatment comes through.
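A small simulation can illustrate why random assignment matters. This is a sketch under invented assumptions: the true treatment effect of 2, the hidden factor called other, and all coefficients are made up for illustration. In one version individuals choose the treatment themselves in a way that depends on the hidden factor; in the other the treatment is assigned by a fair coin flip.

```r
set.seed(30)  # arbitrary seed so the sketch is reproducible

n <- 10000
true_effect <- 2      # invented effect of the treatment on the outcome
other <- rnorm(n)     # a factor that also affects the outcome

# Self-selected treatment: individuals with a high 'other' value
# are more likely to take the treatment.
chosen <- rbinom(n, 1, plogis(2 * other))
y_chosen <- 5 + true_effect * chosen + 3 * other + rnorm(n)

# Randomized treatment: a fair coin flip, unrelated to 'other'.
assigned <- rbinom(n, 1, 0.5)
y_assigned <- 5 + true_effect * assigned + 3 * other + rnorm(n)

# Difference in mean outcome between treated and untreated individuals.
diff_chosen <- mean(y_chosen[chosen == 1]) - mean(y_chosen[chosen == 0])
diff_assigned <- mean(y_assigned[assigned == 1]) - mean(y_assigned[assigned == 0])

c(true_effect = true_effect,
  self_selected = diff_chosen,
  randomized = diff_assigned)
```

Under these assumptions the self-selected comparison overstates the effect, because the treated group also tends to have higher values of the other factor, while the randomized comparison lands close to the true effect.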
Unfortunately, for some questions, such as whether or not smoking causes cancer, it would be highly unethical to break people into two groups, one of which did not smoke and the other of which was required to smoke a pack a day.
The FDA required warnings on cigarettes to include statements like “Smoking causes head and neck cancer” and “Tobacco smoke causes fatal lung disease in nonsmokers.” Correlation data alone is insufficient to support statements like these.
Instead, first a physical chain of cause and effect was established to answer the question of how cigarettes caused cancer. Then animal testing (where randomized experiments can be used) was used to establish that the events in the physical chain did actually happen in living creatures. It was painstaking work which took many years to complete.
24.4 An example: parole boards
Consider an inmate in California who is serving a life sentence with the possibility of parole. Under California law (Cal. Code Regs., tit. 15, § 2281) a parole board is allowed to consider “all relevant, reliable information available to the panel”.
A parole board does not wish to release applicants who are likely to commit further offenses. But this raises the natural question of whether predictor variables should be used in such deliberations.
After all, the variables used are often merely predictive, and not causal. So does the presence of a bad predictor value, which is often outside the inmate's control, mean that someone should be denied their freedom?
These are the types of questions that need to be considered when thinking about issues of ethics in data science. One can build a model, but should the model be used for a particular decision?
24.5 Ethics
There are many aspects to consider when dealing with an ethical question in data science. Perhaps the two most important are thinking about which ethical ideas are coming into play, and then how the practice of data science that you are undertaking can assist in meeting your ethical ideals.
Most of this course has been about the practice of data science: how to maintain data in tidy form for easy retrieval, how to model and visualize data, and how to communicate results effectively. None of these address the questions of when you should refrain from doing a particular analysis or why the results might be problematic.
Ethical goals in data science include (but are not limited to!) the following.
Treating individuals fairly based on the data that surround their situation.
Getting informed consent for data collection whenever possible.
Understanding unintended consequences of data use in a model.
Keeping data private and secure.
Determining what data is reliable and correct.
24.6 Resources
The Association for Computing Machinery (ACM) lays out a set of principles that those doing work in computing should follow at https://www.acm.org/code-of-ethics.
The American Statistical Association similarly presents a list of best practices for statisticians at https://www.amstat.org/asa/files/pdfs/EthicalGuidelines.pdf.
The Institute of Electrical and Electronics Engineers (IEEE) also has guidelines for its members at https://www.ieee.org/about/compliance.html.
The UK has created a Data Science checklist, a simple tool for those thinking about the ethics of their particular project. It can be found at https://www.gov.uk/government/publications/data-ethics-framework/data-ethics-framework-2020.
Loukides, Mason, and Patil have created a simple thirteen-point checklist for those considering a new application or product in their book Ethics and Data Science. It can be downloaded from GitHub at https://github.com/oreillymedia/ethics-datascience.