9 Common Mistakes That Lead To Data Bias
Data scientists spend a lot of time with data, which by itself is neutral. It would seem to follow that answers gleaned from the data are neutral too. But even if data is neutral, the way we collect, select, and interpret it is often filled with bias that can skew our outcomes. Let’s examine some common ways bias can creep into your beautifully built programs.
What Is Bias?
We know our emotional response to bias, but what does it really mean in Machine Learning? Bias is anything that throws your results systematically off the mark from the real picture. It’s a loaded term that can include things like stereotypes and value distortions all in the same definition.
Eliminating forms of bias is critical, not just because stereotyping can be dangerous for specific populations but because anything that skews our picture of reality can be just as detrimental. Algorithms don’t think for themselves. The tools are only as good as we make them.
Types of Data Mistakes That Lead To Bias
Admitting you have a problem is the first step to a solution, right? We all fall prey to illogical thinking, but it’s the unconscious bias that’s dangerous. At one point or another in our human evolution, the types of biases we experience served a purpose (survival, most likely), but now that we have the actual data at our fingertips and the ability to process it all, we’ve got to move beyond those evolutionary tics.
The underlying dataset itself could be skewing your results, a problem known as sampling bias. Your training set must reflect the realities of the environment in which your ML model will run. If you’re training for facial recognition and you only show one particular group (white females, for example), your algorithm could produce faulty outcomes when it encounters something different (black males). The same applies if you train your self-driving cars on datasets that contain only daytime footage.
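One way to catch this early is to compare the makeup of the training set with the mix you expect in deployment. The snippet below is a minimal sketch of that idea, not a standard tool: the function names, the 10-point tolerance, and the day/night figures are all assumptions made up for illustration.

```python
from collections import Counter

def group_shares(labels):
    """Return each group's share of the records as a fraction of the total."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {group: count / total for group, count in counts.items()}

def representation_gaps(train_labels, target_shares, tolerance=0.10):
    """Find groups whose training share differs from the expected
    deployment share by more than `tolerance` (absolute difference)."""
    train_shares = group_shares(train_labels)
    gaps = {}
    for group, expected in target_shares.items():
        observed = train_shares.get(group, 0.0)
        if abs(observed - expected) > tolerance:
            gaps[group] = (observed, expected)
    return gaps

# Hypothetical training set: lighting condition recorded for each driving clip.
train_conditions = ["day"] * 95 + ["night"] * 5
# Hypothetical mix the deployed model is expected to see.
expected_conditions = {"day": 0.6, "night": 0.4}

gaps = representation_gaps(train_conditions, expected_conditions)
for group, (observed, expected) in gaps.items():
    print(f"{group}: {observed:.0%} of training data, {expected:.0%} expected")
```

A check like this only covers the attributes you thought to record, so it complements a careful look at how the data was gathered rather than replacing it.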
We’re naturally less critical of data that confirms rather than disproves what we believe. Confirmation bias causes you to interpret results in ways that uphold what your organization already believes to be true and often appears as picking and choosing which results to feature.
Time Will Tell
How much data do you need to be comfortable with a conclusion? Gathering more evidence should increase our confidence that the results are accurate, but sometimes we run with the preliminary results instead. Cutting off evidence gathering prematurely, just as results begin to skew another way, may help in the short term but produces poor results in the long term.
Measurement bias, also called systematic value distortion, reflects a lack of accuracy in the measurement instrument itself. This bias goes beyond a lack of precision or an abundance of noise. It’s a consistent inaccuracy in the training data, so the data doesn’t reflect the real-world environment and a model trained on it will fail in the long run.
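The difference between noise and a consistent inaccuracy can be seen in a small simulation. This is a sketch under made-up assumptions: the true value, the sensor offset, and the noise level are hypothetical numbers, and the sensor is modeled as true value plus constant offset plus Gaussian noise.

```python
import random
import statistics

random.seed(0)

TRUE_VALUE = 20.0   # hypothetical true quantity, e.g. a temperature in degrees Celsius
OFFSET = 1.5        # hypothetical systematic error of a miscalibrated sensor
NOISE_SD = 2.0      # hypothetical random noise shared by both sensors

def read_sensor(offset, n):
    """Simulate n readings: true value + constant offset + random noise."""
    return [TRUE_VALUE + offset + random.gauss(0, NOISE_SD) for _ in range(n)]

# sample size -> (error of the noisy-only mean, error of the biased mean)
errors = {}
for n in (10, 1_000, 100_000):
    noisy_error = statistics.fmean(read_sensor(0.0, n)) - TRUE_VALUE
    biased_error = statistics.fmean(read_sensor(OFFSET, n)) - TRUE_VALUE
    errors[n] = (noisy_error, biased_error)
    print(f"n={n:>6}  noisy-only error={noisy_error:+.3f}  biased error={biased_error:+.3f}")
```

As the sample grows, the error of the noisy-only sensor shrinks toward zero, while the error of the miscalibrated sensor settles near the offset. Collecting more data averages out noise but does nothing for a systematic distortion.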
Simpson’s paradox is a baffling case of statistical bias in which the results for individual groups and the results for the aggregate suggest different conclusions. One very famous example is the Berkeley graduate admissions case of 1973. The school was afraid of being sued for admission prejudice against women, so it launched an internal investigation. The aggregate statistics suggested that women were admitted at a lower rate than men, but a department-by-department breakdown showed a slight bias in favor of women. The answer turned out to be a matter of where the majority of men and women were applying: women tended to apply to more competitive departments with lower admission rates, which dragged down their overall rate.
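The arithmetic behind this effect, known as Simpson’s paradox, is easy to reproduce. The sketch below uses two invented departments with hypothetical figures chosen to show the reversal; they are not the actual Berkeley numbers.

```python
# Hypothetical figures chosen to illustrate the effect; not the Berkeley data.
# department -> gender -> (applicants, admitted)
applications = {
    "A": {"men": (800, 480), "women": (100, 70)},   # less competitive department
    "B": {"men": (200, 40), "women": (900, 225)},   # more competitive department
}

def rate(applicants, admitted):
    """Admission rate as a fraction of applicants."""
    return admitted / applicants

# Per-department admission rates.
by_department = {}
for dept, by_gender in applications.items():
    for gender, (applicants, admitted) in by_gender.items():
        by_department[(dept, gender)] = rate(applicants, admitted)
        print(f"Dept {dept} {gender}: {by_department[(dept, gender)]:.1%}")

# Aggregate rates: sum applicants and admissions across departments first.
totals = {}
for by_gender in applications.values():
    for gender, (applicants, admitted) in by_gender.items():
        seen_applicants, seen_admitted = totals.get(gender, (0, 0))
        totals[gender] = (seen_applicants + applicants, seen_admitted + admitted)

overall = {gender: rate(*counts) for gender, counts in totals.items()}
for gender, value in overall.items():
    print(f"Overall {gender}: {value:.1%}")
```

In this made-up data, women have the higher admission rate in each department, yet the lower rate overall, because most of them applied to the department that admits far fewer applicants.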
Overfitting or Underfitting
Overfitting happens when an overly complicated model learns the noise in the training data instead of the underlying pattern. If we’re eager to show off what we can do, we’re in danger of overcomplicating things. However, the opposite is also true. Underfitting happens when a model is too simplistic to capture any real insight from the data.
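Comparing error on the training data with error on held-out data is a common way to tell the two apart. The following sketch assumes hypothetical data (a straight line plus noise) and three deliberately simple stand-in models: a constant for underfitting, a least-squares line, and a memorizer (1-nearest neighbor) for overfitting.

```python
import random

random.seed(1)

def make_data(n):
    """Hypothetical data: a straight line (y = 2x + 1) plus Gaussian noise."""
    xs = [random.uniform(0, 10) for _ in range(n)]
    ys = [2 * x + 1 + random.gauss(0, 2) for x in xs]
    return xs, ys

def fit_constant(xs, ys):
    """Underfit: ignore x and always predict the mean of y."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """Reasonable fit: ordinary least-squares straight line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

def fit_memorizer(xs, ys):
    """Overfit: return the y of the nearest training point (1-nearest neighbor)."""
    points = list(zip(xs, ys))
    return lambda x: min(points, key=lambda p: abs(p[0] - x))[1]

def mse(model, xs, ys):
    """Mean squared error of a model on the given points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

train_x, train_y = make_data(200)
test_x, test_y = make_data(1000)

# model name -> (training error, held-out error)
results = {}
for name, fit in [("constant", fit_constant), ("line", fit_line),
                  ("memorizer", fit_memorizer)]:
    model = fit(train_x, train_y)
    results[name] = (mse(model, train_x, train_y), mse(model, test_x, test_y))
    print(f"{name:>9}: train MSE={results[name][0]:.2f}  test MSE={results[name][1]:.2f}")
```

The memorizer reproduces its training data perfectly but does worse on new data because it has stored the noise. The constant model does poorly on both sets because it ignores the trend entirely.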
Confounding Variables (Correlation bias)
You’ve seen those funny correlation statistics between two ridiculous things, but this can be a significant problem in datasets. Machines are literal, so each correlation could be significant for them when the confounding variable is left out or unidentified. Your results could show a relationship where there isn’t any.
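A simulation makes the mechanism concrete. In this sketch, the variable names and numbers are hypothetical: a single hidden factor (temperature) drives two series that have no effect on each other. Removing the linear effect of the confounder from both series and correlating what remains is one simple way to control for it.

```python
import random

random.seed(2)

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

def residuals(target, predictor):
    """What is left of `target` after removing a least-squares linear fit on `predictor`."""
    n = len(target)
    mean_t, mean_p = sum(target) / n, sum(predictor) / n
    slope = (sum((p - mean_p) * (t - mean_t) for p, t in zip(predictor, target))
             / sum((p - mean_p) ** 2 for p in predictor))
    return [(t - mean_t) - slope * (p - mean_p) for t, p in zip(target, predictor)]

# Hypothetical setup: temperature drives both series; neither affects the other.
n = 2000
temperature = [random.gauss(0, 1) for _ in range(n)]
ice_cream_sales = [t + random.gauss(0, 0.5) for t in temperature]
sunburn_cases = [t + random.gauss(0, 0.5) for t in temperature]

raw = pearson(ice_cream_sales, sunburn_cases)
controlled = pearson(residuals(ice_cream_sales, temperature),
                     residuals(sunburn_cases, temperature))
print(f"raw correlation: {raw:.2f}")
print(f"correlation after controlling for temperature: {controlled:.2f}")
```

The raw correlation is strong, but it largely disappears once the confounder is accounted for. A model that never sees the temperature column has no way to make that correction on its own.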
You’ve got a wide variety in your dataset, but a secondary bias may still exist. Machines are literal and could read correlations between the actions happening in images and the fundamental principles they’re learning to recognize. For example, datasets designed to identify men versus women could have plenty of variety in the types of men and women, but overwhelmingly show women doing some sort of housework and men doing nothing. Machines learn a fundamental part of “women” is housework and a fundamental part of “men” is doing nothing, two erroneous stereotypes.
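A quick audit of how often labels appear together can surface this kind of secondary bias before training. The sketch below assumes a hypothetical annotation format of (person label, activity label) pairs with invented counts; real datasets will need their own parsing.

```python
from collections import Counter

# Hypothetical image annotations: (person label, activity label).
annotations = (
    [("woman", "housework")] * 40 + [("woman", "other")] * 10 +
    [("man", "housework")] * 5 + [("man", "other")] * 45
)

def activity_rates(pairs):
    """For each person label, the fraction of its images showing each activity."""
    pair_counts = Counter(pairs)
    person_counts = Counter(person for person, _ in pairs)
    return {
        (person, activity): count / person_counts[person]
        for (person, activity), count in pair_counts.items()
    }

rates = activity_rates(annotations)
for (person, activity), share in sorted(rates.items()):
    print(f"{person:>5} / {activity:<9}: {share:.0%}")
```

Both person labels are equally represented in this made-up set, yet the activity shares differ sharply between them, which is exactly the pattern a model can pick up as if it were part of the label.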
If you start your data collection with an assumption already in mind, you may be subconsciously skewing your results. You’ll collect the wrong data, ask the wrong questions, select the wrong variables or metrics, even the wrong algorithms. Examine your pre-existing “hunches” or beliefs about the outcome before you get started.
Dealing With Bias In Machine Learning
We have to understand how bias works to begin mitigating its effect on our outcomes. Relying on “rational data” puts your organization at risk because all ML/AI models have some degree of human influence.
Data selection is usually the culprit, so we need to be mindful of how we select our training sets and control for the types of biases that tend to creep in. Some of this ability comes only with real-world experience, but data scientists can start from the very beginning by scrutinizing results and controlling for outside factors that can cause bias. Without this care, algorithms will only amplify the bias.