What is the P-Value? Hypothesis Testing and its Ties to Machine Learning
In 1925, the statistician Ronald Fisher encountered an interesting problem. At a party, a colleague named Muriel Bristol made a curious claim: just by tasting a cup of tea, she could tell whether the milk or the tea had been poured into the cup first. Intrigued, Fisher brought out 8 cups of tea, 4 with the milk poured first and 4 with the tea poured first, and asked her to sip each one and identify which was which.
Remarkably, she identified them all correctly. The probability of this happening by chance is 1 in 70, or about 0.0143. In other words, if she were truly guessing at random and happened to get every cup right, there was only a 1.4% chance of that occurring.
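If you want to verify that 1-in-70 figure yourself, here is a minimal Python sketch. A purely guessing taster must pick which 4 of the 8 cups had the milk poured first, and only 1 of the C(8, 4) = 70 possible selections is entirely correct.

```python
from math import comb

# A guessing taster picks which 4 of the 8 cups had milk poured first.
# There are C(8, 4) = 70 ways to choose, and only 1 of them is fully correct.
p_value = 1 / comb(8, 4)

print(p_value)  # 0.014285714285714285, roughly a 1.4% chance
```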
This probability of 0.014, or 1.4%, is what we call a p-value. Ponder that for a moment. What does it tell you? Consider that 1.4% is quite low, and it measures the probability that this outcome occurred by random chance. Because it is so low, we are less inclined to believe it was a coincidence (which we frame as the null hypothesis) and more inclined to believe she has a genuine talent (the alternative hypothesis).
Fisher famously proposed a p-value threshold of .05 or less as statistically significant, meaning the probability is low enough that we reject random chance as the explanation (the null hypothesis) and entertain another explanation (the alternative hypothesis).
These ideas extend critically to much of scientific research. For example, if we were testing a new drug meant to shorten the duration of the common cold, we could use normal distributions to test whether the treatment group's improvement might be due to random chance rather than any effect of the drug. The further the treatment group's recovery time falls into the tail of the distribution, the lower the p-value gets, and the more inclined we are to believe the drug is actually working.
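As a rough sketch of how such a test might look in code, suppose (hypothetically) that colds last 18 days on average and we observe the recovery times of a small treatment group; a one-sided, one-sample t-test with SciPy then gives the p-value for seeing an improvement this large by chance alone. The numbers below are invented purely for illustration.

```python
from scipy import stats

# Hypothetical data: colds last 18 days on average in the general population,
# and these are recovery times (in days) for 12 patients who took the drug.
recovery_days = [15, 16, 14, 17, 18, 14, 15, 16, 13, 17, 15, 16]
population_mean = 18

# One-sided test: how likely is a sample mean this low if the drug does nothing?
result = stats.ttest_1samp(recovery_days, population_mean, alternative="less")

print(result.pvalue)  # a small p-value makes "just coincidence" harder to believe
```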
We can also calculate the p-value for a linear correlation. The points below look like they follow a line pretty closely. How likely is it that those points randomly arranged themselves to just so happen to fall near a line?
The more points you have, and the more closely they follow a line, the lower the p-value will be. The points above actually have a linear correlation p-value of .000005976, which is remarkably low! We are inclined to believe the correlation between the x-variable and y-variable is very real.
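A p-value like this can be computed with SciPy's linregress, which tests the null hypothesis that the slope of the fitted line is zero. The points below are invented for illustration and are not the ones from the chart above.

```python
from scipy import stats

# Made-up points that roughly follow a line, for illustration only
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 4.3, 5.9, 8.2, 9.8, 12.1, 14.2, 15.9, 18.3, 20.1]

result = stats.linregress(x, y)

print(result.rvalue)   # correlation coefficient, close to 1 here
print(result.pvalue)   # p-value for the null hypothesis that the slope is zero
```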
Notice how there is no absolute certainty. A drug trial may have only a 3.8% chance of being a coincidental improvement, and Muriel had only a 1.4% chance of identifying every cup correctly by guessing. Some data points have only a 0.0005976% chance of having arranged themselves randomly around a line. But we never reach a 0% chance of any of these being random. As Jim Carrey’s character said in the movie Dumb and Dumber, “So you’re telling me there’s a chance!”
Hypothesis Testing and Machine Learning
Now here’s the kicker: when you do machine learning (including that simple linear regression above), you are in fact searching for hypotheses that explain relationships in the data. When you have thousands or millions of variables, and many types of models to choose from, you are effectively doing hypothesis testing backward, in a practice called data mining. This can indeed be powerful, as it allows deep learning to identify pixel patterns that correlate with the label “cow” in an image. But things can still go awry, such as when the model finds correlations between the label “cow” and green fields rather than the cows themselves. That can cause empty fields to be labeled “cow” when predicting on new images.
On top of that, machine learning can easily become a tool for p-hacking, where we torture the data until we find patterns (and thus low p-values) that are coincidental rather than meaningful. When you have millions of variables and many model hyperparameters to tune, it is not hard to find extremely low p-values and coincidental patterns. This can be highly problematic in research environments, where confirmation bias can push us toward the findings we want to see.
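To see how easily this happens, here is a small, purely illustrative sketch: generate a target variable that is pure noise, test 1,000 noise features against it, and keep whichever p-value is lowest. A “significant” result routinely appears even though no real relationship exists.

```python
import numpy as np
from scipy import stats

# Pure noise: a target and 1,000 candidate features with no real relationship
rng = np.random.default_rng(seed=7)
target = rng.normal(size=50)
features = rng.normal(size=(1000, 50))

# Test every feature against the target and keep the smallest p-value
best_p = min(stats.pearsonr(feature, target)[1] for feature in features)

print(best_p)  # often far below .05, despite everything being random noise
```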
About the Author/ODSC West 2024 Speaker:
Thomas Nield is a consultant, writer, and instructor. After a decade in the airline industry, he authored two books for O’Reilly Media and regularly teaches classes on artificial intelligence, statistics, machine learning, and optimization algorithms. Currently, he is teaching at the University of Southern California and defining system safety approaches with AI for clients. In another endeavor, Yawman Flight, he is inventing a new category of flight simulation controllers and bringing them to market. He enjoys making technical content relatable and relevant to those unfamiliar with or intimidated by it.