Causal Inference: An Indispensable Set of Techniques for Your Data Science Toolkit
Data scientists often get asked questions of the form “Does X drive Y?”: (1) Did recent PR coverage drive sign-ups? (2) Does customer support increase sales? (3) Did improving the recommendation model drive revenue? To support company stakeholders, every data scientist needs techniques that can answer questions like these. Such questions are about causality, and they are answered with causal inference.
Typically, to answer a “Does X drive Y?” question, data scientists start with the raw correlation. They plot the two variables in a scatterplot to see whether X is associated with Y, and they test that association using hypothesis testing and/or regression. However, as we know from statistics, correlation is not causation: confounding variables that influence both X and Y can mask, exaggerate, or even create an apparent relationship between them.
Here is an example of this issue from our work at Coursera. We have a mobile app that some learners choose to download and use, and we might wonder whether using the mobile app leads to increased learner retention from month to month. If so, we should invest in encouraging every learner to download and use the mobile app as a way to drive engagement.
Historically, we see that using the mobile app correlates with a 5% increase in month-to-month retention on the platform, but selection bias is a potential confounder. More intrinsically motivated learners may be both more likely to retain from month to month on Coursera and more likely to use the mobile app. We cannot see, or adjust for, the impact of this confounder in the historical data because we have no easy way to measure a learner’s intrinsic motivation.
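To make the selection-bias concern concrete, here is a minimal simulation sketch. It is not based on Coursera data: the function names, the probabilities, and the assumption that motivation is uniformly distributed are all hypothetical, chosen only for illustration. In this toy world the app has no effect on retention at all, yet app users still retain at a visibly higher rate because motivation drives both behaviors.

```python
import random


def simulate_learners(n=20000, seed=0):
    """Simulate learners whose hidden motivation drives both app use and retention.

    Hypothetical toy model: the app itself has NO causal effect on retention.
    """
    rng = random.Random(seed)
    learners = []
    for _ in range(n):
        motivation = rng.random()  # unobserved confounder in [0, 1)
        uses_app = rng.random() < 0.2 + 0.6 * motivation
        retained = rng.random() < 0.3 + 0.4 * motivation  # depends on motivation only
        learners.append((uses_app, retained))
    return learners


def retention_rate(learners, uses_app):
    """Share of learners retained within the app-user or non-user group."""
    group = [retained for used, retained in learners if used == uses_app]
    return sum(group) / len(group)


learners = simulate_learners()
naive_gap = retention_rate(learners, True) - retention_rate(learners, False)
print(f"Naive retention gap (app users minus non-users): {naive_gap:.3f}")
```

The printed gap is positive even though the simulated causal effect is zero, which is exactly the pattern a raw correlation cannot distinguish from a real effect.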
The most common and easiest way to isolate causal effects is to run an AB test, in which we randomly assign one group of learners a particular experience and another group a different experience (typically the current version). Because of this random assignment, the experience a learner sees is uncorrelated with any potential confounders, and we can cleanly measure the causal effect of the experience on an outcome of interest as the difference between the groups.
AB testing, though, cannot be done in every case. It has four main limitations: (1) quality of the product experience, (2) ethical concerns, (3) customer trust, and (4) technical feasibility.
Going back to our mobile app example, designing an AB test there would require us to limit access to the mobile app to a randomly chosen set of learners. This is not a great product experience, nor is it technically feasible to limit access to the app.
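The difference-between-groups measurement can be sketched as follows. This is an illustrative simulation under assumed numbers (a hypothetical `true_lift` of 0.05 and the same made-up motivation model), not a description of any production experiment framework. Because assignment is random, the hidden motivation is balanced across arms and the simple difference in retention rates recovers the effect.

```python
import math
import random


def run_ab_test(n=20000, true_lift=0.05, seed=1):
    """Randomly assign learners to arms; retention still depends on hidden motivation."""
    rng = random.Random(seed)
    outcomes = {"control": [], "treatment": []}
    for _ in range(n):
        motivation = rng.random()  # unobserved, but balanced across arms on average
        arm = "treatment" if rng.random() < 0.5 else "control"
        p_retain = 0.3 + 0.4 * motivation + (true_lift if arm == "treatment" else 0.0)
        outcomes[arm].append(rng.random() < p_retain)
    return outcomes


def diff_in_means(outcomes):
    """Return the treatment-minus-control retention gap and its standard error."""
    t, c = outcomes["treatment"], outcomes["control"]
    p_t, p_c = sum(t) / len(t), sum(c) / len(c)
    se = math.sqrt(p_t * (1 - p_t) / len(t) + p_c * (1 - p_c) / len(c))
    return p_t - p_c, se


effect, se = diff_in_means(run_ab_test())
print(f"Estimated effect: {effect:.3f} (standard error {se:.3f})")
```

The standard error here uses the usual two-proportion formula; a real analysis would also fix the sample size and the decision rule before looking at the data.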
These cases, where we cannot AB test, are where causal inference comes in. Using causal inference techniques, we can infer causal impacts from historical data without needing to randomize the experience itself. The central idea in causal inference is that we try to control for all possible confounders in historical data and look for natural sources of variation that can split the data into quasi-random groups, mimicking the randomization we would get from AB testing.
As a more specific example related to the impact of mobile app usage, we could email a randomly chosen set of learners about the mobile app experience, encouraging them to download and use it.
This splits the data into two groups: one that got the email and one that did not. If the email nudge is successful, the group that received it will be more likely to use the mobile app. Because the email was randomly assigned, any difference in retention between the two groups is caused by the email. If the email affects retention only through mobile app usage, we can scale that retention difference by the difference in app usage between the groups to isolate the causal effect of mobile app usage on retention. This mimics the random assignment we would get from an AB test, and is known as an instrumental variables approach in a randomized encouragement trial.
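The encouragement design above can be sketched with the simplest instrumental variables estimator, the Wald ratio: the retention gap between emailed and non-emailed learners divided by the app-usage gap between them. Everything in this simulation is an assumption made for illustration (the uptake and retention probabilities, a constant `true_effect` of 0.05, and the email influencing retention only through app usage); it is not the analysis actually run at Coursera.

```python
import random


def simulate_encouragement(n=200000, true_effect=0.05, seed=2):
    """Simulate a randomized encouragement trial with a hidden confounder."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        motivation = rng.random()      # unobserved confounder
        emailed = rng.random() < 0.5   # instrument: randomly assigned
        p_app = 0.1 + 0.4 * motivation + (0.4 if emailed else 0.0)
        uses_app = rng.random() < p_app
        # The email affects retention only through app usage (exclusion restriction).
        p_retain = 0.3 + 0.4 * motivation + (true_effect if uses_app else 0.0)
        retained = rng.random() < p_retain
        rows.append((emailed, uses_app, retained))
    return rows


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def wald_estimate(rows):
    """Retention gap by email status divided by app-usage gap by email status."""
    retention_gap = (mean(r for e, _, r in rows if e)
                     - mean(r for e, _, r in rows if not e))
    usage_gap = (mean(u for e, u, _ in rows if e)
                 - mean(u for e, u, _ in rows if not e))
    return retention_gap / usage_gap


rows = simulate_encouragement()
naive = (mean(r for _, u, r in rows if u)
         - mean(r for _, u, r in rows if not u))
iv = wald_estimate(rows)
print(f"Naive app-user gap: {naive:.3f}, Wald IV estimate: {iv:.3f}")
```

In this toy setup the naive comparison of app users to non-users overstates the effect because motivated learners self-select into the app, while the Wald ratio lands near the simulated effect. The estimate is only as good as its assumptions: the email must really shift app usage, and it must not change retention through any other channel.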
Causal inference techniques like these open the door to extracting maximum value from historical data and make it possible to answer critical business and product questions. Data scientists who master causal inference gain a valuable set of tools that expands the ways they can add value to an organization. These are tools everyone should have in their back pocket.
Want to learn more about key causal inference techniques, including those at the intersection of machine learning and causal inference? Attend ODSC West 2019 and join Vinod’s talk, “An Introduction to Causal Inference in Data Science.”