Causal AI: From Data to Action

ODSC - Open Data Science
5 min read · Jun 13, 2024

At this year’s ODSC East, there was a wealth of content on cutting-edge technologies and emerging techniques in artificial intelligence. While much progress has been made in predictive tasks such as generating the next word in a sequence, the technology accomplishing these tasks rarely provides insight into the causal nature of the data being processed. Dr. Andre Franca addressed this shortfall of prediction-focused AI in a thought-provoking hour-long talk that outlined key principles that should guide data scientists’ approach to causality.

One of the most rewarding aspects of being a data scientist is the exploratory nature of drawing conclusions from data and seeing those insights guide decisions. Sometimes odd conclusions emerge, such as a positive relationship between price and sales or between discount usage and customer churn. Odd conclusions can be especially prevalent in observational studies, which are used when randomized controlled trials are impractical or impossible. Isolating the true causal relationship from the various factors affecting the outcome is at the core of the field of causal inference and a necessary step to ensure that derived insights can be trusted by decision-makers.

To illustrate how difficult it is to determine causal relationships, Dr. Franca described some historical examples. The attempts of early medical researchers to find a cure for scurvy were repeatedly thwarted by false conclusions. Although lemons and oranges had been observed to prevent the disease, expeditions used more readily available limes, which were much less effective. The false conclusion that limes were a suitable treatment rested on the incorrect deduction that acidity was the key factor in lemons’ protective effect. Dr. Franca used this example to explain the concept of a mediator variable: a variable that sits on the causal path between the treatment and the outcome and carries the treatment’s effect.

The false conclusion that limes were a suitable treatment was additionally supported by the negative correlation between lime consumption and the incidence of scurvy over periods when journeys became shorter. The speaker identified the perceived success of limes in reducing scurvy as an example of confounding bias: faster ships reduced the time at sea, a key factor in scurvy incidence, over the same period in which lime consumption rose.

To effectively control for factors that may obscure causal relationships in the data, Dr. Franca explained the importance of starting with one’s current fundamental understanding of the relationships before building models. While objectivity in a data-first approach is valuable, a simpler approach grounded in a foundational understanding of the causal relationships may be more effective. This principle suggests that highly parameterized models optimized for predictive accuracy, which work by capturing complex correlations, are likely to be poor causal tools.

Returning to some of the odd conclusions that can be derived without careful analysis, Dr. Franca distinguished the types of variables that should be controlled. Controlling for confounding variables is necessary in instances where a variable affects both the treatment and the outcome. For example, the conclusion that lower prices led to lower sales can easily be confounded by a competitor’s price changes. In that example, the competitor’s price and other related variables should be included in the analysis, so the price’s effect on sales can be isolated from the other factors.
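The pricing example can be sketched with a small simulation. The following is a minimal illustration rather than anything shown in the talk: the variable names, coefficients, and noise levels are all made-up assumptions, chosen so that the true effect of price on sales is -2 while the competitor’s price pushes both our price and our sales upward. Controlling for the confounder is done here by removing the part of price and sales that the competitor’s price explains, then comparing what is left.

```python
import random

def slope(x, y):
    """Least-squares slope of y on a single regressor x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var = sum((a - mx) ** 2 for a in x)
    return cov / var

def residuals(x, y):
    """What remains of y after removing its linear fit on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = slope(x, y)
    return [(yi - my) - b * (xi - mx) for xi, yi in zip(x, y)]

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 20_000

# Hypothetical data-generating process (all numbers are assumptions):
# the competitor's price drives our price AND our sales.
competitor = [rng.gauss(10, 2) for _ in range(n)]
price = [0.8 * c + rng.gauss(0, 0.5) for c in competitor]
sales = [-2.0 * p + 3.0 * c + rng.gauss(0, 1)
         for p, c in zip(price, competitor)]

# Naive analysis: regress sales on price alone.
naive = slope(price, sales)

# Adjusted analysis: strip the competitor's influence from both
# price and sales, then regress the leftovers on each other.
adjusted = slope(residuals(competitor, price),
                 residuals(competitor, sales))

print(f"naive slope:    {naive:+.2f}")
print(f"adjusted slope: {adjusted:+.2f}  (true effect is -2.00)")
```

In this toy setup the naive slope comes out positive, reproducing the odd "higher price, higher sales" pattern, while the adjusted slope lands close to the true negative effect. The adjustment only works because the confounder was known and measured, which is the point about starting from domain understanding.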

In contrast to confounding variables, collider variables are affected by both the treatment and the outcome but do not influence either of them. Unlike confounders, colliders should not be controlled for, because conditioning on them can induce a spurious association. Dr. Franca gave the example of the relationship between soft skills and technical skills among employees; when the relationship was observed only among employees at a particular company, it was negative because the company only hired people with either great technical or great soft skills. Being hired is the collider here, and looking only at hired employees amounts to conditioning on it. The relationship in the population as a whole was positive, so the spurious conclusion that soft skills are negatively associated with technical skills was induced by evaluating only hired employees.
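The hiring example is easy to reproduce in a simulation. This is a rough sketch under invented assumptions, not data from the talk: both skills share a common "general ability" factor so that they are mildly positively correlated in the population, and the hypothetical company hires anyone whose technical or soft skill score exceeds an arbitrary threshold of 1.5.

```python
import random

def correlation(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 50_000

# Hypothetical population: a shared factor makes the two skills
# positively related (all numbers are assumptions).
ability = [rng.gauss(0, 1) for _ in range(n)]
technical = [0.5 * a + rng.gauss(0, 1) for a in ability]
soft = [0.5 * a + rng.gauss(0, 1) for a in ability]

# Hiring is the collider: it depends on both skills.
hired = [(t, s) for t, s in zip(technical, soft) if t > 1.5 or s > 1.5]
hired_technical = [t for t, _ in hired]
hired_soft = [s for _, s in hired]

population_corr = correlation(technical, soft)
hired_corr = correlation(hired_technical, hired_soft)

print(f"whole population: {population_corr:+.2f}")
print(f"hired only:       {hired_corr:+.2f}")
```

The correlation is positive across the whole population and turns negative once the data is restricted to hired employees, even though nothing about the skills themselves has changed. Filtering a dataset on a collider has the same effect as adding it as a control variable.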

While machine learning and AI tools can help mitigate bias in causal studies, using them for that purpose requires a very different approach from predictive modeling. Dr. Franca explained that an estimator’s ability to generate accurate predictions is not sufficient to make it useful for causal measurement. The speaker recommended that data scientists distinguish estimators that describe joint distributions (i.e., how variables occur together) from those that model the effect of an intervention. Models optimized for predictive accuracy help identify what outcome Y (e.g., sales) to expect when a variable X (e.g., promotion amount) is observed at a given value, but they aren’t the best tool for estimating how Y would change if X were deliberately changed.
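The gap between "Y when X is observed" and "Y when X is changed" can be made concrete with a toy structural model. Everything below is a hypothetical sketch: the holiday variable, the probabilities, and the sales formula are assumptions made for illustration, with the true effect of a promotion fixed at +10 units of sales. The `do_promo` argument stands in for an intervention that sets the promotion regardless of the holiday.

```python
import random

def simulate(n, rng, do_promo=None):
    """Draw n days from a toy model; do_promo forces the promotion value."""
    rows = []
    for _ in range(n):
        holiday = int(rng.random() < 0.3)
        if do_promo is None:
            # Observational regime: promotions mostly run on holidays.
            promo = int(rng.random() < (0.8 if holiday else 0.2))
        else:
            # Interventional regime: promotion is set by us, not by the holiday.
            promo = do_promo
        sales = 100 + 10 * promo + 50 * holiday + rng.gauss(0, 5)
        rows.append((promo, sales))
    return rows

def mean_sales(rows, promo_value=None):
    """Average sales, optionally restricted to rows with a given promo value."""
    vals = [s for p, s in rows if promo_value is None or p == promo_value]
    return sum(vals) / len(vals)

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 50_000

# What a predictive model sees: sales on promo days vs. non-promo days.
observed = simulate(n, rng)
observational_gap = mean_sales(observed, 1) - mean_sales(observed, 0)

# What an intervention does: force the promo on, then off, for everyone.
interventional_gap = (mean_sales(simulate(n, rng, do_promo=1))
                      - mean_sales(simulate(n, rng, do_promo=0)))

print(f"observed gap:       {observational_gap:.1f}")
print(f"interventional gap: {interventional_gap:.1f}  (true effect is 10)")
```

In this setup the observed gap is several times larger than the effect of actually running a promotion, because promo days are disproportionately holidays. A model trained on the observational data could predict sales well and still badly overstate what a new promotion would deliver.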

To best estimate causal effects, Dr. Franca recommended that practitioners avoid blindly applying machine learning models to their data. Many different causal interpretations are possible from the same dataset, so data scientists must use their domain knowledge of how variables are causally related to define the right model. This approach rules out the many relationships that could explain the observed data but are implausible given our human understanding of the world.

Deriving causal insights from high-dimensional data requires the thoughtful approach outlined by the speaker; human domain expertise plays a key role in avoiding spurious results. The talk complemented the more AI-focused content at ODSC East, rounding out the training on offer for any data scientist. The next opportunities to stay current with the latest technology are ODSC Europe on September 5th-6th and ODSC West on October 29th-31st!

Originally posted on OpenDataScience.com
