Incremental Development of PyMC Models
PyMC is a powerful tool for doing Bayesian statistics, but getting started can be intimidating. This article presents an example that I think is a good starting place, and demonstrates a method I use to develop and test models incrementally.
Games like hockey and soccer are well-modeled by a Poisson process, which assumes that a goal can happen at any time, with an average rate of mu goals per game. Under this assumption, we can use a gamma-Poisson model to estimate the rate. Here are the pieces of this model:
- If mu is known, the number of goals in a game follows a Poisson distribution with the parameter mu.
- However, mu is usually unknown. In that case, we can use a gamma distribution to represent possible values of mu and their probabilities.
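Putting the two pieces together, the model looks like this in notation:

mu ~ Gamma(alpha, beta)
goals ~ Poisson(mu)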
If you are not familiar with this model, I think it will become clearer as we implement it. We’ll start by generating random samples from the Poisson and gamma distributions.
Random samples
A gentle way to get started with PyMC is to use it as a random number generator. For example, here’s how we generate 1000 values from a Poisson distribution with mean 2.4:
import pymc3 as pm
mu = 2.4
sample_poisson = pm.Poisson.dist(mu).random(size=1000)
And here’s what the results look like, plotted as a histogram:
If a team scores 2.4 goals per game, on average, this is the distribution of goals they will score over a large number of games. We expect them to score 2 goals most often, 1 or 3 goals less often, and other values less often still.
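If you want to reproduce a figure like this, here is a minimal sketch using Matplotlib and NumPy (the plotting choices are mine; any histogram of the sample will do):

import matplotlib.pyplot as plt
import numpy as np

# put each integer number of goals in its own bin
bins = np.arange(sample_poisson.max() + 2) - 0.5
plt.hist(sample_poisson, bins=bins)
plt.xlabel('Goals per game')
plt.ylabel('Number of games')
plt.show()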
Of course, some teams are better than others, so we expect the goal-scoring rate to vary from one team to the next. I’ll use a gamma distribution to represent the distribution of rates among the teams. Here’s how we can use PyMC to generate a sample from a gamma distribution.
alpha = 4.6
beta = 1.9
sample_gamma = pm.Gamma.dist(alpha, beta).random(size=1000)
I chose the parameters, alpha and beta, so the distribution is consistent with goal-scoring in the National Hockey League (NHL). Here’s what it looks like, using KDE to estimate the distribution of the sample.
For most teams, the goal-scoring rate is between 1 and 4 goals per game.
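To make a figure like this, one option is gaussian_kde from SciPy; here is a minimal sketch (any KDE implementation would work as well):

from scipy.stats import gaussian_kde
import numpy as np
import matplotlib.pyplot as plt

# estimate the density of the sampled rates and plot it as a smooth curve
kde = gaussian_kde(sample_gamma)
xs = np.linspace(0, 8, 101)
plt.plot(xs, kde(xs))
plt.xlabel('Goal-scoring rate (mu)')
plt.ylabel('Density')
plt.show()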
It’s not common to use PyMC as a random number generator, but it’s a good way to start because we can check the results. For many distributions, there are at least two ways to represent the parameters. Generating these samples is a good way to make sure PyMC interprets the parameters as we expect.
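For example, the gamma distribution can be parameterized by shape and rate or by shape and scale. Comparing the sample mean to alpha / beta confirms that PyMC interprets beta as a rate here:

# under the shape/rate parameterization, the mean is alpha / beta
print(sample_gamma.mean(), alpha / beta)    # both should be close to 2.4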
Making a PyMC model
A PyMC model is an object that represents distributions and the connections between them. To construct the model, we instantiate pm.Model and then use a with statement to instantiate the distributions. Here’s what it looks like for the gamma-Poisson model.
with pm.Model() as model:
    mu = pm.Gamma('mu', alpha, beta)
    goals = pm.Poisson('goals', mu)
The first line creates the model, the second line creates a gamma distribution with the given parameters, and the third line creates a Poisson distribution. Notice that the parameter of the Poisson distribution is mu, which is the object that represents the gamma distribution. That establishes the relationship between the distributions, which we can represent with a directed graph:
This figure indicates that the number of goals follows a Poisson distribution with the parameter mu, and mu follows a gamma distribution.
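PyMC can draw this graph for us; assuming Graphviz is installed, it looks like this:

# render the model as a directed graph (requires the graphviz package)
pm.model_to_graphviz(model)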
The prior distribution
Soon we will use this model for inference; that is, we will use an observed number of goals to estimate the value of mu. But first, we’ll run the model forward; that is, we will generate a sample of goals based on our assumptions about mu.
The first step is to use sample_prior_predictive, like this:
with model:
    trace = pm.sample_prior_predictive(1000)
Because we run sample_prior_predictive inside the with statement, it has access to the model. It generates samples of mu and goals, and stores them in an object called a trace. We can extract the samples of mu from the trace like this:
sample_prior = trace['mu']
And here’s what it looks like:
If that looks familiar, there’s a reason; we already generated a sample from this distribution. But there’s a difference now: in the context of the model, this is the prior distribution, which represents our belief about mu before we see the data. In the hockey example, this is what we believe about the goal-scoring rate of a randomly chosen team.
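As a sanity check, we can compare this sample to the one we drew directly from the gamma distribution; their summary statistics should be close:

# both samples come from Gamma(alpha, beta), so their means and standard deviations should agree
print(sample_gamma.mean(), sample_prior.mean())
print(sample_gamma.std(), sample_prior.std())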
The prior predictive distribution
sample_prior_predictive also generates a sample from the distribution of goals scored, which we can extract from the trace like this:
sample_prior_pred = trace['goals']
And here’s what it looks like:
This histogram shows the number of goals we expect a randomly chosen team to score in a single game. It is wider than the histogram we generated with a known value of mu because it represents two sources of uncertainty: we don’t know what mu is, and even if we did, we would not know how many goals would be scored.
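We can confirm that it is wider by comparing the standard deviations of the two samples:

# the prior predictive sample should be more spread out than the sample with fixed mu
print(sample_poisson.std(), sample_prior_pred.std())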
The prior predictive distribution is useful for checking the validity of the model. If you are familiar with NHL hockey, you can check whether this distribution seems reasonable. I think it does, which suggests that the model is reasonable and that we have implemented it in PyMC as intended.
Inference
The next step is to use the model for inference. Suppose we watch a game and one of the teams scores four goals. We can use this data to estimate the goal-scoring rate for that team against their opponent. And we can do it with only two small changes to the model.
For comparison, here’s the code we used to generate the prior predictive distribution:
with pm.Model() as model:
    mu = pm.Gamma('mu', alpha, beta)
    goals = pm.Poisson('goals', mu)
    trace = pm.sample_prior_predictive(1000)
And here’s the version we can use to estimate mu:
with pm.Model() as model:
    mu = pm.Gamma('mu', alpha, beta)
    goals = pm.Poisson('goals', mu, observed=4)
    trace = pm.sample(1000)
There are only two differences:
- In the definition of goals, we provide the data using a keyword argument called observed.
- Instead of calling sample_prior_predictive, we call sample.
sample is the workhorse of PyMC; it provides implementations of several MCMC algorithms, which generate samples from the posterior distribution.
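Before using the results, it is a good idea to look at the sampler’s diagnostics; one way, depending on your PyMC version, is pm.summary, which reports the posterior mean along with basic convergence statistics:

# summarize the posterior sample and basic convergence diagnostics
with model:
    print(pm.summary(trace))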
From the resulting trace, we can extract the sampled values of mu.
sample_posterior = trace['mu']
Here’s what their distribution looks like, compared to the prior.
This result is called the posterior distribution because it represents what we believe about mu after seeing the data. In this example, the team scored more goals than the mean of the prior distribution, so in the posterior distribution, values above the mean are more likely and values below the mean are less likely.
The mean of this posterior is close to 3, which is more than the mean of the prior (2.4) but less than the number of goals scored (4). On the basis of one game, we think this team is better than average, but we don’t necessarily think their goal-scoring rate is 4.
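Because the gamma distribution is the conjugate prior of the Poisson, we can check this result analytically: after observing k goals in one game, the posterior is a gamma distribution with parameters alpha + k and beta + 1. Its mean should match the mean of the sample:

# conjugate update: Gamma(alpha, beta) prior plus k observed goals in 1 game -> Gamma(alpha + k, beta + 1)
k = 4
print((alpha + k) / (beta + 1), sample_posterior.mean())    # both should be close to 3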
The posterior predictive distribution
At this point, we have a posterior distribution that represents what we believe about the average goal-scoring rate, but in a single game, the team might score more or less than their average. To see what the distribution of goals would be in the next game, we can use sample_posterior_predictive, like this:
with model:
    post_pred = pm.sample_posterior_predictive(trace)
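The result is a dictionary keyed by variable name, so we can extract the sample of goals like this:

sample_post_pred = post_pred['goals']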
For each value of mu in the trace, this function runs the model forward, drawing a sample from a Poisson distribution with the given mean. Here’s what the results look like.
In the prior predictive distribution, the most likely outcome was one goal. In the posterior predictive distribution, it is two goals, and three goals are almost as likely. And that makes sense: having seen the team score four goals, we have reason to think their goal-scoring rate is better than average, so we expect them to score more goals than average in the next game.
The predictive framework
In this example, we generated samples from four distributions. The following diagram shows how they are related.
- We start with a prior that represents what we believe about the goal-scoring rate for a random team.
- Then we generate the prior predictive distribution, which indicates the number of goals we expect a random team to score. Checking this distribution helps validate the model.
- Using data, we can do a Bayesian update; the result is the posterior distribution, which represents what we believe about the observed team.
- Based on the posterior distribution, we can generate the posterior predictive distribution, which predicts the number of goals we expect the observed team to score in a future game.
When you are getting started, it can be hard to keep these distributions straight. One reason I like this example is that the goal-scoring rate is continuous and the number of goals is discrete. I use a line graph to represent the prior and posterior distributions, and histograms to represent the predictive distributions, which makes it a little easier to keep them straight.
This article is based on the first example in my workshop, Bayesian Inference with PyMC, which I will offer for the first time on August 17, 2021. In the workshop, we will extend this example to two teams, in order to predict the outcome of future games. We’ll also implement a beta-binomial model, which is used to estimate proportions; as one example, we’ll work with survival data from different hospitals. Finally, we’ll make both models hierarchical, which makes it possible to update our beliefs about many teams, or many hospitals, at the same time.
About the author:
Allen Downey is a Professor of Computer Science at Olin College of Engineering in Needham, MA. He is the author of several books related to computer science and data science, including Think Python, Think Stats, Think Bayes, and Think Complexity. Prof Downey has taught at Colby College and Wellesley College, and in 2009 he was a Visiting Scientist at Google. He received his Ph.D. in Computer Science from U.C. Berkeley, and M.S. and B.S. degrees from MIT.