The football World Cup is approaching, and we all want to know in advance who the next winner will be. We could actually use data from previous competitions to assign a probability of success to each team…
But before going that far, we can use football data to understand some statistical concepts.
Let’s say we want to know the chance of an average player scoring a penalty. After asking some experts, we settle on a 75 percent chance of scoring from a penalty.
We are interested in learning about the location of this probability of success, \(p\), and we will take the Bayesian viewpoint to estimate it. That is, we want to find the most plausible value of the proportion \(p\), the probability of scoring when a team is awarded a penalty.
Our prior belief is that a player has a 75% chance of scoring. We will use data from the first league to update our belief and derive a plausible interval for \(p\).
With Bayesian analysis, we have beliefs about the value of the proportion, and we model those beliefs in terms of a prior distribution.
Once the prior is constructed, we can use data to update our belief about the proportion by computing the posterior distribution. We may also be interested in predicting the likely outcomes of a new sample taken from the population.
To construct a Bayesian analysis, we need a prior distribution for the unknown quantity, a likelihood for the observed data, and a rule to combine them. That rule is Bayes' theorem:
\[ p(A|B) = \frac{p(B|A)\,p(A)}{p(B)} \]
In short, we are attempting to ascertain the conditional probability of \(A\) given \(B\) from the conditional probability of \(B\) given \(A\) and the respective probabilities of \(A\) and \(B\).
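As a quick numerical illustration (all probabilities below are made up for the example), the theorem is just one line of arithmetic in R:

p_B_given_A <- 0.9        # made-up p(B|A)
p_A <- 0.2                # made-up p(A)
p_B <- 0.3                # made-up p(B)
p_B_given_A * p_A / p_B   # p(A|B) = 0.6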
The Bayesian posterior is proportional to the product of the likelihood and the prior:
\[ \text{posterior} \propto \text{likelihood} \times \text{prior} \]
In the above formulation, we are trying to obtain the probability of a hypothesis given the evidence at hand (the data) and our initial (prior) beliefs regarding that hypothesis.
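As a minimal sketch of this updating rule, we can carry it out on a grid of candidate values of \(p\); the prior and the counts (6 goals in 8 penalties) are invented purely for illustration:

p_grid <- seq(0, 1, by = 0.01)                    # candidate values of p
prior <- dbeta(p_grid, 2, 2)                      # an illustrative prior
likelihood <- dbinom(6, size = 8, prob = p_grid)  # 6 goals in 8 penalties
posterior <- prior * likelihood                   # posterior is proportional to likelihood * prior
posterior <- posterior / sum(posterior)           # normalise over the grid
p_grid[which.max(posterior)]                      # most plausible value, 0.7 here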
Our prior value is \(p = 75\%\). We will model our prior belief about \(p\) with a Beta distribution.
The Beta distribution is a continuous probability distribution with two parameters. One of its most common uses is to model one’s uncertainty about the probability of success of an experiment; that is, it represents all the possible values of a probability when we do not know what that probability is.
In our experiment, only two outcomes are possible: the player either scores or misses.
Before seeing any data, the probability of success, call it \(X\), is unknown, and all its possible values may be deemed equally likely. This uncertainty can be described by assigning to \(X\) a uniform distribution on the interval \([0,1]\); that is, \(X\) can only take values between 0 and 1.
In other words, no higher weight is given, a priori, to one value of \(X\) compared with another.
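A quick check in R confirms this: the uniform distribution on \([0,1]\) is the special case Beta(1, 1), whose density is flat.

x <- c(0.1, 0.5, 0.9)
dbeta(x, 1, 1)   # 1 1 1: the Beta(1, 1) density is flat
dunif(x, 0, 1)   # identical to the uniform density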
In our example, we will let \(k\) be the number of successes and \(n-k\) the number of failures from \(n\) independent repetitions, and then use the data to update the Beta parameters.
We can write the Beta prior density as:
\[ g(p) \propto p^{\alpha-1}(1-p)^{\beta-1} \] where the hyperparameters \(\alpha\) and \(\beta\) are chosen to reflect our prior beliefs about \(p\).
The mean of a Beta prior is:
\[ m = \frac{\alpha}{\alpha + \beta} \] and the variance:
\[ v = \frac{m(1-m)}{\alpha+\beta+1} \] It is easier to obtain values for the hyperparameters through statements about percentiles of the distribution.
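We can check these two formulas against a large simulated Beta sample; the hyperparameter values below are arbitrary, chosen only for the check:

a0 <- 10; b0 <- 4                    # arbitrary hyperparameters
m0 <- a0 / (a0 + b0)                 # mean, 10/14, about 0.714
v0 <- m0 * (1 - m0) / (a0 + b0 + 1)  # variance, about 0.0136
set.seed(1)
draws <- rbeta(1e5, a0, b0)          # large sample from Beta(10, 4)
c(m0, mean(draws))                   # theoretical vs simulated mean
c(v0, var(draws))                    # theoretical vs simulated variance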
We ask our football expert, and he is confident that the success rate is most likely around 0.75, but that it could reasonably range from 0.70 to 0.80.
Accordingly, our belief is that the median and the 90th percentile are 0.75 and 0.80, respectively.
library(LearnBayes)
# two prior statements: the 90th percentile is 0.80 and the median is 0.75
q1 <- list(p = 0.9, x = 0.80)
q2 <- list(p = 0.5, x = 0.75)
# find the Beta(a, b) prior that matches both percentile statements
parameter <- beta.select(q1, q2)
a <- parameter[1]
b <- parameter[2]
We see that this prior information is matched by a Beta density with \(\alpha = 83.46\) and \(\beta = 28.05\).
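As a sanity check, the percentiles of this fitted prior should reproduce the two statements we gave to beta.select:

qbeta(0.5, 83.46, 28.05)   # median, close to 0.75
qbeta(0.9, 83.46, 28.05)   # 90th percentile, close to 0.80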
We can plot the Beta prior.
library(ggplot2)
library(dplyr)
# three Beta densities: prior (83.46, 28.05), likelihood (81+1, 25+1),
# and posterior (83.46+81, 28.05+25)
sim <- data.frame(a = c(83.46, 81 + 1, 83.46 + 81),
                  b = c(28.05, 25 + 1, 28.05 + 25)) %>%
  group_by(a, b) %>%
  do(tibble(x = seq(0, 1, .001), y = dbeta(x, .$a, .$b))) %>%
  mutate(Parameters = paste0("\u03B1 = ", a, ", \u03B2 = ", b)) %>%
  ungroup() %>%
  mutate(Parameters = factor(Parameters, levels = unique(Parameters)))
sim %>% filter(a == 83.46) %>%
ggplot(aes(x, y, color = Parameters)) + geom_line() +
xlim(0.5, .95) + ylab("Density of beta") +
theme_classic()
The likelihood function is given by:
\[ L(p) \propto p^{s}(1-p)^{f}, \qquad 0<p<1 \] where \(s\) is the number of successes and \(f\) the number of failures; viewed as a function of \(p\), it is proportional to a \(\text{Beta}(s+1, f+1)\) density.
From the data, we can see 81 goals and 25 missed penalties.
## Import data
library(dplyr)
library(ggplot2)
data <- read.csv("penalty_data.csv") %>%
  select(No., Scored) %>%
  mutate(goal = as.numeric(Scored == "Scored"))
## goal: 1 means scored, 0 means missed
table(data$Scored)
##
## Missed Scored
## 25 81
We set \(k\) as the number of successes and \(n-k\) as the number of failures, where \(n\) is the number of independent trials.
k <- sum(data$Scored == 'Scored')    # number of goals (successes)
N_k <- sum(data$Scored == 'Missed')  # number of misses (failures)
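Before combining prior and data, note what the data alone suggest: the maximum-likelihood estimate of \(p\) is simply the observed success rate.

k / (k + N_k)   # 81 / 106, about 0.764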
We can plot the likelihood, which corresponds to the \(\text{Beta}(82, 26)\) curve built above:
sim %>% filter(a == 82) %>%
ggplot(aes(x, y, color = Parameters)) + geom_line() +
xlim(0.5, .95) + ylab("Density of beta") +
theme_classic()
Combining the Beta prior with the likelihood function, we can show that the posterior density is also of the Beta form, with updated parameters \(\alpha + s\) and \(\beta + f\):
\[ g(p|data) \propto p^{\alpha +s-1}(1-p)^{\beta +f-1} \]
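In our conjugate setting, the update is simple arithmetic on the Beta parameters:

a_post <- a + k              # 83.46 + 81 = 164.46
b_post <- b + N_k            # 28.05 + 25 = 53.05
a_post / (a_post + b_post)   # posterior mean, about 0.756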
sim %>% ggplot(aes(x, y, color = Parameters)) + geom_line() +
xlim(0.5, .95) + ylab("Density of beta")
We can see that, by adding data to our prior belief, the posterior curve is narrower than the prior was. We have a better sense of what the player’s scoring rate is.
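We can quantify this narrowing by comparing the prior and posterior standard deviations (a small helper sketch):

beta_sd <- function(a, b) sqrt(a * b / ((a + b)^2 * (a + b + 1)))
beta_sd(a, b)            # prior sd, about 0.041
beta_sd(a + k, b + N_k)  # posterior sd, about 0.029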
A 90% interval estimate for \(p\) is found by computing the 5th and 95th percentiles of the posterior Beta density:
qbeta(c(0.05, 0.95), a+k, b+N_k)
## [1] 0.7069988 0.8025243
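The same interval can also be approximated by simulation, which is how Bayesian summaries are often computed when no closed form is available:

set.seed(123)
post_draws <- rbeta(1e5, a + k, b + N_k)   # draws from the Beta posterior
quantile(post_draws, c(0.05, 0.95))        # close to the exact qbeta values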
So far we have focused on learning about \(p\), the population proportion of penalties scored.
We are also interested in predicting \(\tilde{y}\), the number of goals in a future sample of \(m = 20\) penalties.
We can do so with the predictive distribution that arises from a Binomial sampling model combined with a Beta density on \(p\).
Let \(\tilde{y} \sim \text{Binomial}(m, p)\), where \(m\) denotes the number of trials, and suppose that \(p\) follows a \(\text{Beta}(\alpha, \beta)\) density. Then the predictive distribution of \(\tilde{y}\) successes out of \(m\) trials is:
\[ f(\tilde{y}) = \int f_B(\tilde{y} \mid m, p)\, g(p)\, dp = \binom{m}{\tilde{y}}\frac{B(\alpha+\tilde{y},\, \beta+m-\tilde{y})}{B(\alpha, \beta)} \] To illustrate, we use the Beta parameters \(\alpha = 83.46\) and \(\beta = 28.05\):
m <- 20
ys <- 0:20
# predictive probabilities, computed on the log scale for numerical stability
log_val <- lchoose(m, ys) + lbeta(a + ys, b + m - ys) - lbeta(a, b)
pred <- exp(log_val)
df_pred <- data.frame(ys, pred)
ggplot(df_pred, aes(ys, pred)) +
  geom_col() + ylab("Predictive probability") + theme_classic()
We can see that the highest probability is found around 15 to 16 goals out of the 20 future penalties.
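We can confirm this reading of the plot directly from the predictive probabilities:

ys[which.max(pred)]   # most probable number of goals: 15
sum(ys * pred)        # expected number of goals, about 15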