artiebits.com

Predicting Football Results with the Poisson Regression Model

Have you ever wondered how people make money by betting on sports? I have. Especially after I learned about Tony Bloom and Bill Benter who use math models to do it.

Can I do it too? I decided to find that out by building my own model for football, a sport that my wife and I like very much.

As it turns out, there is an old statistical method that tells how often rare things expected to happen in a certain time period. Like how many earthquakes we can expectin Japan this year, or how many goals we expect Messi to score in a football match.

It’s based on the idea that any random process that produces discrete events in a fixed time follows a pattern called Poisson distribution. The random process here is a football match, the fixed time is 90 minutes, and the discrete events are goals.

Football is a random process because there are so many factors that can change the course of a match, such as strategies of coaches, weather conditions, referee’s decisions, injuries, fans’ support, etc. Football is also low-scoring sport compared to basketball or tennis. This means a single goal can change the course of a match significantly too.

So, the math formula for the Poisson distribution is simple:

P(k;λ)=λkeλk!,λ>0P(k; \lambda) = \frac{\lambda^k e^{-\lambda}}{k!}, \lambda > 0

Where:

To use this formula, the following assumptions must be met: average of goals per match must be positive, and the number of expected goals must be a whole number. Because you can’t score -1 or 1.5 goals, right?

Here is an example of predicting probability of a team scoring 2 goals in the next match and given that this team on average scores 1.45 per match:

P(2;1.45)=1.452e1.452!0.24659P(2; 1.45) = \frac{1.45^2 e^{-1.45}}{2!} \approx 0.24659

Which gives us 25% chance of scoring 2 goals in the next match.

You can use this formula to calculate the probability of different number of goals (0, 1, 2, 3, …, NN).

Obtaining and Exploring Data

I use historical data of English Premier League seasons 2010/2011 to 2020/2021, which I scrapped from football-data.co.uk. This website offers lots of data for free, but the data of each season is in separate CSV. I wrote a function that downloaded all files and appended them into one big CSV:

import pandas as pd
import requests
from bs4 import BeautifulSoup


def fetch_data(competition: str, page: str) -> pd.DataFrame:
    base_url = "https://www.football-data.co.uk/"
    response = requests.get(f"{base_url}{page}")
    soup = BeautifulSoup(response.content, "lxml")

    # Find the table containing links to the data
    table = soup.find_all(
        "table", attrs={"align": "center", "cellspacing": "0", "width": "800"}
    )[1]
    body = table.find_all("td", attrs={"valign": "top"})[1]

    # Extract links and link texts from the table
    links = [link.get("href") for link in body.find_all("a")]
    link_texts = [link.text for link in body.find_all("a")]

    # Filter the links for the given competition name and exclude unwanted data
    data_urls = [
        f"{base_url}{links[i]}"
        for i, text in enumerate(link_texts)
        if text == competition
    ][:-12]

    # Fetch data from the urls and concatenate them into a single DataFrame
    data_frames = list(map(fetch_season_data, data_urls))
    merged_df = (
        pd.concat(data_frames, ignore_index=True)
        .dropna(axis=1)
        .dropna()
        .sort_values(by="Date")
    )

    return merged_df


def fetch_season_data(url: str) -> pd.DataFrame:
    season = url.split("/")[4]
    print(f"Getting data for {season}")
    temp_df = pd.read_csv(url)
    temp_df["Season"] = season
    temp_df = (
        temp_df.dropna(axis="columns", thresh=temp_df.shape[0] - 30)
        .drop("Div", axis=1)
        .assign(Date=lambda df: pd.to_datetime(df.Date, dayfirst=True))
        .dropna()
    )
    return temp_df


fetch_data("Premier League", "englandm.php").to_csv("data/epl.csv")

Let’s explore home team win rates to see if there is a consistent trend. To do this, we will create three new columns home_win, draw, and away_win. We will assign 1 or 0 to these columns based on each game’s outcome, then group the result by season and calculate the mean:

df = pd.read_csv("data/epl.csv", dtype={"Season": str})

win_rates = (
    df.assign(
        home_win=lambda df: df.apply(
            lambda row: 1 if row.FTHG > row.FTAG else 0, axis="columns"
        ),
        draw=lambda df: df.apply(
            lambda row: 1 if row.FTHG == row.FTAG else 0, axis="columns"
        ),
        away_win=lambda df: df.apply(
            lambda row: 1 if row.FTHG < row.FTAG else 0, axis="columns"
        ),
    )
    .groupby("Season")
    .mean(numeric_only=True)
    .loc[:, ["home_win", "draw", "away_win"]]
)

Let’s plot the result:

import matplotlib.pyplot as plt

# plot the data
plt.plot(win_rates.index, win_rates["home_win"], label="Home Win")
plt.plot(win_rates.index, win_rates["draw"], label="Draw")
plt.plot(win_rates.index, win_rates["away_win"], label="Away Win")

# format the plot
plt.xticks(rotation=60)
plt.xlabel("Season")
plt.ylabel("Win Rate")
plt.legend()
plt.show()

Home team win rates

As we can see, the win rates are relatively stable across seasons (except for the current one): home team wins about 45% of the time, draws occur around 25% of the time, and the away team wins about 30% of the time. This is known as home-field advantage, a factor we want to include in our formula, along with the team’s attack strength and the opponent’s defence strength.

How do we do that?

Remember the lambda in the Poisson distribution formula? Lambda can vary depending on different factors. For instance, it may be higher for Manchester United than for Fulham because Manchester is stronger. Lambda may also be higher when a team plays at home on a familiar field with supportive fans.

To calculate lambda we need four columns in our data: team name, opponent name, number of goals scored and playing-at-home indicator. We also need two rows per match (one for each team), because we calculate lambda for each team in every match.

For instance, the data set shown below:

HomeAwayHomeGoalsAwayGoals
Man UnitedFulham20

Must be transformed into this:

teamopponentgoalshome
Man UnitedFulham21
FulhamMan United00

With Python it is easier done than said.

def create_model_data(
    home_team,
    away_team,
    home_goals,
    away_goals,
) -> pd.DataFrame:
    home_df = pd.DataFrame(
        data={
            "team": home_team,
            "opponent": away_team,
            "goals": home_goals,
            "home": 1,
        }
    )
    away_df = pd.DataFrame(
        data={
            "team": away_team,
            "opponent": home_team,
            "goals": away_goals,
            "home": 0,
        }
    )
    return pd.concat([home_df, away_df])

model_df= create_model_data(df.HomeTeam, df.AwayTeam, df.FTHG, df.FTAG)

When the data is in a suitable format, we use this formula:

log(λi)=αteami+βopponenti+γhomeilog(\lambda_i) = \alpha_{team_i} + \beta_{opponent_i} + \gamma * home_i

Where:

Fitting the model

Now that we have our data ready, we can fit the model.

import statsmodels.api as sm
import statsmodels.formula.api as smf

def fit_model(
    model_data: pd.DataFrame,
) -> sm.regression.linear_model.RegressionResultsWrapper:
    return smf.glm(
        formula="goals ~ home + team + opponent",
        data=model_data,
        family=sm.families.Poisson(),
    ).fit()

model = fit_model(model_df)
print(model.summary())

model.summary() shows the results of fitting the model.

                 Generalized Linear Model Regression Results
================================================================================================
                                   coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------------------
Intercept                        0.2370      0.051      4.649      0.000       0.137       0.337
team[T.Chelsea]                  0.0017      0.042      0.041      0.967      -0.081       0.085
team[T.Leicester]               -0.1688      0.058     -2.890      0.004      -0.283      -0.054
opponent[T.Chelsea]             -0.1481      0.058     -2.557      0.011      -0.262      -0.035
opponent[T.Leicester]            0.1960      0.066      2.948      0.003       0.066       0.326
home                             0.2754      0.016     17.403      0.000       0.244       0.306
================================================================================================

You will see a much bigger summary in your Python console, with all clubs who have played in the Premier League since season 2010/2011, but I have removed a portion of it for convenience. The most interesting column here is coef, which shows us the values of our coefficients.

Predicting Outcomes

If we want to predict how many goals Chelsea will score against Leicester while playing at home, then we will plug in the numbers into our formula like this:

log(λChelsea)=0.2370+0.1960+0.27541log(\lambda_{Chelsea}) = 0.2370 + 0.1960 + 0.2754 * 1

This means The Blues are playing at home (homeChelsea=1home_{Chelsea} = 1), they have home-field advantage of (γ=0.2754\gamma = 0.2754), they have an intercept of 0.2370, and they are playing against Leicester who has a coefficient of 0.1960.

This simplifies to λi=e0.7084=2.03\lambda_i = e^{0.7084} = 2.03 which means we expect Chelsea to score about 2 goals while playing home against Leicester.

Chelsea and Leicester actually played in the last week of season 20/21, so we can compare our prediction with the actual result. And I, of course, excluded this match from the dataset used to train the model.

To make our prediction, we will use the model.predict() method.

home_team = "Chelsea"
away_team = "Leicester"

home_goals = model.predict(
    pd.DataFrame(
        data={"team": home_team, "opponent": away_team, "home": 1},
        index=[1],
    )
).values[0]

away_goals = model.predict(
    pd.DataFrame(
        data={"team": away_team, "opponent": home_team, "home": 0},
        index=[1],
    )
).values[0]

print(home_goals)  # 2.03
print(away_goals)  # 0.92

The model.predict() returns a numpy array with predicted values. In this case, our model predicts that Chelsea will score 2.03 goals, while Leicester will score 0.92. The actual result of that match? It was 2-1. Not bad for 50 lines of code, right?

But what if we want to know the probability of other outcomes too, such as a win, a draw, or a loss? We can do it too. We take the Poisson distribution and the outer product, and smash them together.

The Poisson distribution tells how likely it is for each team to score xx goals based on their average. The outer product takes probabilities of both team and multiplies them together to make a matrix.

import numpy as np
from scipy.stats import poisson

home_goals = 2.03
away_goals = 0.92

max_goals = 6

proba = [
    [poisson.pmf(i, team_avg) for i in range(0, max_goals)]
    for team_avg in [home_goals, away_goals]
]

matrix = np.outer(np.array(proba[0]), np.array(proba[1]))

print(matrix)

The result is something like this:

[5.23397059e-02, 4.81525295e-02, 2.21501636e-02, 6.79271682e-03, 1.56232487e-03, 2.87467776e-04],
[1.06249603e-01, 9.77496348e-02, 4.49648320e-02, 1.37892152e-02, 3.17151949e-03, 5.83559585e-04],
[1.07843347e-01, 9.92158794e-02, 4.56393045e-02, 1.39960534e-02, 3.21909228e-03, 5.92312979e-04],
[7.29739982e-02, 6.71360784e-02, 3.08825960e-02, 9.47066279e-03, 2.17825244e-03, 4.00798449e-04],
[3.70343041e-02, 3.40715598e-02, 1.56729175e-02, 4.80636136e-03, 1.10546311e-03, 2.03405213e-04],
[1.50359275e-02, 1.38330533e-02, 6.36320450e-03, 1.95138271e-03, 4.48818024e-04, 8.25825165e-05]

Each cell in this matrix represents probability of Chelsea scoring x goals and Leicester scoring y goals. You can look at this matrix like this:

["0-0", "0-1", "0-2", "0-3", "0-4", "0-5"],
["1-0", "1-1", "1-2", "1-3", "1-4", "1-5"],
["2-0", "2-1", "2-2", "2-3", "2-4", "2-5"],
["3-0", "3-1", "3-2", "3-3", "3-4", "3-5"],
["4-0", "4-1", "4-2", "4-3", "4-4", "4-5"],
["5-0", "5-1", "5-2", "5-3", "5-4", "5-5"]

This matrix shows all possible ways a football match can end.

When we look at the diagonal line that goes from top left to bottom right - that’s where all the draws are. Why? Because those are the cases where both teams score the same number of goals. To get the probability of a draw, just add up all numbers on that line.

When we look at everything below that line - that’s where all the home team wins are. Because those are the cases where the home team scores more than the away team. To get the probability of a home team win, add up all numbers below the line.

And finally, look at everything above that line. That’s where all the away team wins are. To find the probability of an away team win, add up all numbers above that area.

home_team_win = np.sum(np.tril(matrix, -1))
draw = np.sum(np.diag(matrix))
away_team_win = np.sum(np.triu(matrix, 1))

print(home_team_win)  # 0.6135
print(draw)  # 0.2063
print(away_team_win)  # 0.1620

But can we trust this model? How accurate are its predictions?

Let’s put it to the test.

We will predict matches that already happened in season 2020/2021 and compare them with actual results.

To make the model-creation related logic more flexible, I refactored it into a function that accepts a list of home and away teams. This allows us to simulate matches for multiple team combinations.

def simulate_match(
    home_teams_df,
    away_teams_df,
    model,
    max_goals = 6,
):
    home_teams = home_teams_df.tolist()
    away_teams = away_teams_df.tolist()

    df = pd.DataFrame()

    for i in range(0, len(home_teams)):
        expg1 = model.predict(
            pd.DataFrame(
                data={
                    "team": home_teams[i],
                    "opponent": away_teams[i],
                    "home": 1,
                },
                index=[1],
            )
        ).values[0]

        expg2 = model.predict(
            pd.DataFrame(
                data={
                    "team": away_teams[i],
                    "opponent": home_teams[i],
                    "home": 0,
                },
                index=[1],
            )
        ).values[0]

        team_pred = [
            [poisson.pmf(i, team_avg) for i in range(0, max_goals + 1)]
            for team_avg in [expg1, expg2]
        ]

        matrix = np.outer(np.array(team_pred[0]), np.array(team_pred[1]))

        home_team_win = np.sum(np.tril(matrix, -1))
        draw = np.sum(np.diag(matrix))
        away_team_win = np.sum(np.triu(matrix, 1))

        temp_df = pd.DataFrame(
            data={
                "home_team": home_teams[i],
                "away_team": away_teams[i],
                "home_team_win": home_team_win,
                "draw": draw,
                "away_team_win": away_team_win,
            },
            index=[1],
        )

        df = pd.concat([df, temp_df], ignore_index=True)

    return df

Now we will split our data into training and validation datasets. The training set is used to build the model, and the validation dataset is used to test its accuracy. To do this, we will use a handy function from scikit-learn called train_test_split:

from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

train_data, test_data = train_test_split(df, random_state=0)

model_df = create_model_data(
    train_data.HomeTeam, train_data.AwayTeam, train_data.FTHG, train_data.FTAG
)

model = fit_model(model_df)

# Make predictions using the test set
predictions = simulate_match(test_data.HomeTeam, test_data.AwayTeam, model)

actual_outcomes = test_data.FTR

predicted_outcomes = np.where(
    predictions.home_team_win > predictions.away_team_win,
    "H",
    np.where(
        predictions.away_team_win > predictions.home_team_win,
        "A",
        "D",
    ),
)

# Calculate accuracy score of the model
accuracy = accuracy_score(actual_outcomes, predicted_outcomes)
print("Accuracy:", accuracy)  # 0.55

Our model has a 55% accuracy rate, which is 5% better than random guessing. Not bad, huh? Sometimes, that’s enough to make money. Casinos know that well.

But we’re up against bookmakers here. They have more data, they employ armies of nerds and quants. And use their predictions to set the odds. And they make sure the odds are always in their favor.

The Poisson model alone sucks against bookmakers. It ignores everything that matters in a game, such as team lineup, weather conditions, etc. It assumes that goals are independent of one another, when in reality, teams can switch to a defensive mode after scoring a goal, or feel confident enough to score more. It’s too naive to compete against bookmakers.

Still, I am not giving up. The Poisson model is a good starting point from which we can add features that more closely reflect reality. Next time I’ll show you how to incorporate time-weighting to the Poisson model so that recent matches are more important than older ones. Maybe that will boost the accuracy?