artiebits.com

Predicting Football Results with Poisson Regression

Have you ever wondered how people make money by betting on sports? I have, after learning about Tony Bloom and Bill Benter, who made fortunes using mathematical models.

This sparked my curiosity to build my own model for football, a sport my wife and I enjoy.

As it turned out, there is a statistical method for predicting how often rare events are expected to happen in a fixed time period. Like how many earthquakes might hit Japan this year, or how many goals Messi might score in a match.

It is called Poisson distribution, and it tells the probability of discrete events happening in a fixed interval when the average rate of occurrence is known and constant.

Football matches fit this description well, because there is a lot of randomness that can change the course of a match (coach strategy, weather, referee decisions, injuries) where discrete events (goals) occur within a fixed time (90 minutes). It’s also a low-scoring sport compared to basketball or tennis. This means a single goal can change the course of a game significantly.

The formula for the Poisson distribution is:

P(k;λ)=λkeλk!,λ>0P(k; \lambda) = \frac{\lambda^k e^{-\lambda}}{k!}, \lambda > 0

Where:

To use this formula, the average number of goals (λ\lambda) must be positive number, and the events (goals) must be discrete (you can’t score half a goal).

Here’s an example: predicting the probability of a team scoring 2 goals in their next match, given their average is 1.45 goals per match:

P(2;1.45)=1.452e1.452!0.24659P(2; 1.45) = \frac{1.45^2 e^{-1.45}}{2!} \approx 0.24659

This gives a 25% chance of scoring exactly 2 goals in the next match. You can use this formula to calculate the probability for any number of goals (0, 1, 2, 3, …).

Obtaining and Exploring Data

I used historical data from English Premier League seasons 2010/2011 to 2020/2021, which I scraped from football-data.co.uk. This site offers a lot of data for free (many thanks!), though each season is in a separate CSV file, so I wrote a Python function to download these files and combine them into a single CSV:

import pandas as pd
import requests
from bs4 import BeautifulSoup


def fetch_data(competition: str, page: str) -> pd.DataFrame:
    base_url = "https://www.football-data.co.uk/"
    response = requests.get(f"{base_url}{page}")
    soup = BeautifulSoup(response.content, "lxml")

    # Find the table containing links to the data
    table = soup.find_all(
        "table", attrs={"align": "center", "cellspacing": "0", "width": "800"}
    )[1]
    body = table.find_all("td", attrs={"valign": "top"})[1]

    # Extract links and link texts from the table
    links = [link.get("href") for link in body.find_all("a")]
    link_texts = [link.text for link in body.find_all("a")]

    # Filter the links for the given competition name and exclude unwanted data
    data_urls = [
        f"{base_url}{links[i]}"
        for i, text in enumerate(link_texts)
        if text == competition
    ][:-12]

    # Fetch data from the urls and concatenate them into a single DataFrame
    data_frames = list(map(fetch_season_data, data_urls))
    merged_df = (
        pd.concat(data_frames, ignore_index=True)
        .dropna(axis=1)
        .dropna()
        .sort_values(by="Date")
    )

    return merged_df


def fetch_season_data(url: str) -> pd.DataFrame:
    season = url.split("/")[4]
    print(f"Getting data for {season}")
    temp_df = pd.read_csv(url)
    temp_df["Season"] = season
    temp_df = (
        temp_df.dropna(axis="columns", thresh=temp_df.shape[0] - 30)
        .drop("Div", axis=1)
        .assign(Date=lambda df: pd.to_datetime(df.Date, dayfirst=True))
        .dropna()
    )
    return temp_df


fetch_data("Premier League", "englandm.php").to_csv("data/epl.csv")

Let’s explore home team win rates to see if there’s a consistent trend. We’ll create columns for home win, draw, and away win (1 or 0), then group by season and calculate the mean:

df = pd.read_csv("data/epl.csv", dtype={"Season": str})

win_rates = (
    df.assign(
        home_win=lambda df: df.apply(
            lambda row: 1 if row.FTHG > row.FTAG else 0, axis="columns"
        ),
        draw=lambda df: df.apply(
            lambda row: 1 if row.FTHG == row.FTAG else 0, axis="columns"
        ),
        away_win=lambda df: df.apply(
            lambda row: 1 if row.FTHG < row.FTAG else 0, axis="columns"
        ),
    )
    .groupby("Season")
    .mean(numeric_only=True)
    .loc[:, ["home_win", "draw", "away_win"]]
)

Now, let’s plot the results:

import matplotlib.pyplot as plt

# plot the data
plt.plot(win_rates.index, win_rates["home_win"], label="Home Win")
plt.plot(win_rates.index, win_rates["draw"], label="Draw")
plt.plot(win_rates.index, win_rates["away_win"], label="Away Win")

# format the plot
plt.xticks(rotation=60)
plt.xlabel("Season")
plt.ylabel("Win Rate")
plt.legend()
plt.show()

Home team win rates

As we can see in the plot, win rates are quite stable across seasons (except possibly the most recent incomplete one): home team wins around 45% of the time, draws occur about 25%, and away team wins roughly 30%. This home-field advantage is a factor we want to incorporate into our model, along with each team’s attacking and defensive strengths.

How can we do this?

Remember the lambda (the average number of goals) in the Poisson distribution formula?

Lambda is not a fixed value but varies depending on factors like the specific teams playing. It might be higher for Manchester United than for Fulham because Manchester United is generally stronger. It might also be higher when a team plays at home due to the familiar environment and crowd support.

To calculate lambda for each team in every match, our data needs four columns: team name, opponent name, goals scored, and a home indicator. We willl structure data with two rows per match – one for home team and one for away team.

For instance, a match like this:

HomeAwayHomeGoalsAwayGoals
Man UnitedFulham20

Should be transformed into this format for the model:

teamopponentgoalshome
Man UnitedFulham21
FulhamMan United00

This transformation is straightforward in Python:

def create_model_data(
    home_team,
    away_team,
    home_goals,
    away_goals,
) -> pd.DataFrame:
    home_df = pd.DataFrame(
        data={
            "team": home_team,
            "opponent": away_team,
            "goals": home_goals,
            "home": 1,
        }
    )
    away_df = pd.DataFrame(
        data={
            "team": away_team,
            "opponent": home_team,
            "goals": away_goals,
            "home": 0,
        }
    )
    return pd.concat([home_df, away_df])

model_df= create_model_data(df.HomeTeam, df.AwayTeam, df.FTHG, df.FTAG)

With the data in this format, we can model the logarithm of the expected number of goals (λi\lambda_i) for team ii using a linear combination of factors:

log(λi)=α0+αteami+βopponenti+γhomeilog(\lambda_i) = \alpha_0 + \alpha_{team_i} + \beta_{opponent_i} + \gamma * home_i

Where:

Fitting the Model

Now that the data is ready, we fit the model using Generalized Linear Models (GLM) with a Poisson distribution family from the statsmodels library.

import statsmodels.api as sm
import statsmodels.formula.api as smf

def fit_model(
    model_data: pd.DataFrame,
) -> sm.regression.linear_model.RegressionResultsWrapper:
    return smf.glm(
        formula="goals ~ home + team + opponent",
        data=model_data,
        family=sm.families.Poisson(),
    ).fit()

model = fit_model(model_df)
print(model.summary())

model.summary() displays the fitting results:

                 Generalized Linear Model Regression Results
================================================================================================
                                   coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------------------
Intercept                        0.2370      0.051      4.649      0.000       0.137       0.337
team[T.Chelsea]                  0.0017      0.042      0.041      0.967      -0.081       0.085
team[T.Leicester]               -0.1688      0.058     -2.890      0.004      -0.283      -0.054
opponent[T.Chelsea]             -0.1481      0.058     -2.557      0.011      -0.262      -0.035
opponent[T.Leicester]            0.1960      0.066      2.948      0.003       0.066       0.326
home                             0.2754      0.016     17.403      0.000       0.244       0.306
================================================================================================

The full summary in your console will be much larger, showing coefficients for all teams, but I removed a big portion of it for convenience. The most important column here is coef, which gives us the values for our intercept, home advantage, and each team’s attack/defense strength relative to a baseline.

Predicting Outcomes

To predict how many goals Chelsea might score against Leicester at home, we plug the relevant coefficients from the summary into our formula:

log(λChelsea)=0.237(intercept)+0.0017(Chelseaattack)+0.196(Leicesterdefense)+0.2754(homeadvantage)1log(\lambda_{Chelsea}) = 0.237 (intercept) + 0.0017 (Chelsea attack) + 0.196 (Leicester defense) + 0.2754 (home advantage) * 1

This gives us λi=e0.71=2.03\lambda_i = e^{0.71} = 2.03 which means we expect Chelsea to score about 2.03 goals playing at home against Leicester.

Chelsea and Leicester actually played in the last week of season 20/21, so we can compare our prediction with the actual result. And I, of course, excluded this match from the dataset used to train the model.

We can use the model.predict() method to get this directly:

home_team = "Chelsea"
away_team = "Leicester"

home_goals = model.predict(
    pd.DataFrame(
        data={"team": home_team, "opponent": away_team, "home": 1},
        index=[1],
    )
).values[0]

away_goals = model.predict(
    pd.DataFrame(
        data={"team": away_team, "opponent": home_team, "home": 0},
        index=[1],
    )
).values[0]

print(home_goals)  # 2.03
print(away_goals)  # 0.92

The model predicts Chelsea to score 2.03 goals and Leicester to score 0.92. The actual result of that match in the final week of the 20/21 season (which I excluded from the training data) was 2-1. Not bad for 50 lines of code, right?

What if we want to know the probability of specific outcomes like a win, draw, or loss? We combine the Poisson distribution with the concept of an outer product.

The Poisson distribution provides the probability of each team scoring a certain number of goals (0, 1, 2, etc.) based on their predicted average λi\lambda_i. The outer product combines these individual probabilities into a matrix representing the probability of every possible scoreline.

import numpy as np
from scipy.stats import poisson

home_goals = 2.03
away_goals = 0.92

max_goals = 6

proba = [
    [poisson.pmf(i, team_avg) for i in range(0, max_goals)]
    for team_avg in [home_goals, away_goals]
]

matrix = np.outer(np.array(proba[0]), np.array(proba[1]))

print(matrix)

The resulting matrix might look something like this (values will differ slightly due to max_goals range):

[5.23397059e-02, 4.81525295e-02, 2.21501636e-02, 6.79271682e-03, 1.56232487e-03, 2.87467776e-04],
[1.06249603e-01, 9.77496348e-02, 4.49648320e-02, 1.37892152e-02, 3.17151949e-03, 5.83559585e-04],
[1.07843347e-01, 9.92158794e-02, 4.56393045e-02, 1.39960534e-02, 3.21909228e-03, 5.92312979e-04],
[7.29739982e-02, 6.71360784e-02, 3.08825960e-02, 9.47066279e-03, 2.17825244e-03, 4.00798449e-04],
[3.70343041e-02, 3.40715598e-02, 1.56729175e-02, 4.80636136e-03, 1.10546311e-03, 2.03405213e-04],
[1.50359275e-02, 1.38330533e-02, 6.36320450e-03, 1.95138271e-03, 4.48818024e-04, 8.25825165e-05]

Each cell matrix[x, y] contains the probability of the home team scoring x goals and the away team scoring y goals. We can conceptualize this matrix like a score grid:

["0-0", "0-1", "0-2", "0-3", "0-4", "0-5"],
["1-0", "1-1", "1-2", "1-3", "1-4", "1-5"],
["2-0", "2-1", "2-2", "2-3", "2-4", "2-5"],
["3-0", "3-1", "3-2", "3-3", "3-4", "3-5"],
["4-0", "4-1", "4-2", "4-3", "4-4", "4-5"],
["5-0", "5-1", "5-2", "5-3", "5-4", "5-5"]
home_team_win = np.sum(np.tril(matrix, -1))
draw = np.sum(np.diag(matrix))
away_team_win = np.sum(np.triu(matrix, 1))

print(home_team_win)  # 0.6135
print(draw)  # 0.2063
print(away_team_win)  # 0.1620

For the Chelsea vs Leicester example, the probabilities calculated were roughly 61% for a home win, 21% for a draw, and 16% for an away win.

Conclusion

This might seem promising, but we’re competing against professional bookmakers. They use far more data, employ armies of quants and nerds, and set odds in their favor. And the Poisson model alone sucks. It ignores everything that matters in a game, such as injuries, suspensions, lineups, player form, weather, etc. It also assumes goals happen independently, which isn’t always true – a team might become more defensive after scoring or more attacking when losing, which violates key Poisson assumptions that events are independent.

Despite these limitations, the Poisson model is a good starting point. We can enhance it by adding features that better reflect the complexities of football. In a future post, I’ll show how to incorporate time-weighting so that more recent matches have a greater influence on the model’s parameters.

Stay tuned for the next piece – maybe we will get closer to beating the bookies!