Predicting Football Results with Poisson Regression
Have you ever wondered how people make money by betting on sports? I have, after learning about Tony Bloom and Bill Benter, who made fortunes using mathematical models.
This sparked my curiosity to build my own model for football, a sport my wife and I enjoy.
As it turned out, there is a statistical method for predicting how often rare events are expected to happen in a fixed time period. Like how many earthquakes might hit Japan this year, or how many goals Messi might score in a match.
It is called Poisson distribution, and it tells the probability of discrete events happening in a fixed interval when the average rate of occurrence is known and constant.
Football matches fit this description well, because there is a lot of randomness that can change the course of a match (coach strategy, weather, referee decisions, injuries) where discrete events (goals) occur within a fixed time (90 minutes). It’s also a low-scoring sport compared to basketball or tennis. This means a single goal can change the course of a game significantly.
The formula for the Poisson distribution is:
Where:
- is the probability of scoring goals
- is the average number of goals expected
- is Euler’s number
- is the factorial function
To use this formula, the average number of goals () must be positive number, and the events (goals) must be discrete (you can’t score half a goal).
Here’s an example: predicting the probability of a team scoring 2 goals in their next match, given their average is 1.45 goals per match:
This gives a 25% chance of scoring exactly 2 goals in the next match. You can use this formula to calculate the probability for any number of goals (0, 1, 2, 3, …).
Obtaining and Exploring Data
I used historical data from English Premier League seasons 2010/2011 to 2020/2021, which I scraped from football-data.co.uk. This site offers a lot of data for free (many thanks!), though each season is in a separate CSV file, so I wrote a Python function to download these files and combine them into a single CSV:
import pandas as pd
import requests
from bs4 import BeautifulSoup
def fetch_data(competition: str, page: str) -> pd.DataFrame:
base_url = "https://www.football-data.co.uk/"
response = requests.get(f"{base_url}{page}")
soup = BeautifulSoup(response.content, "lxml")
# Find the table containing links to the data
table = soup.find_all(
"table", attrs={"align": "center", "cellspacing": "0", "width": "800"}
)[1]
body = table.find_all("td", attrs={"valign": "top"})[1]
# Extract links and link texts from the table
links = [link.get("href") for link in body.find_all("a")]
link_texts = [link.text for link in body.find_all("a")]
# Filter the links for the given competition name and exclude unwanted data
data_urls = [
f"{base_url}{links[i]}"
for i, text in enumerate(link_texts)
if text == competition
][:-12]
# Fetch data from the urls and concatenate them into a single DataFrame
data_frames = list(map(fetch_season_data, data_urls))
merged_df = (
pd.concat(data_frames, ignore_index=True)
.dropna(axis=1)
.dropna()
.sort_values(by="Date")
)
return merged_df
def fetch_season_data(url: str) -> pd.DataFrame:
season = url.split("/")[4]
print(f"Getting data for {season}")
temp_df = pd.read_csv(url)
temp_df["Season"] = season
temp_df = (
temp_df.dropna(axis="columns", thresh=temp_df.shape[0] - 30)
.drop("Div", axis=1)
.assign(Date=lambda df: pd.to_datetime(df.Date, dayfirst=True))
.dropna()
)
return temp_df
fetch_data("Premier League", "englandm.php").to_csv("data/epl.csv")
Let’s explore home team win rates to see if there’s a consistent trend. We’ll create columns for home win, draw, and away win (1 or 0), then group by season and calculate the mean:
df = pd.read_csv("data/epl.csv", dtype={"Season": str})
win_rates = (
df.assign(
home_win=lambda df: df.apply(
lambda row: 1 if row.FTHG > row.FTAG else 0, axis="columns"
),
draw=lambda df: df.apply(
lambda row: 1 if row.FTHG == row.FTAG else 0, axis="columns"
),
away_win=lambda df: df.apply(
lambda row: 1 if row.FTHG < row.FTAG else 0, axis="columns"
),
)
.groupby("Season")
.mean(numeric_only=True)
.loc[:, ["home_win", "draw", "away_win"]]
)
Now, let’s plot the results:
import matplotlib.pyplot as plt
# plot the data
plt.plot(win_rates.index, win_rates["home_win"], label="Home Win")
plt.plot(win_rates.index, win_rates["draw"], label="Draw")
plt.plot(win_rates.index, win_rates["away_win"], label="Away Win")
# format the plot
plt.xticks(rotation=60)
plt.xlabel("Season")
plt.ylabel("Win Rate")
plt.legend()
plt.show()
As we can see in the plot, win rates are quite stable across seasons (except possibly the most recent incomplete one): home team wins around 45% of the time, draws occur about 25%, and away team wins roughly 30%. This home-field advantage is a factor we want to incorporate into our model, along with each team’s attacking and defensive strengths.
How can we do this?
Remember the lambda (the average number of goals) in the Poisson distribution formula?
Lambda is not a fixed value but varies depending on factors like the specific teams playing. It might be higher for Manchester United than for Fulham because Manchester United is generally stronger. It might also be higher when a team plays at home due to the familiar environment and crowd support.
To calculate lambda for each team in every match, our data needs four columns: team name, opponent name, goals scored, and a home indicator. We willl structure data with two rows per match – one for home team and one for away team.
For instance, a match like this:
Home | Away | HomeGoals | AwayGoals |
---|---|---|---|
Man United | Fulham | 2 | 0 |
Should be transformed into this format for the model:
team | opponent | goals | home |
---|---|---|---|
Man United | Fulham | 2 | 1 |
Fulham | Man United | 0 | 0 |
This transformation is straightforward in Python:
def create_model_data(
home_team,
away_team,
home_goals,
away_goals,
) -> pd.DataFrame:
home_df = pd.DataFrame(
data={
"team": home_team,
"opponent": away_team,
"goals": home_goals,
"home": 1,
}
)
away_df = pd.DataFrame(
data={
"team": away_team,
"opponent": home_team,
"goals": away_goals,
"home": 0,
}
)
return pd.concat([home_df, away_df])
model_df= create_model_data(df.HomeTeam, df.AwayTeam, df.FTHG, df.FTAG)
With the data in this format, we can model the logarithm of the expected number of goals () for team using a linear combination of factors:
Where:
- is the intercept (baseline goal rate)
- is a coefficient representing the attack strength of team
- is a coefficient representing the defense strength of opponent
- is the coefficient of playing at home
- is a binary variable (1 if team is playing at home, 0 otherwise)
Fitting the Model
Now that the data is ready, we fit the model using Generalized Linear Models (GLM) with a Poisson distribution family from the statsmodels
library.
import statsmodels.api as sm
import statsmodels.formula.api as smf
def fit_model(
model_data: pd.DataFrame,
) -> sm.regression.linear_model.RegressionResultsWrapper:
return smf.glm(
formula="goals ~ home + team + opponent",
data=model_data,
family=sm.families.Poisson(),
).fit()
model = fit_model(model_df)
print(model.summary())
model.summary()
displays the fitting results:
Generalized Linear Model Regression Results
================================================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------------------------
Intercept 0.2370 0.051 4.649 0.000 0.137 0.337
team[T.Chelsea] 0.0017 0.042 0.041 0.967 -0.081 0.085
team[T.Leicester] -0.1688 0.058 -2.890 0.004 -0.283 -0.054
opponent[T.Chelsea] -0.1481 0.058 -2.557 0.011 -0.262 -0.035
opponent[T.Leicester] 0.1960 0.066 2.948 0.003 0.066 0.326
home 0.2754 0.016 17.403 0.000 0.244 0.306
================================================================================================
The full summary in your console will be much larger, showing coefficients for all teams, but I removed a big portion of it for convenience.
The most important column here is coef
, which gives us the values for our intercept, home advantage, and each team’s attack/defense strength relative to a baseline.
Predicting Outcomes
To predict how many goals Chelsea might score against Leicester at home, we plug the relevant coefficients from the summary into our formula:
This gives us which means we expect Chelsea to score about 2.03 goals playing at home against Leicester.
Chelsea and Leicester actually played in the last week of season 20/21, so we can compare our prediction with the actual result. And I, of course, excluded this match from the dataset used to train the model.
We can use the model.predict()
method to get this directly:
home_team = "Chelsea"
away_team = "Leicester"
home_goals = model.predict(
pd.DataFrame(
data={"team": home_team, "opponent": away_team, "home": 1},
index=[1],
)
).values[0]
away_goals = model.predict(
pd.DataFrame(
data={"team": away_team, "opponent": home_team, "home": 0},
index=[1],
)
).values[0]
print(home_goals) # 2.03
print(away_goals) # 0.92
The model predicts Chelsea to score 2.03 goals and Leicester to score 0.92. The actual result of that match in the final week of the 20/21 season (which I excluded from the training data) was 2-1. Not bad for 50 lines of code, right?
What if we want to know the probability of specific outcomes like a win, draw, or loss? We combine the Poisson distribution with the concept of an outer product.
The Poisson distribution provides the probability of each team scoring a certain number of goals (0, 1, 2, etc.) based on their predicted average . The outer product combines these individual probabilities into a matrix representing the probability of every possible scoreline.
import numpy as np
from scipy.stats import poisson
home_goals = 2.03
away_goals = 0.92
max_goals = 6
proba = [
[poisson.pmf(i, team_avg) for i in range(0, max_goals)]
for team_avg in [home_goals, away_goals]
]
matrix = np.outer(np.array(proba[0]), np.array(proba[1]))
print(matrix)
The resulting matrix might look something like this (values will differ slightly due to max_goals
range):
[5.23397059e-02, 4.81525295e-02, 2.21501636e-02, 6.79271682e-03, 1.56232487e-03, 2.87467776e-04],
[1.06249603e-01, 9.77496348e-02, 4.49648320e-02, 1.37892152e-02, 3.17151949e-03, 5.83559585e-04],
[1.07843347e-01, 9.92158794e-02, 4.56393045e-02, 1.39960534e-02, 3.21909228e-03, 5.92312979e-04],
[7.29739982e-02, 6.71360784e-02, 3.08825960e-02, 9.47066279e-03, 2.17825244e-03, 4.00798449e-04],
[3.70343041e-02, 3.40715598e-02, 1.56729175e-02, 4.80636136e-03, 1.10546311e-03, 2.03405213e-04],
[1.50359275e-02, 1.38330533e-02, 6.36320450e-03, 1.95138271e-03, 4.48818024e-04, 8.25825165e-05]
Each cell matrix[x, y]
contains the probability of the home team scoring x
goals and the away team scoring y
goals. We can conceptualize this matrix like a score grid:
["0-0", "0-1", "0-2", "0-3", "0-4", "0-5"],
["1-0", "1-1", "1-2", "1-3", "1-4", "1-5"],
["2-0", "2-1", "2-2", "2-3", "2-4", "2-5"],
["3-0", "3-1", "3-2", "3-3", "3-4", "3-5"],
["4-0", "4-1", "4-2", "4-3", "4-4", "4-5"],
["5-0", "5-1", "5-2", "5-3", "5-4", "5-5"]
- The diagonal cells (where home goals = away goals) represent draws. Summing these probabilities gives the overall probability of a draw.
- The cells below the diagonal (where home goals > away goals) represent home wins. Summing these probabilities gives the overall probability of the home team winning.
- The cells above the diagonal (where home goals < away goals) represent away wins. Summing these probabilities gives the overall probability of the away team winning.
home_team_win = np.sum(np.tril(matrix, -1))
draw = np.sum(np.diag(matrix))
away_team_win = np.sum(np.triu(matrix, 1))
print(home_team_win) # 0.6135
print(draw) # 0.2063
print(away_team_win) # 0.1620
For the Chelsea vs Leicester example, the probabilities calculated were roughly 61% for a home win, 21% for a draw, and 16% for an away win.
Conclusion
This might seem promising, but we’re competing against professional bookmakers. They use far more data, employ armies of quants and nerds, and set odds in their favor. And the Poisson model alone sucks. It ignores everything that matters in a game, such as injuries, suspensions, lineups, player form, weather, etc. It also assumes goals happen independently, which isn’t always true – a team might become more defensive after scoring or more attacking when losing, which violates key Poisson assumptions that events are independent.
Despite these limitations, the Poisson model is a good starting point. We can enhance it by adding features that better reflect the complexities of football. In a future post, I’ll show how to incorporate time-weighting so that more recent matches have a greater influence on the model’s parameters.
Stay tuned for the next piece – maybe we will get closer to beating the bookies!