
The Surprising Effect of Dixon-Coles Weighting Function on Football Predictions

In my last post, about predicting football match outcomes with the Poisson model, I mentioned that the model is too naive and ignores many factors that matter in the game. For instance, it doesn’t account for changes that happen within teams over time, such as new players and coaches.

In 1997, Dixon and Coles proposed a solution to this issue. They concluded that recent games are more relevant than older ones and developed a formula that weights recent games more heavily.

The weighting function they used is:

\phi(t) = \exp(-\xi t)

Where t is the number of days since the game was played, and ξ is a parameter that controls how much we care about newer games. A higher ξ means older games are discounted more quickly, so newer games matter more.

Dixon and Coles used half-weeks as their time unit, assuming that most games are played on weekends, and found that 0.0065 was the best value for ξ. However, nowadays games are played on any day of the week, so I will use days instead. That means dividing 0.0065 by 3.5, since a half-week is 3.5 days, which gives roughly 0.00186.
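To get a feel for what this value of ξ does, here is a quick check of how much weight a match gets at different ages, simply plugging numbers into the formula above:

import numpy as np

xi = 0.00186
print(np.exp(-xi * 0))    # 1.00, today's match gets full weight
print(np.exp(-xi * 365))  # ~0.51, a match from a year ago counts for about half
print(np.exp(-xi * 730))  # ~0.26, a two-year-old match counts for about a quarter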

Here is how I implemented the Dixon-Coles weighting function in Python:

import numpy as np

def calculates_weights(dates, xi=0.00186):
    latest_date = dates.max()  # Use the most recent match date as reference
    diff = (latest_date - dates).dt.days  # Get the time difference in days
    return np.exp(-1 * xi * diff)  # Apply the exponential function

This function takes a pandas Series of match dates and calculates a weight for each one, using the most recent date in the data (rather than today’s date) as the reference point together with ξ.
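For example, calling it on a small Series of made-up dates (the output values below are approximate):

import pandas as pd

dates = pd.Series(pd.to_datetime(["2023-08-12", "2024-02-10", "2024-05-19"]))
calculates_weights(dates)
# 0    0.59
# 1    0.83
# 2    1.00
# dtype: float64

The most recent match in the data gets a weight of exactly 1, and everything else decays from there.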

To use these weights in the glm function, I had to duplicate each match row as I did in my previous post. That’s because each match has two independent observations: one for the home team and one for the away team.
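As a rough reminder of what that duplication looks like, here is a minimal sketch; the real create_model_data is in the previous post, and the column names here are just my assumptions:

import pandas as pd

def create_model_data_sketch(home_team, away_team, home_goals, away_goals):
    # Each match becomes two rows: the home side's goals and the away side's goals
    home_rows = pd.DataFrame(
        {"team": home_team, "opponent": away_team, "goals": home_goals, "home": 1}
    )
    away_rows = pd.DataFrame(
        {"team": away_team, "opponent": home_team, "goals": away_goals, "home": 0}
    )
    return pd.concat([home_rows, away_rows])

Because the home rows come first and the away rows second, the weights have to be stacked in the same order, which is presumably why they are concatenated with themselves in the script below.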

Now let’s validate our weighting function to see how it affects the accuracy of the Poisson model:

import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
# Check previous post to see how to implement these functions
from prediction_functions import create_model_data, fit_model, predict


def calculates_weights(dates, xi=0.00186):
    latest_date = dates.max()
    diff = (latest_date - dates).dt.days
    return np.exp(-1 * xi * diff)


df = pd.read_csv("data/Premier-League.csv").assign(
    Date=lambda df: pd.to_datetime(df.Date)
)

train_data, test_data = train_test_split(df, random_state=0)

model_df = create_model_data(
    train_data.HomeTeam, train_data.AwayTeam, train_data.FTHG, train_data.FTAG
)

# Weights for the training matches, duplicated to line up with the duplicated match rows
weights = calculates_weights(train_data.Date)
weights = pd.concat([weights, weights])

model = fit_model(model_df, weights)

# Make predictions using the test set
predictions = predict(test_data.HomeTeam, test_data.AwayTeam, model)
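The script stops at the predictions, so to get the accuracy numbers below I score them against the full-time results. This assumes that predict returns 'H'/'D'/'A' labels and that the CSV has the usual FTR (full-time result) column:

# Score the predicted outcomes against the full-time results
accuracy = accuracy_score(test_data.FTR, predictions)
print(accuracy)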

I tried different values for ξ, but to my surprise, the time weighting only made things worse: the accuracy of the predictions dropped from 0.55 to 0.52.

Well, experiments are not always successful. Sometimes they fail, and that’s OK. Finding the right answer requires exploring and testing hypotheses, and failing is part of that process.

So I am not giving up on predicting football match outcomes, and I have some ideas for future experiments. One is to try machine learning models, such as logistic regression or random forests, which can learn more complex patterns in the data. I will keep you updated.