In this final part, Jahan Zahid covers how to train our machine learning algorithm to predict the UK unleaded price from Brent crude oil.
We’ll be using a popular open source Python library called SciPy, which includes simple statistical routines such as linear regression.
For more advanced machine learning problems, e.g. image classification, I would recommend trying scikit-learn, which is itself built on top of SciPy.
As mentioned in the first part of this machine learning tutorial, we’ll be using a technique called linear regression.
The basic idea here is that we aim to find the line which “best fits” the points on our scatter chart:
Fitting our model
Before writing any code, it’s worth explaining what we mean by finding the line which “best fits” our points. You might recall from O Levels / GCSE / high school maths that a straight line can be defined by the following equation:
Y = mX + c
Where “m” represents the slope of the line, and “c” is the point at which the line intercepts the y-axis.
What linear regression does is find the slope and intercept (m, c) that minimise the sum of the squared vertical distances between the line and the points, which is why the method is often called “least squares”. For the purpose of this series we won’t go too deep into what that means.
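For readers who would like to see the idea in action, here is a minimal sketch of a least squares fit written out by hand with NumPy. The function names and the five data points are made up for illustration; they are not part of the petrol and crude oil data set.

```python
import numpy as np

def least_squares_line(x, y):
    """Return the slope m and intercept c of the least squares line Y = mX + c."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Standard closed-form solution for a straight-line fit
    m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    c = y_mean - m * x_mean
    return m, c

def sum_squared_error(x, y, m, c):
    """Sum of squared vertical distances between the points and the line."""
    return float(np.sum((np.asarray(y) - (m * np.asarray(x) + c)) ** 2))

# Made-up points, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

m, c = least_squares_line(x, y)
best = sum_squared_error(x, y, m, c)
# Nudging the slope away from the fitted value makes the error larger
worse = sum_squared_error(x, y, m + 0.1, c)
print(f"m = {m:.3f}, c = {c:.3f}, error = {best:.3f}, nudged error = {worse:.3f}")
```

Any other choice of slope or intercept gives a larger squared error than the fitted pair, which is all that “best fit” means here.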
After loading your Azure Notebook from last week, we can train our linear regression model with a few lines of code as follows:
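A minimal sketch of such a fit is shown here, assuming the regression is done with `scipy.stats.linregress`, which returns the five values discussed in the rest of this section. It uses made-up stand-in data rather than last week's `data` variable, so the numbers it prints will not match the ones quoted in this article; the variable names are assumptions too.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in data (NOT the real prices): swap in the crude oil
# and petrol columns from last week's notebook.
rng = np.random.default_rng(0)
crude_usd = rng.uniform(30, 120, size=200)
petrol = 0.5 * crude_usd + 70 + rng.normal(0, 5, size=200)

# Fit a straight line: petrol = slope * crude_usd + intercept
model = stats.linregress(crude_usd, petrol)

print(f"slope:          {model.slope:.3g}")
print(f"intercept:      {model.intercept:.3g}")
print(f"r-value:        {model.rvalue:.3g}")
print(f"p-value:        {model.pvalue:.3g}")
print(f"standard error: {model.stderr:.3g}")
```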
This gives us back a model which has the following values (to three significant figures):
(Note: 8.80e-163 here means 0.000...00088, where there are 162 zeros between the decimal point and 88, i.e. a very small number.)
We’ve covered what the slope and intercept represent. The other values give us some indication of how “reliable” the model is.
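The r-value is the correlation coefficient between the crude oil and petrol prices. It measures how closely the points follow a straight line: an r-value of 1 (100%) tells us that the model fits the data perfectly, while 0 means there is no linear relationship at all. We have an r-value of 0.787 (78.7%), which is not too bad. Squaring it gives roughly 0.62, meaning the line accounts for about 62% of the variance in petrol prices.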
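To make the r-value more concrete, the sketch below computes the fraction of variance a fitted line accounts for and checks that it equals the square of the r-value reported by `scipy.stats.linregress`. The data is synthetic and the variable names are assumptions for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in data (NOT the real prices)
rng = np.random.default_rng(1)
x = rng.uniform(30, 120, size=200)
y = 0.5 * x + 70 + rng.normal(0, 10, size=200)

fit = stats.linregress(x, y)
predicted = fit.slope * x + fit.intercept

# Variance left over after the fit, compared with the total variance
ss_residual = np.sum((y - predicted) ** 2)
ss_total = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_residual / ss_total

print(f"r-value = {fit.rvalue:.3f}, r-value squared = {fit.rvalue ** 2:.3f}")
print(f"fraction of variance explained = {r_squared:.3f}")
```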
The p-value is the probability of seeing a relationship at least this strong purely by chance if the null hypothesis (i.e. there is no relationship between petrol and crude oil prices) were true. A low p-value (usually less than 0.05) tells us that the relationship is statistically significant.
In our case we have a very low p-value, so we can confidently say there is a significant relationship between the petrol and crude oil price (as expected).
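One way to build intuition for the p-value is a shuffle test: break any real link between the two series by shuffling one of them, and count how often chance alone produces a correlation as strong as the one observed. This is a rough sketch on synthetic data, not the calculation SciPy performs internally, and all names in it are assumptions.

```python
import numpy as np

# Synthetic stand-in data (NOT the real prices)
rng = np.random.default_rng(2)
x = rng.uniform(30, 120, size=100)
y = 0.5 * x + 70 + rng.normal(0, 10, size=100)

observed_r = abs(np.corrcoef(x, y)[0, 1])

n_shuffles = 2000
as_extreme = 0
for _ in range(n_shuffles):
    # Shuffling y destroys any genuine relationship with x
    shuffled_y = rng.permutation(y)
    if abs(np.corrcoef(x, shuffled_y)[0, 1]) >= observed_r:
        as_extreme += 1

estimated_p = as_extreme / n_shuffles
print(f"observed |r| = {observed_r:.3f}, estimated p-value = {estimated_p}")
```

With a strong relationship, few or none of the shuffles will match the observed correlation. An estimate of 0 here only means the true p-value is smaller than 1 in 2,000.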
The standard error reported here is the standard error of the estimated slope: it tells us how much the slope might change if we fitted the line to a different sample of prices. The smaller the standard error, the more precisely the data pins down the slope.
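One practical use of the standard error, assuming the model comes from `scipy.stats.linregress` (where `stderr` is the standard error of the slope), is to turn it into an approximate 95% confidence interval for the slope. The data below is synthetic and the variable names are assumptions.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in data (NOT the real prices)
rng = np.random.default_rng(3)
x = rng.uniform(30, 120, size=200)
y = 0.5 * x + 70 + rng.normal(0, 10, size=200)

fit = stats.linregress(x, y)

# Two-sided 95% interval from the t distribution with n - 2 degrees of freedom
t_critical = stats.t.ppf(0.975, df=len(x) - 2)
slope_low = fit.slope - t_critical * fit.stderr
slope_high = fit.slope + t_critical * fit.stderr

print(f"slope = {fit.slope:.3f}, 95% interval = ({slope_low:.3f}, {slope_high:.3f})")
```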
Ultimately, we are free to use whatever machine learning model we create, but it’s important to understand how reliable it is, using metrics like the above. When getting into more advanced machine learning problems such as classification, one common way to test a model’s reliability is by calculating the area under a “receiver operating characteristic” curve (commonly known as a ROC curve).
Visualising the result
From the above model we’re able to immediately get the following result:
(UK Petrol Price in pence per litre) = 0.529 * (Brent Crude Oil Price in USD) + 70.5
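As a quick illustration, the equation can be wrapped in a small function to turn a crude oil price into a predicted petrol price. The function name and the example crude prices are made up; only the slope and intercept come from the model above.

```python
SLOPE = 0.529      # slope from the fitted model
INTERCEPT = 70.5   # intercept from the fitted model

def predict_petrol_price(brent_usd):
    """Predicted UK petrol price, in the units of the equation above."""
    return SLOPE * brent_usd + INTERCEPT

# Illustrative crude oil prices only, not forecasts
for brent in (40, 60, 80):
    predicted = predict_petrol_price(brent)
    print(f"Brent at {brent} USD -> predicted petrol price {predicted:.1f}")
```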
We can plot this as follows:
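A minimal plotting sketch with Matplotlib might look like the following. It assumes Matplotlib is available in the notebook and uses synthetic points scattered around the fitted line, so the picture only illustrates the layout; to reproduce the real chart, swap in the crude oil and petrol columns from last week's `data` variable.

```python
import numpy as np
import matplotlib.pyplot as plt

SLOPE, INTERCEPT = 0.529, 70.5   # values from the fitted model

# Synthetic stand-in points (NOT the real prices)
rng = np.random.default_rng(4)
crude_usd = rng.uniform(30, 120, size=200)
petrol = SLOPE * crude_usd + INTERCEPT + rng.normal(0, 8, size=200)

fig, ax = plt.subplots()
ax.scatter(crude_usd, petrol, s=10, alpha=0.5, label="Prices (synthetic)")

# A straight line only needs its two end points
line_x = np.array([crude_usd.min(), crude_usd.max()])
ax.plot(line_x, SLOPE * line_x + INTERCEPT, color="red", label="Fitted line")

ax.set_xlabel("Brent crude oil price (USD)")
ax.set_ylabel("UK petrol price")
ax.legend()
```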
As we can see here, our model fits the data fairly well. Note we’ll need to load the variable “data” from last week in order to plot this alongside the line.
Congratulations if you have made it this far!
To take this further it would be a good exercise to investigate ways in which we can improve this model. For example, restrict the range of dates of petrol and crude oil prices. Are there other ways you can think of for improving this model?
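As a starting point for the date range idea, here is a rough sketch that refits the line using only rows inside a chosen window. The weekly dates, the prices and the `fit_between` helper are all made up for illustration, so with this synthetic data the restricted fit is not expected to be any better; the point is only to show the mechanics.

```python
import numpy as np
from scipy import stats

# Hypothetical weekly dates and synthetic prices (NOT the real data)
dates = np.arange(np.datetime64("2010-01-04"),
                  np.datetime64("2018-01-01"),
                  np.timedelta64(7, "D"))
rng = np.random.default_rng(5)
crude_usd = rng.uniform(30, 120, size=len(dates))
petrol = 0.5 * crude_usd + 70 + rng.normal(0, 8, size=len(dates))

def fit_between(start, end):
    """Fit the line using only rows with start <= date < end."""
    mask = (dates >= np.datetime64(start)) & (dates < np.datetime64(end))
    return stats.linregress(crude_usd[mask], petrol[mask])

full_fit = stats.linregress(crude_usd, petrol)
recent_fit = fit_between("2015-01-01", "2018-01-01")

print(f"all dates:    slope = {full_fit.slope:.3f}, r-value = {full_fit.rvalue:.3f}")
print(f"2015 onwards: slope = {recent_fit.slope:.3f}, r-value = {recent_fit.rvalue:.3f}")
```

Comparing the r-value and standard error of the two fits gives a simple way to judge whether a narrower window describes the relationship better.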
Click here to see the code we covered today.
About Jahan Zahid
Jahan is the founder and CEO of Indigo Cashflow, an app created to take the pain out of cash flow forecasting. Prior to this he ran a successful software consultancy, developed FX trading models at Bank of America Merrill Lynch using machine learning techniques, and worked as a Mathematics Lecturer at Bristol University. He holds a PhD in Mathematics from Oxford University.