Our goal in the next two parts of this series will be to train a machine learning algorithm to predict the UK unleaded petrol price from the price of Brent crude oil.
This is what we call a “supervised” machine learning problem, as we are given a training set containing both features (crude oil prices) and their labels (unleaded petrol prices).
The other category, “unsupervised” machine learning, looks at data sets which don’t contain labels and seeks to find hidden patterns in the data. A famous example of this was when a farm of Google computers taught itself to recognise cats from 10 million YouTube stills.
Introducing the problem
Our dataset will use the weekly (ultra low sulphur) unleaded petrol prices (pence per litre) published by the UK Department for Business, Energy & Industrial Strategy between June 2003 and February 2018. We’ve linked this with Brent crude oil prices ($ per barrel) sourced from the U.S. Energy Information Administration.
It’s worth noting that we’ve simplified the problem somewhat by leaving out a number of other features, such as the UK petrol duty rate, the UK VAT rate (which ranged between 15% and 20% during this period) and the GBP/USD exchange rate. The dataset can be downloaded here.
Once you’ve downloaded this data, go ahead and upload it to your Azure Notebooks library just like before. From here we can open a notebook and load the data as follows:
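The original snippet isn't reproduced here, so what follows is only a minimal sketch of one way to load the file. It assumes pandas is available in the notebook, that the data is a CSV with columns named date, crude_oil and petrol (the date column name is an assumption), and it uses a hypothetical file name.

```python
import pandas as pd


def load_prices(source):
    """Load the weekly price data into a DataFrame sorted by date.

    `source` can be a file path or any file-like object.
    Assumes columns named "date", "crude_oil" and "petrol".
    """
    # parse_dates converts the (assumed) "date" column into datetime values
    df = pd.read_csv(source, parse_dates=["date"])
    # Keep only the three columns used in this series
    df = df[["date", "crude_oil", "petrol"]]
    # Sort oldest to newest and renumber the rows from zero
    return df.sort_values("date").reset_index(drop=True)


# Hypothetical file name; use whatever you called your upload.
# data = load_prices("petrol_prices.csv")
# data.head()
```

Calling data.head() in a notebook cell should show the first few weeks, which is a quick way to confirm that the columns came through as expected.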
Visualising the data
To begin to understand the relationship between our crude_oil and petrol variables, it would be helpful to visualise the data we have here (768 weeks in total) on a scatter chart.
This is something we can already do in Excel, so let’s go a little further. For each dot on the scatter chart (x = crude_oil, y = petrol) we’ll also apply a colour scale depending on the date. By doing this it will be much easier to see trends over time (if there are any).
The code for this is as follows:
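The original listing isn't reproduced here, so this is only a sketch of the idea rather than the exact code. It assumes a Plotly-style scatter trace, built as a plain dictionary, in which the “marker” attribute carries the colour scale. The function names, the "Viridis" colour scale and the axis titles are all illustrative choices, not taken from the original.

```python
def build_scatter_trace(dates, crude_oil, petrol):
    """Return a Plotly-style scatter trace (a plain dict) coloured by date.

    `dates` must contain date or datetime objects.
    """
    # A colour scale needs numbers, so map each date to its ordinal day count
    colour_values = [d.toordinal() for d in dates]
    return {
        "type": "scatter",
        "mode": "markers",          # dots only, no connecting lines
        "x": list(crude_oil),
        "y": list(petrol),
        "marker": {
            "color": colour_values,   # one number per dot
            "colorscale": "Viridis",  # illustrative choice of scale
            "showscale": True,        # draw the colour bar next to the chart
        },
    }


def build_layout():
    """Return axis titles matching the units used in the dataset."""
    return {
        "xaxis": {"title": "crude_oil ($ per barrel)"},
        "yaxis": {"title": "petrol (pence per litre)"},
    }
```

Assuming Plotly is installed, passing {"data": [trace], "layout": build_layout()} to plotly.graph_objects.Figure should produce the chart. Note that the colour bar in this sketch is labelled with day numbers rather than readable dates.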
This is a bit more involved than our previous plots, as it contains a “marker” attribute which we’ve used to define the colour scale we want.
Adding this colour scale turned out to be worth the effort as we can see an interesting clustering effect occurring between crude_oil and petrol depending on the date range.
Next week we’ll dig into this data further and apply a fairly common machine learning technique called “linear regression” to describe the relationship between the crude oil and unleaded petrol prices. An important aspect of this will be that we also get a level of confidence in the strength of the relationship between these two variables.
Click here to see the code we covered today.
About Jahan Zahid
Jahan is the founder and CEO of Indigo Cashflow, an app created to take the pain out of cash flow forecasting. Prior to this he ran a successful software consultancy, developed FX trading models at Bank of America Merrill Lynch using machine learning techniques, and worked as a Mathematics Lecturer at Bristol University. He holds a PhD in Mathematics from Oxford University.