Cutting Big Data down to size: Client broadband speeds
In the first part of this series we discussed possible definitions of Big Data and considered whether some of the technology and techniques involved could be of use to businesses of all sizes. As a relatively gentle introduction, this time we are going to use Microsoft Power BI Desktop to combine some internal accounting data with over 1.6 million rows of data downloaded from the Ofcom website.
Although by most definitions of Big Data, which can run into billions or even trillions of records and file sizes in terabytes and above, 1.6m rows is tiny, it is more rows than an Excel worksheet can hold.
Power BI Desktop
As we mentioned last time, because the availability of the Excel Power BI tools varies with the different versions and editions of Excel, we will also be using the freely available Power BI Desktop to demonstrate many of the techniques that we will be looking at. This can be downloaded here.
Power BI Desktop combines many of the BI capabilities recently added to Excel. Its ability to acquire and manipulate data is very similar to the tools available in Excel Power Query (or, from Excel 2016, the Excel Get & Transform group of the Data Ribbon tab). Further data processing can be carried out once the data has been made available in Power BI using tools from Power Pivot, such as the data analysis function language DAX (Data Analysis eXpressions), the creation of new Measures and the creation and management of relationships between different tables of data. Power BI reports combine many of the new types of charts included in Excel 2016 with the visualisation capabilities of Power View and many new features to create interactive visualisations of the data.
At this stage, we are just going to look at a few Power BI features rather than attempt to cover its extensive capabilities in detail. A recent series in Excel Zone has covered the Power Query aspects in some detail.
To demonstrate the ability to combine internal and external data, we are going to use some client information in an Excel spreadsheet and link this to publicly-available information on broadband speeds. To give this some kind of practical context, perhaps our business provides some sort of service that is affected by the availability of a fast broadband connection and we want to know which of our clients is likely to be able to take up the service.
For the purposes of this demonstration, our client data is a list of a dozen or so pubs in Sussex, but it should show how easily this could be replaced by real client data:
Our publicly-available data comes from the Ofcom site and is a set of 121 separate CSV files that together make up a 171Mb file with over 1.6m rows. It can be downloaded in zipped form here. The file is: Fixed Postcode 2016 ZIP, 48.4 MB
If you are using this file in a real project it would be advisable to read the accompanying 'About this data' notes to help understand how the data is structured and any limitations. As an example of the raw data, this is one of the files opened as a CSV file in Excel:
After unzipping the individual files to a separate folder (in our case E:\Broadband) we can start using Power Query to create a single table in the format that we need. Having installed the free version of Power BI Desktop we can launch the application and go to the Get Data option, either in the Home Ribbon tab or directly from the Get Started screen:
We use the (More,)File, Folder, Connect option to Browse to our Broadband folder:
We will see a list of some of the files in the folder and we can either Load the list of files to Power BI or use Combine to create a single table containing all of the data in all of the files. This will only work if all the files in the folder are of the same type and structure. We will use the Combine & Edit option to open the Query Editor to apply changes to our data, rather than loading it directly into Power BI.
We can specify which of our files to use as the example for how all the other files are to be imported and also specify how many rows to use to deduce the data types:
For our very simple example we just want the postcode (without spaces) column and the Average Download speed column. We've dragged the Average Download speed column closer to the postcode column so that we can easily select both, right-click on either and then choose to Remove Other Columns:
We can then tidy up our Average download speed column by using Home Ribbon tab, Transform group, Replace Values to replace the N/A values with 0 (this is not necessarily the right thing to do – in the data N/A has a different meaning from 0, but for our purposes we can treat them both as 0). We can then click on the ABC icon in the column header to set the data type to Decimal Number:
We can now use Home Ribbon tab, New Source, Excel to read in our table of client data from our Excel workbook. We need to use Home Ribbon tab, Transform group, Use First Row as Headers to promote the first row of our table to become the column headers, we can then use Replace Values again to remove all the spaces from our postcode column to match the way the postcode column in the broadband data source is structured. Using postcodes without spaces helps prevent issues with differences in the number of spaces between each element.
Now that we have both of our data tables loaded into the Query Editor we can use Home Ribbon tab, Combine group, Merge Queries to specify that we want to use the two postcode columns to link our two tables and that we want to include all the records from our first (clients) table and matching entries from our second (broadband) table. The Query Editor calculates the number of matches found:
We have renamed our merge step to the more meaningful ClientBroadband and clicked the Expand icon at the right side of the column heading to show the Average download speed:
We can now use the Close & Apply option to make our query available in Power BI.
We will look at the Power BI visualisation features in more detail next time, but for now we have just clicked on the Map visualisation and expanded it to fill the available space, then clicked on the expand button for our Client table and dragged Average Download speed to the Size area, Name to Legend and Postcode to Location in order to show our clients' available broadbands speeds by location: