Simon Hurst revs up his Excel Zone mini-series by using Power BI to tackle dodgy dates, numbers that don't add up and duplicates.
Excel 2010 introduced the Power Query add-in. In Excel 2016 the Power Query tools became an integral part of Excel as the Get & Transform group of the Data Ribbon tab. In addition, the completely free Microsoft Power BI Desktop application now includes the tools as part of the Get Data process. Although the obvious use of these tools is to connect to external data sources and process data for use in Excel analysis and presentation, their impact may be much more significant when they are used with the contents of a single Excel workbook.
This is the first part of a short series that will examine how these tools can replace a whole set of more traditional spreadsheet techniques. Note that, for convenience, we will refer to Power Query throughout, but Excel 2016 Get & Transform and Power BI Get Data operate in a similar way.
A trawl through past Excel Any Answers questions will reveal several recurring issues. Here, we will look at how Power Query can address issues of misbehaving dates, numbers that aren't numbers, duplicate entries and rounding.
A refreshing change
Before we get started, it's worth pointing out one very important difference between using cell-based Excel formulae and Power Query. As long as calculation is set to automatic in Excel, changes in any Excel cell will cause all the cells that depend on the changed cell to be recalculated immediately and automatically. In contrast, Power Query applies a set of steps on demand. Accordingly, changes are only reflected when the queries involved are refreshed. The refresh can be started manually, or set up to be performed automatically after a certain time interval, but it is not triggered by a change in the underlying cells.
Data in cells
This is the data we will be working with. It is just set up as a contiguous block of cells with column headings. We can see that our dates are American and, as such, are treated as text. One of our dates is also invalid. Our Product column includes a structure product code that we might need to split into its constituent parts, and the numbers in our value column have been entered as text rather than as values. Our last two entries are the same:
We could sort all of these issues out using standard Excel techniques and functions. Assuming that we need to apply the same processes repeatedly, as the data we are working with changes, we would probably need to create additional columns to take our cell entries and change them into the form that we need to work with. Our alternative approach will be to use a Query to apply a set of steps to our block of cells and create a separate output Table.
Reading our cells into a query
As we have mentioned, Power Query allows you to connect to external data sources via the New Query option. In our case, we want to use our range of cells within our worksheet. As we have avoided any blank rows or columns, we can click on any of our data cells and then use the From Table command. This will first turn our block of cells into an Excel Table and then open up the Query Editor:
Before we start addressing the various issues in our table we'll just set out how the Query Editor works. The end result of what we do in the editor is a program written in the M language which we can see by clicking the Advanced Editor command in the Query group of the Query Editor Home Ribbon tab.
However, we don't need to write any code ourselves, or even look at the code generated. As we perform actions, choose from dropdowns and the right-click menus, and use the extensive range of Ribbon commands, the query editor will construct a series of sequential steps. These steps are shown in the APPLIED STEPS pane shown above. Depending on the nature of a step, it can be deleted or edited and the steps can be re-ordered, although this should be done with care as steps often depend on previous steps, so can generate an error if their position is changed.
A great benefit of the step approach is the ability to examine the actions performed by any step. Clicking on a step will show our data as it is with all the steps applied up to the one we have selected.
The steps are applied in order every time our query is refreshed.
We'll start with our leftmost column. We need to accomplish two different things. First of all, we need to recognise our dates as valid 'English' dates. We will also show how to deal with an invalid date. We right-click in the Date column heading. We want to change our dates to the normal dd/mm/yy format so we choose Change Type, Using Locale and set our Data Type to Date and our Locale to the locale of the original data: English (United States):
We can see that this sorts out our valid dates, but also highlights that one of our dates wasn't valid in the original format:
We can see that the invalid date is on row 5. If we select our previous step: Changed Type, we can see that a value of 30 February has been entered. We can right-click on the date itself and choose the Replace Values option. This will show us the value in the selected cell and we can choose a replacement value, such as 2/29/16:
Note that, because we are inserting a step rather than adding a new step at the end, a warning will be displayed when we select Replace Values.
Our step will replace any values of 2/30/16 with 2/29/16 but will not address any other invalid dates.
Because our step is inserted before the Changed Type with Locale step, when we select that step we can see that it now correctly converts 2/29/16 to 29/02/16, avoiding the previous error for row 5.
Next, we'll demonstrate two different ways of splitting a text entry across multiple columns. First, we will separate out the first two characters of our Product column. Again, we right-click in the column heading and this time choose Split Column…, By Number of Characters… We can set the Number of characters to 2 and choose to perform the split either Once from right or left or Repeatedly.
In our case, we want to choose Once, as far left as possible. This creates two columns, one with the first two characters of the original column and the other with the rest. We want to further split this second column, this time using the dash as a delimiter. We right-click the column heading of our 'remainder' column and choose Split Column…, By Delimiter… We choose a Custom delimiter and enter the dash and choose to make the split: At each occurrence of the delimiter:
We can type new names into each of our three revised column headings. Depending on the contents of our columns, the query editor might take it upon itself to change the data type to Number. If we don't want to do this, we can just select the relevant 'Changed Type' step and delete it.
Numbers as text
When we first pulled our data into the Query Editor, it worked out that our Value column contained numbers rather than text and automatically changed the data type accordingly. We can also do this manually by right-clicking in the column heading and choosing Change Type.
Finally, for our column operations, we will right-click in our Value column heading and choose Transform, Round…, Round and choose the required number of decimal places, 2 in our case.
Our last row is duplicated. We need to be careful when handling duplicates. If we were simply to click in a column heading and choose Remove Duplicates, the Query Editor would remove all items that occurred more than once within just that column. We need to select all our columns before we right-click in a column heading and choose Remove Duplicates. The Editor will check for items where the entries are the same in all of the columns that we have selected:
Once we have set up all our steps and our data is in the format that we can work with, we can click the Close & Load command to load the result into a separate Excel Table. Refreshing this Table will rerun our query steps, using our original Table as the source, and so will bring in any new or revised values: