Power Query techniques: Merging tables of data part two
Simon Hurst looks at advanced merges in Power Query, including joins that involve more than a single field and approximate lookups, using practical examples, and he considers some of the issues and problems that you might come across when merging tables.
Note: The Power Query tools started off as an optional add-in for Excel version 2010 and 2013 before they became an integral part of Excel 2016 as the Get and Transform group of the Data Ribbon tab. The same tools can be found in the external data group of the Home Ribbon tab of Power BI desktop. Throughout this series we will refer to all of these tools as 'Power Query' and base our instructions on the use of Excel 2016.
Using multiple fields to join tables
In the previous article we looked at merging two tables using a single field to make the link. It is possible to join tables where multiple fields are required to identify the linked rows.
In this example, we have created budgets by country, by salesperson and by product. We want to compare our budgets to our actual performance by summing our actual invoice values in the same way. The merge operation starts in the same way as a simple merge using Home Ribbon tab, Combine, Merge Queries dropdown, Merge Queries as New.
We have already created queries for our Budgets table and our Invoices and we choose Budgets as the first table and Invoices as the second. We can then select each pair of fields, holding down the control key to allow us to select more than a single pair. It is important to select the fields in the correct order; a small number appears in each field header to show which fields are linked to which:
Last time, we expanded the table column created by the merge operation to show additional information in multiple rows where necessary. In this case, we can click the icon to the right of the column heading and use Aggregate to sum our ExtendedPrice values in a single cell to create our budget vs actual comparison:
A straightforward merge in Power Query is based on an exact match, either our values in each table match or they don't.
The Excel lookup and match functions allow for an additional type of match: an approximate match. In an approximate match, the value in the first table matches the largest value in the second table that is less than or equal to the first table value (the MATCH() function also allows for a smallest larger, rather than largest smaller, approximate match).
You will find fairly complicated sections of programming code to create approximate Power Query matches but such matches can also be achieved using more standard Power Query techniques.
It helps to think of an approximate match in a slightly different way. Here is an example based on the one we used in the recent EZ guide to lookups. The task is to find which band of discount rates each of our sales values falls into.
Industry insightsView more
In Excel we could use an approximate VLOOKUP() to retrieve the discount rate values:
=VLOOKUP([@[Total sales value]],DiscountBands,2,TRUE)
In Power Query, rather than thinking of this in terms of a traditional lookup, we could imagine our table of sales values sorted in ascending order with a column added to show the discount rate where the minimum sales value matches the total sales value.
If we were to copy our discount rate down to fill in the intervening, blank, cells in this imaginary column we would have added the correct discount rate to each sales value. However, there is a complication. This would work well if we knew that there was a match for each of our minimum sales values, but we can't be sure of this. So we need to include in our imaginary column any unmatched minimum sales values in their correct position.
To do this, we start off with a normal match from the minimum sales column to the total sales value:
The key is the Join Kind that we have chosen. Rather than the normal Left or Right Outer join that keeps all the rows in one of our tables and only the matching rows in the other, we have used a Full Outer join. This includes all rows from both tables, whether or not they match.
This is the result after expanding both the columns in the DiscountRate table. We can see that, in fact, none of our values match hence the null values in each of our columns:
Next, we need to start creating our imaginary table as an actual set of Power Query columns. First, we use Add Column, Conditional Column to create a single column that replaces the nulls with the values from the other column:
We then sort this column in ascending order:
We can click on our Discount rate column and use Transform Ribbon tab, Any Column group, Fill Down to copy the existing values down, replacing the null values:
We can then filter out the rows with null in the Total Sales value column to remove the rows that only exist in our DiscountBands table, and then remove the 'DiscountBand.Minimum sales' and 'All Values' columns to leave us with our original sales values with the matching discount rate:
Like almost everything else in Power Query, matches are case sensitive. In the example below, two of our products have not been matched because the letters have been entered in a different case in the two merged tables:
This can be avoided by using Transform Ribbon tab, Text Column group, Format dropdown, UPPERCASE to convert each code column to uppercase.
For matches to work successfully, the data type of each column must be the same. Here we have corrected the capitalisation issue in our original data tables but have left the Code column in Products set to 'Any' and changed the Code column in ProductCategories to 'Text'.
Although both the 'Any' and 'Text' data type allow text, number and mixed entries, when merged, the codes containing only numbers are not matched.
Changing both columns to 'Text' leads to all columns being matched correctly.
Simon Hurst is the founder of technology training consultancy The Knowledge Base and is a past chairman of the ICAEW's IT Faculty.