This is the fourth part of a series of articles that looks at how Excel's Power Query/Get & Transform tools can be used for a great deal more than just acquiring data from external sources.
The first part of the series included an overall introduction to how these tools work, what they are capable of and how they might require a different way of thinking compared to 'standard' Excel techniques.
Part two looked at using the Power Query tools to automate the process of appending several Tables of data into one single Table. Part three continued the consideration of appending Tables of data, but this time automatically including tables from particular files or folders.
Where parts two and three looked at joining Tables by appending the rows in one table to the rows in another, in this part we are going to consider how to merge two tables so that we end up with the same number of rows as in the first table, but with the ability to include columns from both tables in the final output. This is the equivalent of using Excel lookup and reference functions such as VLOOKUP() and the MATCH(), INDEX() combination.
As stated in the part one introduction, for convenience, we will refer to Power Query throughout, but Excel 2016 Get & Transform and Power BI Get Data operate in a similar way. Now that we are starting to cover some more advanced Power Query features, it's also worth mentioning that Power Query is updated monthly. If what you see differs from what is shown and described, it might be due to the version of the Power Query tools that you are using.
In the previous examples we started with basically the same data in different tables and we combined them into one long table by keeping the same columns but appending all the rows. This time we will be considering how to combine tables when we want to include all the rows in one of the tables and add some of the columns from the other table.
Content seriesView full content series
There are several possible reasons why our data might be split across multiple tables: we might be importing the tables from external sources or we might just want an efficient way of categorising existing transactions in some way. It is this latter case that we will base our examples on. Parts one and two used tables of sales invoices. We will use the same example but concentrate on just the UK table. We want to group our invoice lines in a couple of different ways; first of all, by product category and then by period. This will require two different approaches.
In this part we will just look at a method which is similar to using the exact form of VLOOKUP() in an Excel formula. In the next part we will look at using the approximate form of VLOOKUP() to create our period grouping.
As we've said, creating our product category grouping will be the equivalent of using the exact form of VLOOKUP(). We will have a table of products and, for each product, a corresponding category. We can then add a column to our invoices table that shows the category for each row. Our Categories Table deliberately doesn't include all of our products and does include some additional products.
As we did in Part one of the 'Tables to queries' section, we will first use the From Table command to turn our Tables into 'Connection Only' queries. We can then use New Query, Combine Queries. We choose Merge rather than Append as we did in part one. The Merge dialog allows us to choose each of our two tables and indicate the columns that join them:
The Join Kind is important. There are several different ways to join the tables and they all affect how unmatched items are treated. The default is 'Left Outer' which, as the text in parentheses explains, will include all the rows in the first (top) table but only matching rows from the second.
As we mentioned previously, tblCategories doesn’t include all of the products that we have invoiced. Of the 154 rows in tblUK, only 142 have a product that matches a ProductName in tblCategories. If we wanted to analyse our invoice total, it would be important that we included all our invoice lines, even if they didn't have a category, otherwise our total would be incorrect. Hence the choice of Left Outer. The other options are:
- Right Outer – the same as Left Outer but the other way around, so all rows from the second, rather than first, table.
- Full Outer – all rows from both tables regardless of whether they match.
- Inner – only rows that match.
- Left and Right Anti – these are much less common and just include rows that don't match. So, if we were to choose Left Anti, we could see the 12 rows where the product in tblUK wasn't included in tblCategories:
The join kind can be changed by clicking on the 'gear' icon next to the Source step. Here we have restored the join kind to Left Outer to include all our invoice rows:
Expand merged tables
A merged table in a query initial shows as a New Column containing 'Table', The Expand/Aggregate icon in the column heading displays a dialog that allows us to expand the table to show any or all of the columns that it contains, or to display an 'aggregate' by applying an aggregate function to any or all of the columns. This is because a table can contain multiple rows as well as multiple columns. We can see the contents of any of our Table 'cells' by clicking in the space next to the word Table. A pane will be displayed at the bottom of the screen:
In our case, our categories table only contains a single row for each product so when we choose to expand just the CategoryName column we will still just have 154 rows but now with the appropriate category information in an additional column. We know that 12 of our invoice rows don't have a product match in the categories table. These will show as 'null' in the CategoryName column. We can select that column and use the Replace Values option to change this to something more suitable such as 'Other':
Of course, the proper solution would be to add the missing products to tblCategories together with the category to which they actually do belong.
If we now choose to Close & Load our query we should see a table in our worksheet that includes all of our UK invoice lines with the Product category added as the rightmost column:
We could have grouped our results by category within our query or, for more flexibility, we can just use a PivotTable based on our table to create whatever summary we need:
You can download the finished example here:
Given the comparatively small number of records in both of our tables, and the simplicity of the join, we could have achieved more or less the same result by adding a column to our UK invoices table that used a VLOOKUP() formula, combined with some conditional logic to cope with non-matches.
The query approach has advantages where our data is already in Power Query and where we need to deal with much larger volumes of data. The query approach can be much quicker when tens, or hundreds, of thousands of rows are involved. The different join types available when merging the tables also make it easier to deal with more complicated types of match.
Next time we'll see whether or not Power Query can cope with the types of match that the approximate form of VLOOKUP() handles.
About Simon Hurst
Simon Hurst is the founder of technology training consultancy The Knowledge Base and is a past chairman of the ICAEW's IT Faculty.