Simon Hurst gets his teeth into the different characteristics of Excel's aggregate and summary functions.
Adding up is one of the most common things to do in a spreadsheet. At its simplest, this might involve using the + operator to add up the contents of a couple of cells but, for anything more complicated, one of Excel's aggregate or conditional sum functions is likely to be necessary. When working with significant volumes of data, the best approach might be to use a PivotTable to summarise the data. Specific Excel functions are available to work with individual totals from a PivotTable or even directly from the underlying data. Adding is just one of a range of calculations that can be performed on a set of values and some of the functions that generate totals can also be used to perform these other calculations.
In this short series we’ll look at the various different Excel functions and features that can be used for adding up and otherwise evaluating values. In this first part we will start with simple sums and progress to the more flexible aggregate functions.
Aggregate functions
SUM()
Although almost everyone is familiar with the basic use of the SUM() function, even this function has a few points of interest. SUM() is usually used with a single argument to add up a contiguous range of cells but the function can deal with up to 255 separate arguments. At first glance, you might think that =SUM(A1,A3) and =A1+A3 would be interchangeable, but they can give very different results because of the way that they each process text. SUM() will ignore a text entry whereas + will attempt to evaluate it:
As you can see in the example above, trying to use + where one of the cells contains text will result in an error, but using each cell reference as a separate argument of the SUM() function will not. The difference goes further, if the text entry was made up of numbers, for example a year heading for a column entered as text, + would convert the text value to a number and include it in the total – a process known as coercion.
In practice, this would mean that you could use SUM() to create a consistent formula in a column of data where you need to include a reference to a heading cell.
SUBTOTAL()
SUM() is one of a family of aggregate functions each of which calculates a single value from a range, for example: AVERAGE(), COUNT(), MAX() and MIN(). There are two additional aggregate functions which can do the same thing, but with more flexibility. SUBTOTAL() is available in all recent versions of Excel and is the function that will be used if you use the Data, Subtotal command to subtotal values in a block of cells. Although SUBTOTAL() can produce a simple total in the same way as SUM(), it has an additional argument that can be used to define the type of aggregate and also how values in hidden rows (but note, not hidden columns) should be treated. The first SUBTOTAL() argument is a number between 1 and 11 or between 101 and 111. The number from 1 to 11 defines the type of aggregate, with the 100+ range including the same aggregates but ensuring that hidden rows are not included in the total:
In this example we have hidden row 4 using the Data, Group feature and we can see how SUBTOTAL() with its first argument set to 109 or 9 behaves differently and how it compares to SUM():
Note that, as this example shows, the argument of 109 will exclude values in rows that are hidden either using the Hide command or by 'collapsing' a group of rows. Values that are in rows that are 'hidden' by applying a filter will always be included in SUM() calculations and excluded from SUBTOTAL() whichever of the SUBTOTAL() argument numbers is used:
There is another key difference between using SUM() and SUBTOTAL(). If a SUM() range includes cells that contain other SUM() or SUBTOTAL() calculations, then they will be included in the total, potentially resulting in values being double-counted. SUBTOTAL() will ignore other SUBTOTAL() cells but will include SUM() cells:
There is another, less obvious, difference between SUM() and SUBTOTAL(). SUM() can include '3D' references that refer to the same range of cells on multiple contiguous sheets. Using a 3D reference in a SUBTOTAL() formula will generate a #VALUE! error:
3D references can be created by clicking on the first sheet tab, then clicking on the second sheet tab while holding down the Shift key. You can then let go of the shift key and click on the required cell. The resulting formula will then include the values in the designated cell of each of the specified sheets and all the sheets positioned between them.
AGGREGATE()
Excel 2010 introduced an even more flexible aggregate function named, appropriately enough, AGGREGATE(). AGGREGATE() provides many additional calculation options and more control over the other types of cell values to avoid. The following calculation functions are added:
- MEDIAN
- MODE.SNGL
- LARGE
- SMALL
- PERCENTILE.INC
- QUARTILE.INC
- PERCENTILE.EXC
- QUARTILE.EXC
Rather than the value of the first argument combining the type of calculation and the treatment of hidden rows as for SUBTOTAL(), AGGREGATE() has separate arguments for each, allowing 8 different variations to be provided for:
The percentile and quartile options are worthy of further explanation. I’m not a statistician so if you have a clearer and simpler explanation, or if I’ve just got it completely wrong, please feel free to add an explanatory comment to this article.
Percentiles and quartiles relate to a set of values and how individual values are ranked within the set. For quartiles, the range of values is, in effect, divided into 4 and the quartile function returns the minimum value that would need to be attained to rank in each section. For the third quartile it would return the minimum value needed to rank in the top 25% of values, for the second it would be the minimum value for inclusion in the top 50% (this would be the same as the median value) and for the first quartile it would be the minimum value for inclusion in the top 75% of values. Where necessary, the functions will interpolate to determine the value as you will see in the following example. QUARTILE.INC() also allows the use of 0 and 4 for the quartile argument, these will return the lowest and highest values respectively, and will therefore return the same values as MIN() and MAX().
Percentiles work in exactly the same way as quartiles except that, rather than just dividing the set of data into 4 sections, they divide it into 100 sections. This means that the 25th percentile is equal to the first quartile, the 50th percentile to the second quartile and the 75th percentile to the third quartile. If a value lies in the 99th percentile, it is higher than 99% of the other values.
Hopefully, this example will help explain the concept for fellow non-statisticians: