Save content
Have you found this content useful? Use the button above to save it to your profile.
vacuum cleaner
istock_tanya-stock_vc

The importance of cleaning up data

by

In this new content series, Excel MVP Liam Bastick goes through the steps you should follow to turn data into useful information. The first article in this series focuses on the importance of cleaning up data.

30th Jan 2020
Save content
Have you found this content useful? Use the button above to save it to your profile.

Many analysts use Excel to summarise cleaned-up data and convert it into information, with structured (sorted and/or ordered) referencing, so that users may understand their numbers, identify trends and produce various reports.

To understand what I mean, as an example, we may receive data like this:

Excel table with data

and we may turn it into something more informative, such as this:

Informative table

Each attribute (known as a field) in a table typically contains a category of data. A crosstab query (similar to the one pictured, above) summarises the data from one or more of these fields that are separated into groups based on one or more fields.

Essentially, a crosstab query summarises data in more than one dimension to provide insight that may not be immediately forthcoming otherwise. For example, given the source data, how could you possibly notice that Sales for Quarter 4 in the South are very low relative to other quarters and regions?

Excel has many useful tools that will assist, centred on its version of an enhanced crosstab query, known as a PivotTable. However, the subject here is not PivotTables.

There is an important first stage I want to discuss here: that is, turning data into information we can use – and that requires cleaning up our data.

Cleaning your data

Data is central to everything we do in Excel. When you receive data for analysis, you don't always have control over the format, structure, positioning/placement, appearance and/or type of data that you import from an external data source, such as a database, text file or a web page.

Before you can analyse the data, you often need to clean it up. Excel has many features that may assist. Sometimes, the task is straightforward and there is a specific feature that does the job for you. For example, you can spell check, remove duplicates or remove all emboldening.

At other times, you may need to manipulate one or more columns by using a formula to convert the imported values into new values by either replacing certain text strings, or removing non-printable characters or excess spaces.

For instance, the following is a particularly important point. Data cleansing is important. SUM, SUMIF, SUMIFS and PivotTables are all affected by the following issue. Consider the following SUM illustration:

SUM Excel

In this example (above), the values in cells E3:E7 have been totalled in two distinct ways: the first uses SUM and the second uses the ‘+’ operator:

Excel table

In this slightly amended example (above), cell E5 has been modified. It has been stored as text, even though it looks like the number 3. SUM treats this as having zero value whereas the more convoluted addition carries on regardless. You might think you can spot the issue in such a small dataset, but what about the one below?

Excel table

There is a simple way to check using the COUNT function. COUNT counts the number of numbers in a range, so we can use it to spot numbers that aren’t numbers:

Count Excel

Here, the formula in column I highlights when a number is not a number. Note how it reports by exception: if the cell in question contains a number then COUNT(Cell_Reference) equals 1 and =1-COUNT(Cell_Reference) equals zero. Only non-numbers will be highlighted – it’s better to know I have two errors rather than, say, 14,367 values working correctly.

SUM, SUMIF, SUMIFS and PivotTables all treat numbers stored as text as zero in mathematical aggregations. We need to clean the data so that the data that appears to be numerical is indeed numerical.

That’s not the only type of cleaning we should consider. Often, you should consider Get & Transform on the ‘Data’ tab for manipulating your data in Excel. This powerful Extract, Transform and Load (“ETL”) tool is especially useful for creating connections to data (“extraction”), manipulating as you require (“transformation”) and linking / storing it in Excel (“loading”).

Not only is it powerful, it may be parametrised and repeated so that you don’t keep having to make similar adjustments to your data every time it is updated.

In the meantime, here are some steps you can take “generically” when you wish to clean data in Excel:

  1. Import the data from an external data source
  2. Create a backup copy of the original data in a separate workbook
  3. Ensure that the data is in a tabular format of rows and columns with similar data in each column, all columns and rows visible, and no blank rows within the range
  4. Consider putting your data in a Table (on ‘Insert’ tab of the ribbon, select “Table’ from ‘Tables’ grouping, Ctrl + T)
  5. Do tasks that don't require column manipulation first, such as spell-checking (on the ‘Review’ tab of the ribbon, select ‘Spelling’ from the ‘Proofing’ grouping, Alt + R + S or F7) or using the ‘Find and Replace’ (select ‘Replace…’ from the drop-down list in ‘Find & Select’ in the ‘Editing’ grouping of the ‘Home’ tab of the ribbon, Ctrl + H) dialog box
  6. Next, do tasks that do require column manipulation. The general steps for manipulating a column are:
    1. Insert a new column (call it “column 2”) next to the original column (call it “column 1”) that needs cleaning
    2. Add a formula that will transform the data at the top of the new column (column 2)
    3. Fill down the formula in the new column (2). In an Excel table, a calculated column is automatically created with values filled down
    4. Select the new column (2), copy it and then paste as values into the new column (2)
    5. Remove the original column (1), which converts the new column from column 2 to column 1.

If for some reason, Get & Transform is not a preferable approach, to periodically clean the same data source, consider recording a macro or writing code to automate the entire process.

In the next article in this series, we look at some of the techniques you can follow to deal with data cleansing issues.

Tags:

Replies (0)

Please login or register to join the discussion.

There are currently no replies, be the first to post a reply.