Cutting Big Data down to size


This is the first in a series that will consider the relevance of 'Big Data' to businesses of all sizes. By some definitions, this is not even a sensible subject to consider within Excel Zone.

Wikipedia defines Big Data as: "a term for data sets that are so large or complex that traditional data processing application software is inadequate to deal with them." Other definitions even state specifically that Big Data is too big to be handled by spreadsheets.

Although only the largest businesses and organisations are likely to have the computing power and expertise to process 'real' Big Data, there are many lessons and techniques from Big Data that can be applied to any business. We will use the tools available within Excel, and also the separate and free Microsoft Power BI desktop application, to see how Big Data techniques could be relevant to us all.

What does Big Data mean?

There doesn't appear to be a single, generally agreed definition of Big Data, but the one thing that most commentators seem to agree on is that it's not all to do with size. The amount of data is important, but other factors also help distinguish 'Big Data', and many of these involve the techniques used to extract and present the useful information that lies within the data. It is these techniques that we can consider applying to much more manageable sets of data.

Size

Of course, many Big Data projects do involve very large amounts of data, often terabytes or petabytes. Data on this scale would take far too long to process on desktop computers and instead requires supercomputers or parallel processing. As well as the absolute size of the data in bytes, another characteristic of big data is the use of all the data, rather than statistical sampling techniques that make the data more manageable.
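To make the contrast between sampling and using all the data concrete, here is a minimal Python sketch. The invoice values are randomly generated purely for illustration and the variable names are hypothetical; nothing here comes from a real data set.

```python
import random
import statistics

# Hypothetical data: 100,000 made-up invoice values, for illustration only.
random.seed(42)
invoice_values = [round(random.uniform(10, 5000), 2) for _ in range(100_000)]

# Traditional approach: estimate the average from a 1% random sample.
sample = random.sample(invoice_values, k=1_000)
sample_mean = statistics.mean(sample)

# 'Big Data' approach: use every record rather than a sample.
full_mean = statistics.mean(invoice_values)

print(f"Mean from 1% sample: {sample_mean:.2f}")
print(f"Mean from all data:  {full_mean:.2f}")
print(f"Difference:          {abs(sample_mean - full_mean):.2f}")
```

On a data set this small, either approach runs instantly on a desktop computer. The point is only that the sample gives an estimate, while using all the data gives the exact figure for the records held.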

Structured/unstructured

Before the advent of big data, most data used in business was organised into the rows and columns of tables within a relational database. Changes in technology and processing power now make it possible to extract meaning from much more free-format data, including web pages, documents and pictures. Big data is often generated automatically via sensors built into devices such as aeroplane engines, mobile phones and fridges.

This is one of the most important changes that underpin big data. For data to be useful, it used to have to be composed of text or figures, held in rows and columns, within a structured database. Now, pictures and sounds have been digitised, allowing useful information to be extracted from photographs, videos and recorded conversations. In a sense, even sentiment and emotion have been digitised.

Social media, and our willingness to sacrifice our privacy for the excitement of Twitter, Facebook and product reviews, mean that we digitise a great deal of information about what we are doing and how we feel about things. This data can then be analysed to do anything from deciding how clever we are to monitoring our current state of health.

Internal/external

Big Data analysis often combines internal data with external sources of data to help make predictions, for example by revealing how the behaviour of customers, or potential customers, might be influenced by environmental factors.
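As a rough illustration of combining internal and external data, the following Python sketch matches daily sales figures against daily temperatures by date. All of the figures, and the names daily_sales, daily_temperature and combine_by_date, are invented for this example.

```python
# Hypothetical internal data: daily sales totals keyed by ISO date.
daily_sales = {
    "2017-05-01": 1200.0,
    "2017-05-02": 950.0,
    "2017-05-03": 1430.0,
}

# Hypothetical external data: daily maximum temperature (deg C) by date.
daily_temperature = {
    "2017-05-01": 14.5,
    "2017-05-02": 11.0,
    "2017-05-03": 18.2,
    "2017-05-04": 16.0,  # no matching sales record, so it is dropped below
}


def combine_by_date(internal, external):
    """Inner join: keep only the dates that appear in both sources."""
    shared_dates = sorted(internal.keys() & external.keys())
    return [(date, internal[date], external[date]) for date in shared_dates]


combined = combine_by_date(daily_sales, daily_temperature)
for date, sales, temperature in combined:
    print(date, sales, temperature)
```

Once the two sources share a common key, in this case the date, the combined rows can be charted or analysed to see whether the external factor appears to influence the internal figures.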

Patterns, correlations and comparisons

A lot of Big Data analysis is concerned with the recognition of patterns, not only as a means of extracting useful information from a vast amount of data but also because the existence of a pattern suggests the ability to predict future behaviour.

Timing

Many big data applications analyse events in real time rather than using historical data. Continuous streams of data can be analysed to monitor industrial processes, financial transactions or social interactions, for example.

Machine learning

Machine learning is often closely associated with big data analysis. Because there is usually far too much data for direct, human interpretation, algorithms are used that can make predictions and decisions based on the data. In a recent AccountingWEB article, Trent McLaren looked at what machine learning is and how it could be applied to accountancy.

Visualisation

A key element of working with big data is the presentation of the end result. Often, the amount of data that needs to be presented precludes the use of simple tables of figures, and instead relies on the use of charts and other graphics to convey the message behind the data. This is an area that has seen significant improvements in desktop software capabilities in recent years. As well as the seven new chart types introduced in Excel 2016, Power Map/3D Map brings animation and satellite imagery to the visualisation of data with a geographical component, and Power BI includes a whole range of visualisation types, including word clouds and animated bubble charts.


More than Business Intelligence

Given that the tools available to us in Excel and Power BI fall into the category of Business Intelligence, perhaps one of the best ways to define Big Data in a business context is to see how it differs from Business Intelligence. Wikipedia contrasts the two as follows:

  • Business Intelligence uses descriptive statistics with data with high information density to measure things, detect trends, etc.
  • Big data uses inductive statistics and concepts from nonlinear system identification to infer laws (regressions, nonlinear relationships, and causal effects) from large sets of data with low information density to reveal relationships and dependencies, or to perform predictions of outcomes and behaviours
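The difference between describing data and inferring a relationship from it can be sketched in a few lines of Python. The advertising and sales figures below are invented, and a simple straight-line fit stands in for the far more sophisticated methods the definition refers to; it is an illustration of the idea, not of any particular Big Data technique.

```python
import statistics

# Hypothetical monthly figures, both in thousands of pounds.
ad_spend = [10, 12, 15, 18, 20, 25]
sales = [110, 118, 135, 150, 158, 185]

# Descriptive statistics: summarise what has already happened.
average_sales = statistics.mean(sales)

# Inductive statistics: fit sales = intercept + slope * ad_spend
# by ordinary least squares, then use the fitted line to predict.
mean_x = statistics.mean(ad_spend)
mean_y = average_sales
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(ad_spend, sales)) / sum(
    (x - mean_x) ** 2 for x in ad_spend
)
intercept = mean_y - slope * mean_x


def predict_sales(spend):
    """Predict sales for a given spend using the fitted straight line."""
    return intercept + slope * spend


print(f"Average monthly sales: {average_sales:.1f}")
print(f"Fitted line: sales = {intercept:.1f} + {slope:.2f} * ad_spend")
print(f"Predicted sales for a spend of 30: {predict_sales(30):.1f}")
```

The average tells us about the months we already have. The fitted line goes a step further and makes a claim about a month we have not yet seen, which is why it needs to be treated with more caution.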

Where to start?

As exciting as the possibilities of Big Data are, in many cases there is a lot of work to be done with existing, internal, structured data before exploring the frontiers of external data. The series will cover a selection of different data types including the structured data that could be found in, or extracted from, accounting applications as well as more free-format data such as the words used on Donald Trump's Facebook page.

Because the availability of the Excel Power BI tools varies with the different versions and editions of Excel, we will also be using Power BI Desktop to demonstrate many of the techniques that we will be looking at. The free version can be downloaded here: https://powerbi.microsoft.com/en-us/.

In the next part of the series we will start our exploration of Big Data techniques by using Power BI to combine some internal accounting data with over 1.6 million rows of data downloaded from the Ofcom website.

About Simon Hurst

Simon Hurst

Simon Hurst is the founder of technology training consultancy The Knowledge Base and is a past chairman of the ICAEW's IT Faculty.

 

Replies


11th May 2017 10:27

Many thanks

By wainers
11th May 2017 14:00

Excellent introduction, Simon. My own area of work is data analytics using mainly internal structured financial data, so I'm really looking forward to getting my head 'out of the weeds' and into the 'wild flower meadow' of external data and less traditional techniques.

12th May 2017 19:24

Thanks for the comments. Glad it was useful. Wild flower meadow here we come! (Although, in my case, that could involve a great deal of sneezing and sore eyes...)
