# Coding for accountants part two: Filtering and sorting data

istock_aleksandarnakic_cd

Following on from Part 1, this week we’ll start working with the data we’ve loaded into our jupyter notebook.

To start, login to the Azure Notebooks account you signed up to from the homepage.

It’s up to you, but to keep the code we shared in Part 1 separate we’ll be creating another notebook file called Part2.ipynb. If you want to do this also, load your file from last week, then go to File and click “Make a Copy”. Otherwise feel free to carry on working in the same file.

## Filtering Data

In many cases, we’ll want to filter the data that we’re working with, similar to how we’re able to filter a range of cells in Excel.

Let’s say we want to show all invoices from our imaginary customer “Moonlimited”. To do this, type the following into your notebook remembering to press SHIFT+ENTER to run the code:

## Content series

### Coding: How to build a machine learning algorithm

Note we’ve used the notation df['customer'] to reference the “customer” column. If you run this on its own it simply returns the customer column. It’s worth noting here that we’ve used a double equals sign “==” to return the rows where the customer is “Moonlimited”. It’s a double equal to distinguish with the more common single equal “=” used for assigning variables.

Now let’s say we want to filter all invoices issued before October 2017 which have a value greater than £2000. To do this run the following:

Note we’ve enclosed our conditions in parentheses (this is very important) and used the ampersand symbol “&” to denote both conditions must be true. If instead we want either the first or second conditions to hold we need to use pipe symbol “|” instead.

Exercise 1: how would one go about filtering invoices in November 2017 less than £5000?

## Sorting Data

To sort our data by “total_amount” with the largest value first we need to run:

By default, the sort_values function returns ascending sorted values so we need to explicitly let pandas know that we want our sort to be descending.

If instead we want to sort by “customer” then “total_amount” we need to run:

Exercise 2: how can we sort ascending by “customer” (as shown) and descending by “total_amount”?

Hint: this one is trickier - it would be worth taking a look at the pandas documentation for sort_values here.

We haven’t said much about our dummy debtor invoices data yet, but lets assume the “total_amount” we have is exclusive of VAT and we want to add two more columns: one showing the VAT and the other the total inclusive of VAT. The code for doing this is pretty straightforward. To add a column for VAT we run:

I’ve assumed a VAT rate of 20% here but you can of course change this. Again we can run “df” to see the result of this:

Next to add our column for total amount inclusive of VAT we run:

There are at least a couple of ways you can filter the columns that we have in our df (data frame) object. The first being that we simply drop any columns we’re not interested in, eg:

The “axis=1” here is to indicate that we want to drop columns instead of rows. To drop particular rows we would use “axis=0” and pass in the list of row numbers we want to drop e.g. [0, 2, 3] to drop the 1st, 3rd and 4th rows.

Another way to do this is to explicitly say which columns we want to keep. To do this we can for example run the following to just show the customer and total_amount_gross columns:

## Applying a function

Before jumping in, it’s worth giving a quick overview of how functions work in Python. In the same way the “SUM” function in Excel accepts a range of cells as input and returns their sum, a function in Python accepts 0 or more arguments as input and optionally returns something as an output.

For example we can define a function to add two numbers “a” and “b” together as follows:

We could also write a function to get the top 5 technology stories from the BBC website, like this for example:

Coming back to our debtors data, suppose we want to add an extra column showing the number of days an invoice was paid after its due date. To do this we need to apply a function to each row subtracting the due_date from the paid_date and returning the value as a number in days. The function to do this is:

This function accepts a row from our table as input and returns the date difference in days by looking up the appropriate columns.

To apply the function to our table and create a new column in the process we do the following:

To finish up for this week, we can output the table we’ve created thus far as a CSV file by running the following:

Next week we’ll look at different ways to plot and visualise the data we’ve been working with.

Jahan is the founder and CEO of Indigo Cashflow, an app created to take the pain out of cash flow forecasting. Prior to this he ran a successful software consultancy, developed FX trading models at Bank of America Merrill Lynch using machine learning techniques, and worked as a Mathematics Lecturer at Bristol University. He holds a PhD in Mathematics from Oxford University.

### Replies

15th Feb 2018 15:02

Crikey! I only want to keep a Sales Day Book not plot a course to Jupiter.

Thanks (1)
23rd Feb 2018 13:06

Hi Jahan

I keep getting the "df not defined" error when I try to run the code. I have typed in the code several times and even copies and pasted it from your notebook but it still doesn't seem to work. Any ideas?

df[(df['issue_date']<'2017-10-01') & (df['total_amount']>2000)]

NameError Traceback (most recent call last)
in ()
----> 1 df[(df['issue_date']<'2017-10-01') & (df['total_amount']>2000)]

NameError: name 'df' is not defined

Thanks (0)
to alexm111
24th Feb 2018 14:38

Hi

It looks like the CSV file needs to be loaded into the "df" variable in your jupyter notebook. To do that you'll need to run the line which has

(remembering to press SHIFT+ENTER). I hope that fixes the issue!

Thanks (0)
01st Mar 2018 15:07

Why would some dates in the due date field be imported differently to others ? For the first three records I have a due date in Python of 2nd May, 2nd August and 17th February and in the download csv these dates are showing as 2nd February, 8th February and 17th February.

Thanks (0)
to elliottrob
01st Mar 2018 17:43

It sounds like when you open the file in Excel it shows a US date format so 02/08/2017 would be read as 8th of February rather than 2nd August. It's quite common issue, which I come across enough. The dates in the CSV file should be in the standard ISO 8601 (YYYY-MM-DD) format i.e. 2017-08-02 in this case. Hope that answers the question

Thanks (0)