Following on from Part 1, this week we’ll start working with the data we’ve loaded into our jupyter notebook.
To start, login to the Azure Notebooks account you signed up to from the homepage.
It’s up to you, but to keep the code we shared in Part 1 separate we’ll be creating another notebook file called Part2.ipynb. If you want to do this also, load your file from last week, then go to File and click “Make a Copy”. Otherwise feel free to carry on working in the same file.
Filtering Data
In many cases, we’ll want to filter the data that we’re working with, similar to how we’re able to filter a range of cells in Excel.
Let’s say we want to show all invoices from our imaginary customer “Moonlimited”. To do this, type the following into your notebook remembering to press SHIFT+ENTER to run the code:
Note we’ve used the notation df['customer'] to reference the “customer” column. If you run this on its own it simply returns the customer column. It’s worth noting here that we’ve used a double equals sign “==” to return the rows where the customer is “Moonlimited”. It’s a double equal to distinguish with the more common single equal “=” used for assigning variables.
Now let’s say we want to filter all invoices issued before October 2017 which have a value greater than £2000. To do this run the following:
Note we’ve enclosed our conditions in parentheses (this is very important) and used the ampersand symbol “&” to denote both conditions must be true. If instead we want either the first or second conditions to hold we need to use pipe symbol “|” instead.
Exercise 1: how would one go about filtering invoices in November 2017 less than £5000?
Sorting Data
To sort our data by “total_amount” with the largest value first we need to run:
By default, the sort_values function returns ascending sorted values so we need to explicitly let pandas know that we want our sort to be descending.
If instead we want to sort by “customer” then “total_amount” we need to run:
Exercise 2: how can we sort ascending by “customer” (as shown) and descending by “total_amount”?
Hint: this one is trickier - it would be worth taking a look at the pandas documentation for sort_values here.
Adding and filtering columns
We haven’t said much about our dummy debtor invoices data yet, but lets assume the “total_amount” we have is exclusive of VAT and we want to add two more columns: one showing the VAT and the other the total inclusive of VAT. The code for doing this is pretty straightforward. To add a column for VAT we run:
I’ve assumed a VAT rate of 20% here but you can of course change this. Again we can run “df” to see the result of this:
Next to add our column for total amount inclusive of VAT we run:
There are at least a couple of ways you can filter the columns that we have in our df (data frame) object. The first being that we simply drop any columns we’re not interested in, eg:
The “axis=1” here is to indicate that we want to drop columns instead of rows. To drop particular rows we would use “axis=0” and pass in the list of row numbers we want to drop e.g. [0, 2, 3] to drop the 1st, 3rd and 4th rows.
Another way to do this is to explicitly say which columns we want to keep. To do this we can for example run the following to just show the customer and total_amount_gross columns:
Applying a function
Before jumping in, it’s worth giving a quick overview of how functions work in Python. In the same way the “SUM” function in Excel accepts a range of cells as input and returns their sum, a function in Python accepts 0 or more arguments as input and optionally returns something as an output.
For example we can define a function to add two numbers “a” and “b” together as follows:
We could also write a function to get the top 5 technology stories from the BBC website, like this for example:
Coming back to our debtors data, suppose we want to add an extra column showing the number of days an invoice was paid after its due date. To do this we need to apply a function to each row subtracting the due_date from the paid_date and returning the value as a number in days. The function to do this is:
This function accepts a row from our table as input and returns the date difference in days by looking up the appropriate columns.
To apply the function to our table and create a new column in the process we do the following:
To finish up for this week, we can output the table we’ve created thus far as a CSV file by running the following:
We can now see and download this file from our library.
Next week we’ll look at different ways to plot and visualise the data we’ve been working with.
See the code we covered today.