Turn PDF Payslips Into a Single CSV Report

Ever programmed with PDF files? Write a Python script with me!

11 min read3 days ago

As you will see in the upcoming blog posts, I am in my financial literacy era. With the end of the year coming, I wanted to look at my numbers: How much taxes did I pay? How much did I earn for on-call shifts? Multiple PDF files are not the most comfortable way to see this data, and I wanted a single CSV file I could play with in Excel.

Like many good developers, I was too lazy to insert the numbers manually, so I wrote a script. If you enjoy programming — join me on an adventure! And if you are not in the mood — I’ll show you how to tweak the code to match your payslip structure :D

This script receives a directory with payslip PDFs and returns a CSV file with the desired data

The Plan

main.py:
# translate 1 pdf to 1 dict
# loop over the pdf dir
# save all dicts to 1 json file
# translate json report to csv report

We will start by writing the code to read a PDF, including deciding which fields we want in our report. This is the part you’ll need to tweak to match your payslip structure. Once that is figured out, we’ll iterate the whole payslips directory.

In the third step, I chose to add an extra step between the PDF and the CSV — a JSON report. Once we see everything works, we will remove the use of that file.

Lastly, we’ll translate that JSON data into a CSV file. That CSV can then be easily converted to Google Sheets (just click “open with”) or to Excel (instructions can be found here).

That’s an easy nice plan, but you know how it goes — challenges are discovered along the way… Can you guess where things might get complex?

Before we start — an important note: KEEP YOUR PAYSLIPS PRIVATE! If you upload your project to GitHub — be sure not to share those personal details! You can use .gitignore for this:

/payslips_pdf
pdf_rows.txt
report.json
report.csv

Shall we begin?

Read PDF File

We’ll start by reading the PDF and printing all the rows. That way we will know what appears in each row. This only needs to be done once (while a report will probably be created once a month or once a year), and it’s not part of the report -so we will create that in a separate file.

Start by creating a new Python file (I called mine pdf_to_txt.py) and write a function that reads the pdf and prints the result into a .txt file:

This will work, but there are 3 changes to make it more user-friendly:

Get the file path as a command line argument, so the user can run it without touching the code.
Add some error-catching with instructions in case the user runs the command wrong. (I added a warning color but you don’t have to)
We will read PDF files in the main script as well, so moving this function there is better.

Give it a go! Try running the command with the right and the wrong arguments :)

When you run py ./pdf_to_txt.py <PATH_TO_PDF>you will see a new file in the project directory called pdf_rows.txt.

(Don’t worry, this is not my data — it is based on the example payslip from the image below)

Process The PDF Data

Now that we know the structure in which the PDF is read — we can fish the desired values. In my case — here is the information I was interested in:

Notice there is data within the table (categories that might vary every month) and data outside the table.

Data outside the table:

Pay Period — Can be found on row 19

Gross Pay — This one was tricky to find a rule for because it appears after the payments list and doesn’t have the title “Gross Pay”.

As mentioned earlier, payments and deductions can vary, and not every month is the same. Therefore, gross pay might appear in a different row in different months.

I did notice it appears right after the employee name — so that is what I used. Start by adding it hard-coded, and later we will get it externally.

Nett Pay: This one is easy — it appears in line 17.

I gathered those out-of-table values into a function:

Data inside the table

Payment and Deduction Details: This is the juicy part! We’ll start by cutting the rows array to save a few milliseconds in the for loop coming up. Then, I needed to differentiate between list items and other rows.

I’ve noticed that within the whole file, list items are the only ones that match this rule: Start with an alphabetic character and end with a numeric character and contain a space (the last condition is to filter out wrong rows in my payslip, you might not need that).

Now that we have the group of lines, let’s process them to save in the JSON object. At this point, we don’t mind if it’s a payment or a deduction.

For example, we’ll look at the pension item:

PENSION     G   150.00   587.49

I don’t care about the balance (the number on the right), but I do care about the code (G means it’s deducted from the Gross pay — before taxes — and N means it is deducted from the Nett pay — after taxes). So ideally, we’ll have json_obj["Pension (G)"]=150.00.

We’ll use the spaces to split the line. It’s good there are duplicate spaces — that way we can differentiate between space splitting between a couple of words and space splitting between a couple of fields.

The description:

We will find the first double-space and split by it.

The code:

The amount of spaces is dependent on the length of the description, so we can’t know in advance how many are there — that’s why I’ll use lstrip() as well. Now the rest of the line starts with a non-space character.

Not all list items have a code, so we want to check if the line starts with a code or a digit. If it’s a code — I wrap it in () (including a space before the opening parenthesis) , and attach it to the description string. and if not — add nothing.

The amount:

If there was code — we’ll have more spaces to strip. If not, our line might contain two amounts: The monthly and the balance.

There are 4 cases I’ve noticed:

# SALARY      Y   1234.67
# AVC PRCTG   G   1234.00   2345.00
# RSU TAX WH  N   -1234.00  -2345.56
# RSU TAX WH  N             -2345.56

After extracting the category and code, we are left with:

# 1234.67
# 1234.00  2345.00 (we want the amount on the left)
# -1234.00  -2345.56 (we want the amount on the left)
# -2345.56 (we want the amount on the left, which is none)

To cover cases 2–3, we’ll find the index of the spaces separating the amounts and cut the tail. It also works for the first case, where there is no space (aka no tail).

To cover case 4, I’m relying on the difference between two types of categories with a single amount in the row: The first one is like the salary — where we want to save the amount, and the second type is like the tax withholds — where we want to ignore it. The difference is that only deductions keep track of the annual balance in the table — so I am checking for -.

All together, that’s how it looks like:

Write to JSON File

This is not a mandatory step — we can work with a JSON object without exporting the values. I prefer seeing what it looks like, at least for the coding stage.

At this point, you can run py ./main.py and see a new report.json pop up.

Scale to Multiple PDF Files

The only reason this step is getting a dedicated section is because wrapping the pdf_to_dict in a for loop reveals an unpleasant surprise. To demonstrate it, I created a function called iterate_over_pdfs():

This is happening because the list of files is sorted by alphabetic order, so 10 appears before 2. Having the report entries in chronological order can be considered crucial, and not just a nice-to-have feature. Therefore, we need to fix it!

Originally, I thought I’d have to rename the files (Payslip1.pdf -> Payslip01.pdf), but there is a better solution: Sort the list by length. That way, 10 will appear after 2. Then we can also tackle the b'<STRING>' structure by decoding. Lastly, after running the script on Mac, I discovered the directory includes hidden files – so we need to filter them out.

After adjusting the main function to loop over the new list we created — the report.json file will contain an array of payslips (give it a try!).

Create the CSV Report

Because the items in payments and deductions might vary from payslip to payslip, this section is more than just direct translation. CSV is a relational dataset, which means we need to know in advance all the categories in payments and deductions and keep the entry empty for a payslip where it doesn’t exist. JSON, on the other hand, is non-relational and each entry specifies its keys.

With that in mind, the first step in our CSV report is to collect the categories. All the categories.

Collect the categories:

Now, at first glance, you might think to use Set for that — because we want all categories to appear only once. I’ve tried that. The problem with this is that sets are unlisted, and I find it important to match the order of items that appear in the original payslips. When using lists, don’t forget to check if the item exists in the list before appending it:

Now that we have this figured out: Remember earlier when we said we don’t care about which item list is a payment and which is a deduction? Well, we do care now! We don’t have to separate, but I’d expect a payslip report to have all payments on the right and all deductions on the left, not mixed.

Although each payslip might have different list items — some will always exist (because you’ll always pay your taxes ;) ). We can use this to our benefit — and flag PAYE as the start of the deductions! (I'm pretty sure PAYE is only in Ireland, so you'll need to change it to match your payslip)

Finally, I return a single list, because there is no use in separating the payments from the deductions — the split was to assure payments will appear on the right and deductions will appear on the left.

Populate the CSV table:

Now that we have the categories, we can start populating the CSV table:

Each payslip will be a row, and each row will have the fields in a specific order split by a comma. I find it easier to organize the fields in a list, and then join them. Fields that appear in the categories but not in the payslip — will remain empty:

Lastly, we will write to the CSV file:

After this, you’ll have a nice CSV report with all your payslips!

You can make it easier to read by downloading the VS extension RainbowCSV (or any parallel of another IDE)

Remove the use of the .json file

Once we know things work, we don’t need to write to the JSON file: Remove the section of # save dict to json file.

Add a success message

It’s nice to notify the user that the script is done successfully. for that, we will need to expand print_warning() to print_in_color(), and add the option for green color:

Don’t forget to update the call in pdf_to_txt.py as well!

Get the employee’s name as an argument

Once you have your script ready, you’ll want to share it with your colleagues and friends! For a nice user experience, we’ll export the employee name to the command line rather than ask them to open the code.

Read argument

We’ll start with the happy path — assuming the user entered the employee name — and add code that uses it:

In pdf_to_dict(), instead of hard coding EMPLOYEE_NAME = "IFAT NEUMANN", we'll read it from the arguments: employee_name = sys.argv[1]. Don't forget to import sys!

Now we’ll think of other scenarios:

No employee name was given

What if the user did not enter any employee name? We’d wanna catch it as soon as possible, and notify them!

Therefore, we’ll add a check in the first line of the main function. Now, the intuition is to initialize the employee_name variable there - but this will cause bubbling function properties until it reaches the function that uses this variable - and I don't find it a very clean approach.

Lastly, I’ll just try accessing this field — and catch if it’s not there:

Note that adding exceptions means the print_warning() function moves to main.py. Otherwise, you’ll get an error:

ImportError: cannot import name 'print_warning' from partially initialized 
module 'pdf_to_text' (most likely due to a circular import)

Employee name without quotes

We are asking the user to wrap the name with quotes because command line arguments are split by spaces. The only argument we expect is the employee name, so if there is another argument — we know they did not use quotes and we can notify them:

You can skip the quotation requirements and loop over arguments, collecting all the parts of the user name — but I find this approach adds unnecessary complexity.

The employee’s name doesn’t appear on the payslip

We won’t be able to find gross pay if we don’t have the employee’s name as it appears on the payslip.

The earlier spot to identify the mismatch is when we read the pdf. Therefore — well add the check at the beginning of the pdf_to_dict() function:

Note that I moved the employee_name variable to this function because it’s easier to read (rather than if sys.argv[1] not in rows). After that, I pass it to get_fixed_values().

Lastly, we catch the error back in the main function:

To wrap it all up, don’t forget to update your README file with the new instructions.

The full script

Here is the full code of main.py:

And here is pdf_to_txt.py (note it’s calling functions from main.py):

Hope you enjoyed this one! Interested in another weekend coding project? Check out my automated email-sending blog post!

Blogging is my hobby, so I happily spend time and money on it. If you enjoyed this blog post, putting 1 euro in my tipping jar will let me know :) Thank you for your support!