• Sam Hoh

Can a Robot read PDF documents?


One of the starting triggers for most processes is the receipt of a PDF document from an external source. In human resources, these could be candidate resumes; in accounts payable, creditor invoices and statements; in payroll, staff timesheets and annual leave requests.

So if you were to automate such a process end to end, one of the key questions is whether you can digitally extract the text from these documents. Otherwise, the first step will remain with an employee capturing the data by typing it in manually.

Attempt to obtain data from source

Before you go down the digital extraction path, always check if you can get the data directly from the sender. If you have a high volume of documents from a small number of external parties, ask each party’s account manager if they can also send through the detail in Excel, CSV or delimited TXT format. Most organisations that send documents via their ERP system will have the functionality to attach a separate CSV file alongside the PDF document. Obtaining the data from source will always be more accurate than digital extraction, and will involve less work as the data is already structured.

If you have a large number of external parties sending through documents, who cannot be easily or practically regulated, then let’s get straight to the digital extraction route.

1) Indexed PDF documents

These are documents that have all their text digitally embedded in them. You will simply need a PDF-to-text extraction tool. Be sure to research a variety of tools and the types of output they provide. The tool should ideally return all words on the document, alongside X/Y coordinates for each word. Most PDF-to-text extraction tools will be able to index these documents with high accuracy, as all this data was embedded by the software that created the PDF.

Unsure if your PDF is indexed? You can test this by opening the document in Adobe Acrobat or any other PDF reader, selecting the whole document with the select tool, and copying it (Ctrl + C). If you can paste all the text content into Notepad, chances are that your PDF is already indexed.
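The same check can be done in code once a document has been parsed. Below is a minimal sketch, assuming the parsed output follows the shape we understand pdf2json to produce (a Pages array, each page holding Texts, each text holding runs R with URI-encoded text T). The function names countEmbeddedRuns and looksIndexed are our own and purely illustrative, and the shape should be verified against the version of the library you install.

```javascript
// Decode a text run; fall back to the raw value if it is not valid URI encoding.
function decodeRun(raw) {
  try {
    return decodeURIComponent(raw);
  } catch (err) {
    return raw;
  }
}

// Count the non-empty text runs in a parsed PDF.
// ASSUMPTION: `parsed` looks like { Pages: [ { Texts: [ { R: [ { T } ] } ] } ] }.
function countEmbeddedRuns(parsed) {
  let runs = 0;
  for (const page of parsed.Pages || []) {
    for (const text of page.Texts || []) {
      for (const run of text.R || []) {
        if (decodeRun(run.T || "").trim() !== "") runs += 1;
      }
    }
  }
  return runs;
}

// Treat the PDF as indexed when it holds at least `minRuns` text runs.
// The threshold is a hypothetical knob; a scanned PDF typically yields none.
function looksIndexed(parsed, minRuns = 1) {
  return countEmbeddedRuns(parsed) >= minRuns;
}
```

A document that fails this check would be routed to the OCR path instead.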

A Live example – Automating Accounts payable

Fortunately, we are working on Jim The Robot’s PDF-to-Text functionality as we speak, and indexed PDF is the format in which the majority of creditor invoices and statements are sent to our clients.

We are running Jim’s Brains on Node.js, a JavaScript runtime, and have found a common JavaScript PDF-to-text library called pdf2json. I won’t go into the technical detail, as this article is not the right forum for it. We did, however, need to merge words together into blocks, as the library’s output was a list of individual pieces of text. We had to develop the code a bit further to work out how wide a PDF ‘space’ was, and if we found another word within that distance, it was merged into the same block. Here is a PDF invoice, and a snippet of the output we generated from it:



What you see are ‘blocks’ of words, alongside their X and Y coordinates on the page. We will then set up rules to teach Jim to search for specific parameters, such as the invoice date ’13 Feb 2020’ being the next block down from ‘Invoice Date’. ‘Capital Cab Co’ is at a different X-position, so it is excluded from the vertical group that Jim searches. This shows how important the coordinates are. They enable you to define the location of specific parameters, as some of these may be located 1 or 2 blocks to the left, right, above or below the field name.
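To make the merging and the ‘next block down’ rule concrete, here is a minimal sketch. It is not Jim’s actual code. It assumes a simplified input of our own invention (one entry per word with x, y and width in the same page units, which is not pdf2json’s raw output), and the function names, tolerances and sample coordinates are all hypothetical.

```javascript
// Merge words on the same row into blocks when the gap between them is no
// wider than `maxGap` (our stand-in for the width of a PDF 'space').
function mergeWordsIntoBlocks(words, maxGap = 0.5, rowTolerance = 0.1) {
  // Group words into rows by Y position.
  const rows = [];
  for (const word of [...words].sort((a, b) => a.y - b.y)) {
    const row = rows[rows.length - 1];
    if (row && Math.abs(word.y - row.y) <= rowTolerance) row.words.push(word);
    else rows.push({ y: word.y, words: [word] });
  }
  // Walk each row left to right, extending the current block or starting a new one.
  const blocks = [];
  for (const row of rows) {
    row.words.sort((a, b) => a.x - b.x);
    let current = null;
    for (const word of row.words) {
      if (current && word.x - (current.x + current.width) <= maxGap) {
        current.text += " " + word.text;
        current.width = word.x + word.width - current.x;
      } else {
        current = { text: word.text, x: word.x, y: row.y, width: word.width };
        blocks.push(current);
      }
    }
  }
  return blocks;
}

// Return the text of the nearest block below `label` at the same X position.
function findValueBelow(blocks, label, xTolerance = 0.2) {
  const anchor = blocks.find((b) => b.text === label);
  if (!anchor) return null;
  const below = blocks
    .filter((b) => b.y > anchor.y && Math.abs(b.x - anchor.x) <= xTolerance)
    .sort((a, b) => a.y - b.y);
  return below.length > 0 ? below[0].text : null;
}

// Made-up coordinates loosely mirroring the invoice described above.
const sampleWords = [
  { text: "Invoice", x: 10, y: 5, width: 3 },
  { text: "Date", x: 13.3, y: 5, width: 2 },
  { text: "Capital", x: 2, y: 6, width: 3 },
  { text: "Cab", x: 5.3, y: 6, width: 1.5 },
  { text: "Co", x: 7.1, y: 6, width: 1 },
  { text: "13", x: 10, y: 6, width: 1 },
  { text: "Feb", x: 11.3, y: 6, width: 1.5 },
  { text: "2020", x: 13.1, y: 6, width: 2 },
];
const sampleBlocks = mergeWordsIntoBlocks(sampleWords);
console.log(findValueBelow(sampleBlocks, "Invoice Date"));
```

With these sample values, ‘Capital Cab Co’ shares a row with the date but starts at a different X position, so the lookup skips it.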

It is also worth pointing out that suppliers use a variety of invoice layouts, so you will need to configure your automation tool to recognise these as they come in. You will essentially build up a ‘Map’ of invoice templates per supplier. Depending on the functionality of the software, you may be able to configure it to recognise the invoice layout and apply a pre-configured layout map. Some suppliers will be using the same accounting software, particularly in the SME space where only a handful of packages are in common use, so it wouldn’t be hard to map out all of their templates.
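One way such a layout ‘Map’ could be expressed is sketched below, purely as an illustration. The supplier name, the fingerprint idea (a block of text that identifies the template), the field rules and the sample values are all assumptions of ours, not a description of any particular product.

```javascript
// Hypothetical layout maps: one entry per supplier template.
const layoutMaps = [
  {
    supplier: "Example Supplier Ltd", // made-up supplier
    fingerprint: "Example Supplier Ltd", // block text that identifies this layout
    fields: {
      invoiceDate: { label: "Invoice Date", direction: "below", steps: 1 },
      total: { label: "Total", direction: "right", steps: 1 },
    },
  },
];

// Pick the first layout whose fingerprint appears among the blocks.
function detectLayout(blocks, maps) {
  return maps.find((m) => blocks.some((b) => b.text === m.fingerprint)) || null;
}

// Find the block `steps` positions away from `anchor` in the given direction.
function neighbour(blocks, anchor, direction, steps, tolerance = 0.2) {
  const vertical = direction === "above" || direction === "below";
  const forward = direction === "below" || direction === "right";
  const along = vertical ? "y" : "x"; // axis we travel along
  const across = vertical ? "x" : "y"; // axis that must stay aligned
  const candidates = blocks
    .filter((b) => Math.abs(b[across] - anchor[across]) <= tolerance)
    .filter((b) => (forward ? b[along] > anchor[along] : b[along] < anchor[along]))
    .sort((a, b) => (forward ? a[along] - b[along] : b[along] - a[along]));
  return candidates[steps - 1] || null;
}

// Apply every field rule in a layout to the blocks of one document.
function extractFields(blocks, layout) {
  const result = {};
  for (const [name, rule] of Object.entries(layout.fields)) {
    const anchor = blocks.find((b) => b.text === rule.label);
    const hit = anchor ? neighbour(blocks, anchor, rule.direction, rule.steps) : null;
    result[name] = hit ? hit.text : null;
  }
  return result;
}

// Made-up blocks for illustration only.
const exampleBlocks = [
  { text: "Example Supplier Ltd", x: 2, y: 1 },
  { text: "Invoice Date", x: 10, y: 5 },
  { text: "13 Feb 2020", x: 10, y: 6 },
  { text: "Total", x: 8, y: 20 },
  { text: "55.00", x: 12, y: 20 },
];
const matchedLayout = detectLayout(exampleBlocks, layoutMaps);
console.log(matchedLayout ? extractFields(exampleBlocks, matchedLayout) : "unknown layout");
```

Keeping the rules as data rather than code means a new supplier template becomes a new entry in the map, not a new function.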

2) Unindexed PDF documents

These documents do not have any words digitally embedded in them by their originating software. They are commonly PDFs that come from a scanner or perhaps from a mobile phone camera.

To extract words from these documents, you need a tool with OCR (Optical Character Recognition) capability. OCR treats each page as an image, recognises the characters in it, and places the resulting digital text on the document. Tesseract is a leading open-source OCR engine, relied upon by developers globally.

However, there is more work to do after the OCR tool has extracted the data. You may need to deal with the following:

1) Word quality - If the document is not of sufficient quality, the OCR tool will misread characters, which may result in corrective work. Some better OCR tools utilise an inbuilt dictionary to auto-correct words during processing, with an adjustable margin of error.

2) Word position - The coordinates of the words on the page will not exactly match their original positions, because scans are often slightly skewed or scaled. This is important as you will be relying on the vertical and horizontal positions of words to properly index document fields, as detailed in the prior section.
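Both issues can be softened with a little tolerance in the matching rules. The sketch below is one possible approach and not how any specific OCR tool works: it snaps a misread field label to the closest known label using edit distance, with maxDistance playing the role of the adjustable margin of error, and compares positions with a tolerance instead of expecting exact equality. All names and default values are hypothetical.

```javascript
// Levenshtein edit distance between two strings (insert, delete, substitute).
function editDistance(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Return the known label closest to the OCR text, or null when nothing is
// within `maxDistance` edits (the adjustable margin of error).
function correctLabel(ocrText, knownLabels, maxDistance = 2) {
  let best = null;
  let bestDistance = maxDistance + 1;
  for (const label of knownLabels) {
    const d = editDistance(ocrText.toLowerCase(), label.toLowerCase());
    if (d < bestDistance) {
      best = label;
      bestDistance = d;
    }
  }
  return best;
}

// Treat two blocks as being in the same column when their X positions are
// close, since OCR coordinates drift from the original layout.
function sameColumn(a, b, tolerance = 0.5) {
  return Math.abs(a.x - b.x) <= tolerance;
}

const knownLabels = ["Invoice Date", "Due Date"];
console.log(correctLabel("lnvoice Dale", knownLabels));
console.log(sameColumn({ x: 10.0 }, { x: 10.3 }));
```

A tighter margin reduces false matches at the cost of more manual corrections, so it is worth tuning against a sample of real scans.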

Some innovative scanners have inbuilt OCR software that embeds words onto each document during scanning. Microsoft also offers Microsoft Lens (originally released as Office Lens), a mobile application that recognises the words in pictures taken with your phone camera, for example of a whiteboard during a team brainstorming session.

After OCR has been run on the document, you will then run it through the PDF-to-text extraction tool detailed above.

Wrapping up

In conclusion, receiving indexed PDF documents makes your automation project much easier, and in today’s work environment, most incoming documents will be in this format. If you are receiving unindexed documents simply because your internal managers need to physically sign them, some stakeholder persuasion may be in order to migrate this approval process online, if this is possible. There is a wealth of document approval tools available in the market.

If your organisation’s or department’s processes are driven by high volumes of incoming documents, I would encourage you to explore document indexing. This is an area that I feel is significantly overlooked and, with some time and research, could save a large amount of employee time.

Happy automating.

#futureofwork #automation #rpa #robotprocessautomation #financialautomation #technology #digital #accountants #digitaltransformation #career #careerdevelopment #accountingandaccountants #digitalworkforce #processautomation

