Can a Robot Read a PDF Document?

One of the starting triggers for our robots is the receipt of a PDF document from an external source. In a human resources context, these could be candidate resumes. In accounts payable, they could be invoices, credit notes or creditor statements. In payroll, they could be staff timesheets, annual leave requests and so on. You get the idea.

These documents usually come in one of two types:

1) Indexed PDF documents

These are documents that have all of their text digitally embedded, so you wouldn't need a robot to read them. You simply need a PDF-to-text extraction tool. Be sure to research a variety of tools and the types of output they provide. Ideally, the tool should return all words on the document, alongside X/Y coordinates for each word. Most PDF-to-text extraction tools will be able to index these documents with high accuracy, as all of this data will have been embedded by the software that created the PDF.
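To show why the X/Y coordinates matter, here is a minimal sketch in Python. It assumes your extraction tool gives you each word with a position; the `Word` type, the two helper functions and the sample invoice data are all hypothetical, and real tools will shape their output differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One extracted word and the top-left corner of its box (assumed shape)."""
    text: str
    x: float  # horizontal position of the left edge
    y: float  # vertical position of the top edge


def group_into_lines(words, y_tolerance=2.0):
    """Group words whose vertical positions sit within y_tolerance of each other."""
    lines = []
    for word in sorted(words, key=lambda w: (w.y, w.x)):
        # Compare against the first word of the current line
        if lines and abs(lines[-1][0].y - word.y) <= y_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Order each line left to right
    return [sorted(line, key=lambda w: w.x) for line in lines]


def value_right_of(words, label, y_tolerance=2.0):
    """Return the text to the right of a single-word label on its line, or None."""
    for line in group_into_lines(words, y_tolerance):
        texts = [w.text for w in line]
        if label in texts:
            idx = texts.index(label)
            return " ".join(texts[idx + 1:]) or None
    return None


# Hypothetical output from an extraction tool for a small invoice
SAMPLE_WORDS = [
    Word("Invoice", 50, 100.0),
    Word("No:", 95, 100.4),
    Word("INV-001", 130, 99.8),
    Word("Total:", 50, 140.0),
    Word("250.00", 130, 140.3),
]

print(value_right_of(SAMPLE_WORDS, "No:"))     # INV-001
print(value_right_of(SAMPLE_WORDS, "Total:"))  # 250.00
```

The `y_tolerance` value is a guess to tune per document type. A small tolerance keeps separate lines apart, while a larger one copes with words that sit slightly above or below their neighbours.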

Unsure if your PDF is indexed? You can test this by opening the document in Adobe Acrobat Reader or any other PDF reader, selecting the whole document with the select tool, and copying it (Ctrl + C). If you can paste all of the text content into Notepad, chances are that your PDF is already indexed.
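The same test can be approximated in code. The sketch below assumes you have already run an extraction tool of your choice and hold its raw text for each page. The function name `looks_indexed` and the `min_chars_per_page` threshold are hypothetical and should be tuned against your own documents.

```python
def looks_indexed(page_texts, min_chars_per_page=20):
    """Guess whether a PDF has embedded text, given the extracted text of each page.

    A scanned page typically yields an empty or near-empty string.
    """
    if not page_texts:
        return False
    for text in page_texts:
        # Count only letters and digits so stray whitespace is not treated as content
        if sum(ch.isalnum() for ch in text) < min_chars_per_page:
            return False
    return True


digital_pages = ["Invoice number INV-001 Total due 250.00"]
scanned_pages = ["", "  \n"]

print(looks_indexed(digital_pages))  # True
print(looks_indexed(scanned_pages))  # False
```

Note that this sketch treats a genuinely blank page as unindexed, so documents with blank pages may need a more forgiving rule.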

2) Unindexed PDF documents

These documents do not have any words digitally embedded by their originating software. They are commonly PDFs that come from a scanner or a mobile phone camera.

To extract words from these documents, our robots come with an OCR (Optical Character Recognition) capability. OCR uses image recognition to place digital text on the document; the entire document is converted into an image for the tool to process. This capability is built on Tesseract, a leading open-source OCR engine relied upon by developers globally.

However, there is more work to do after the OCR tool has extracted the data. Our robots will still need to deal with the following:

1) Word quality - If the document is not of sufficient quality, the OCR tool will make recognition errors, which may result in corrective work. Some of the better OCR tools use an inbuilt dictionary to auto-correct words during processing, with an adjustable margin of error.

2) Word coordinates - The coordinates of the words on the page will rarely match their original positions exactly. This is important as you will be relying on the vertical and horizontal positions of words to properly index document fields, as detailed in the prior section.
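To illustrate the dictionary-based auto-correction mentioned under word quality, here is a minimal sketch using Python's standard `difflib` module. This is not how any particular OCR tool does it internally; the `correct_word` helper, the tiny vocabulary and the cutoff values are assumptions for illustration, with `cutoff` playing the role of the adjustable margin of error.

```python
import difflib


def correct_word(word, vocabulary, cutoff=0.8):
    """Return the closest vocabulary entry if it is similar enough, else the word unchanged."""
    if word in vocabulary:
        return word
    # cutoff is a similarity score between 0 and 1; higher means stricter matching
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


# Hypothetical vocabulary of words expected on an invoice
VOCABULARY = ["Invoice", "Total", "Amount"]

print(correct_word("lnvoice", VOCABULARY))              # Invoice
print(correct_word("Arnount", VOCABULARY))              # Arnount (too different at 0.8)
print(correct_word("Arnount", VOCABULARY, cutoff=0.7))  # Amount
```

A stricter cutoff leaves more errors uncorrected, while a looser one risks "correcting" words that were right in the first place, such as names or reference numbers.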

Some innovative scanners have inbuilt OCR software that embeds words onto each document during scanning. Microsoft also offers Microsoft Lens (originally released as Office Lens), a mobile application that embeds words into pictures taken with your phone camera (for example, of a whiteboard during a team brainstorming session).

After OCR has been run on the document, you can then run it through the PDF-to-text extraction tool described above.
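Put together, the flow for both document types might look like the sketch below. The extraction and OCR steps are passed in as functions because they depend on the tools you choose; `read_document`, `extract_text`, `run_ocr` and the `min_chars` threshold are hypothetical names and values, and the fake tools at the bottom exist only to make the example runnable.

```python
def read_document(pdf_path, extract_text, run_ocr, min_chars=20):
    """Return (text, ocr_was_used) for a PDF.

    extract_text(path) -> str   text found by your PDF-to-text tool (assumed)
    run_ocr(path) -> str        path of a new PDF with OCR text embedded (assumed)
    """
    text = extract_text(pdf_path)
    if sum(ch.isalnum() for ch in text) >= min_chars:
        # Enough embedded text was found, so treat the document as indexed
        return text, False
    # Otherwise run OCR first, then extract from the OCR output
    ocr_path = run_ocr(pdf_path)
    return extract_text(ocr_path), True


# Stand-ins for real tools, for demonstration only
FAKE_TEXT = {
    "digital.pdf": "Timesheet for week ending 12 May, 38 hours",
    "scan.pdf": "",
    "scan.ocr.pdf": "Invoice number INV-001 total 250.00",
}


def fake_extract(path):
    return FAKE_TEXT[path]


def fake_ocr(path):
    return path.replace(".pdf", ".ocr.pdf")


print(read_document("digital.pdf", fake_extract, fake_ocr)[1])  # False
print(read_document("scan.pdf", fake_extract, fake_ocr)[1])     # True
```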

Wrapping up

In conclusion, receiving indexed PDF documents makes your automation project much easier, and in today’s work environment, most incoming documents will be in this format. If you are receiving unindexed documents simply because your internal managers need to physically sign them, some stakeholder persuasion may be in order to move this approval process online where possible. There is a wealth of document approval tools available in the market.

If your organisation’s or department’s processes depend on processing high volumes of incoming documents, I would encourage you to explore document indexing. I feel this area is significantly overlooked and, with some time and research, it could save a large amount of employee time.

Happy automating.
