Create searchable PDFs

About this guide

Aluma makes it easy to OCR large sets of documents. Built from the ground up for the cloud, Aluma can process many documents at the same time, so we can give you fast results.

In this example we'll OCR a small set of scanned documents and create searchable PDFs with embedded OCR text.

Working through this guide should take about 3 minutes.

Before you begin

Before you start you must have:

  • Installed the Aluma CLI and logged in to connect it to your account
  • Installed the example documents

If you have not done these steps, follow the Getting started guide and then return to this one.

Create searchable PDFs

The documents we're going to create searchable PDFs from are the scanned UK invoices in the examples/invoices/uk directory.

To create searchable PDFs, we use the CLI's read command and specify the files we want to process. We'll also specify that:

  • we want a PDF as output, using the -f pdf parameter (we could also get just the raw text by omitting this)
  • we want to create a new PDF for each file, using the -m switch
  • we want to show a progress bar, using the -p switch

Enter this command. It will take a few seconds to run:

aluma read examples/invoices/uk/*.* -f pdf -m -p

For each original file in the examples/invoices/uk directory, such as 001.tif, you now have a matching file, such as 001.ocr.pdf, which is a searchable PDF with embedded OCR text.
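
If you want to confirm that every original produced an output, a short shell check can help. This is a minimal sketch, not part of the Aluma CLI: the function name list_missing_ocr is hypothetical, and it assumes the outputs sit beside the originals and follow the naming shown above (001.tif gives 001.ocr.pdf).

```shell
# Print every original in a directory that has no matching <name>.ocr.pdf.
# Assumption: outputs are written next to the originals, e.g. 001.tif -> 001.ocr.pdf.
list_missing_ocr() {
  dir="$1"
  for f in "$dir"/*.*; do
    [ -f "$f" ] || continue                    # skip if the glob matched nothing
    case "$f" in *.ocr.pdf) continue ;; esac   # skip the outputs themselves
    out="${f%.*}.ocr.pdf"                      # strip the extension, add .ocr.pdf
    [ -f "$out" ] || echo "$f"                 # report originals with no output
  done
}

list_missing_ocr examples/invoices/uk
```

If the command prints nothing, every original has a searchable PDF alongside it.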

Optimising the accuracy of OCR text

To get the most accurate OCR results, scan images at 300 dpi or higher. Going above 300 dpi rarely adds value, so do so only if your documents contain very small text.
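
If you know an image's pixel width and the paper size it was scanned from, you can estimate its resolution with simple arithmetic: dpi is pixels divided by inches, and one inch is 25.4 mm. The sketch below is illustrative only. The function name estimate_dpi is hypothetical, and you need to supply the pixel width yourself (for example from your image viewer).

```shell
# Estimate scan resolution from pixel width and paper width in millimetres.
# dpi = pixels / inches, with 1 inch = 25.4 mm; rounded to the nearest integer.
estimate_dpi() {
  px="$1"   # image width in pixels
  mm="$2"   # paper width in millimetres
  echo $(( (px * 254 + mm * 5) / (mm * 10) ))
}

estimate_dpi 2480 210   # A4 page, 2480 pixels wide: prints 300
```

An A4 page that is only 1240 pixels wide works out at about 150 dpi, which is below the recommended resolution.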

Aluma removes some noise, lines and boxes from each image before reading the text from the document. If your scan process includes image cleanup, configure it to provide the cleanest images you can.