Create files containing OCR text

About this guide

Aluma makes it easy to OCR large sets of documents. It processes many documents at the same time and reads the text in a multi-page document in parallel, so you get results quickly.

In this guide we'll OCR a small set of scanned documents and create files containing the OCR text.

Working through this guide should take about 3 minutes.

Before you begin

Before you start you must have:

  • Installed the Aluma CLI and logged in to connect it to your account
  • Installed the example documents

If you have not done these steps, follow the Getting started guide and then return to this one.

Get the OCR text from a scanned document

The documents we're going to read the text from are the scanned UK invoices in the examples/invoices/uk directory.

To create raw text files, we use the CLI's read command and specify the files we want to process. We'll also specify that:

  • we want to create a new text file for each file, using the -m switch
  • we want to show a progress bar, using the -p switch

Enter this command. It will take a few seconds to run:

aluma read examples/invoices/uk/*.* -m -p

For each original file in the examples/invoices/uk directory, such as 001.tif, you will now have a file named 001.ocr.txt containing the raw OCR text.
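To confirm that every document produced a text file, you can compare the source files against the outputs. The sketch below uses only standard shell tools, not the Aluma CLI. The function name missing_ocr_outputs is hypothetical, and it assumes the naming shown above (001.tif becomes 001.ocr.txt in the same directory).

```shell
# Hypothetical helper (not part of the Aluma CLI): list source files in a
# directory that have no matching .ocr.txt file next to them.
# Assumes the naming convention 001.tif -> 001.ocr.txt.
missing_ocr_outputs() {
  dir="$1"
  for src in "$dir"/*; do
    [ -f "$src" ] || continue          # skip subdirectories and unmatched globs
    name=$(basename "$src")
    case "$name" in
      *.ocr.txt) continue ;;           # skip the OCR outputs themselves
    esac
    out="$dir/${name%.*}.ocr.txt"      # strip the last extension, add .ocr.txt
    [ -f "$out" ] || echo "$src"       # print sources with no OCR output
  done
}

missing_ocr_outputs examples/invoices/uk
```

If the command prints nothing, every source file has a corresponding text file. You can then open any of the .ocr.txt files in a text editor to inspect the raw OCR text.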

Optimising the accuracy of OCR text

To get the most accurate OCR results, scan images at a resolution of at least 300 dpi. There is rarely any value in going above 300 dpi, so use a higher resolution only if your documents contain very small text.
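If you know an image's pixel width and the physical width of the page, you can estimate the resolution it was scanned at and compare it with the 300 dpi guideline. This is a rough sketch of that arithmetic. The function name scan_dpi is hypothetical, and the example assumes an A4 page, which is about 8.27 inches wide.

```shell
# Hypothetical helper: estimate scan resolution as
# pixel width divided by physical page width in inches.
scan_dpi() {
  awk -v px="$1" -v inches="$2" 'BEGIN { printf "%.0f\n", px / inches }'
}

scan_dpi 2480 8.27   # A4 page scanned 2480 pixels wide: prints 300
scan_dpi 1654 8.27   # A4 page scanned 1654 pixels wide: prints 200
```

By this estimate, the second image falls below the recommended resolution, so rescanning it at a higher setting may improve accuracy.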

Aluma performs some noise, line and box removal on each image before reading the text. If your scanning process includes image cleanup, configure it to produce the cleanest images you can.