Aluma makes it easy to OCR large sets of documents. Built from the ground up for the cloud, Aluma can process many documents at the same time, so we can give you fast results.
In this example we'll OCR a small set of scanned documents and create searchable PDFs with embedded OCR text.
Working through this guide should take about 3 minutes.
Before you start you must have:
- Installed the Aluma CLI and logged in to connect it to your account
- Installed the example documents
If you have not done these steps, follow the Getting started guide and then return to this one.
The documents we're going to create searchable PDFs from are the scanned UK invoices in the examples/invoices/uk directory.
To create searchable PDFs, we use the CLI's read command and specify the files we want to process. We'll also specify that:
- we want a PDF as output, using the -f pdf option
- we want to create a new PDF for each file, using the -m option
- we want to show a progress bar, using the -p option
Enter this command. It will take a few seconds to run:
aluma read examples/invoices/uk/*.* -f pdf -m -p
Now for each original file in the examples/invoices/uk directory, like 001.tif, you will have a file named 001.ocr.pdf, which is a searchable PDF with embedded OCR text.
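If you want to confirm that every input produced an output, a small shell helper can do the check for you. The sketch below is only an illustration: check_ocr_outputs is a hypothetical function, not part of the Aluma CLI, and it assumes each .ocr.pdf file is written alongside its source file.

```shell
# Sketch: report which input files have a matching searchable PDF.
# ASSUMPTION: each NNN.ocr.pdf sits next to its source file.
# check_ocr_outputs is a hypothetical helper, not an Aluma command.
check_ocr_outputs() {
    dir="$1"
    missing=0
    for src in "$dir"/*; do
        [ -f "$src" ] || continue          # skip subdirectories and an unmatched glob
        case "$src" in
            *.ocr.pdf) continue ;;         # skip the generated PDFs themselves
        esac
        base=$(basename "$src")
        out="$dir/${base%.*}.ocr.pdf"      # 001.tif -> 001.ocr.pdf
        if [ -f "$out" ]; then
            echo "ok       $out"
        else
            echo "missing  $out"
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]                   # non-zero exit status if anything is missing
}

# Usage (after running the aluma read command above):
#   check_ocr_outputs examples/invoices/uk
```

The function returns a non-zero exit status when any output is missing, so it can also be used in a script that should stop on incomplete results.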
To get the most accurate OCR results you should scan images at 300 dpi or more. There is rarely any value in going above 300 dpi, so you should only do this if your documents contain very small text.
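If you are not sure what resolution an existing scan has, you can estimate it from the image width in pixels and the physical width of the page. The sketch below is a rough illustration only: estimate_dpi is a hypothetical helper, and the A4 page width of 210 mm is used purely as an example.

```shell
# Sketch: estimate scan resolution from pixel width and physical page width.
# estimate_dpi is a hypothetical helper, not an Aluma command.
estimate_dpi() {
    px="$1"    # image width in pixels
    mm="$2"    # physical page width in millimetres
    # dpi = px / (mm / 25.4), rounded to the nearest integer
    # using integer-only shell arithmetic.
    echo $(( (px * 254 + mm * 5) / (mm * 10) ))
}

estimate_dpi 2480 210   # A4 page, 2480 px wide -> prints 300
estimate_dpi 1654 210   # A4 page, 1654 px wide -> prints 200
```

In this example the second scan falls below the 300 dpi guideline, so rescanning it at a higher resolution would be worth considering.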
Aluma performs some image processing (noise, line and box removal) before reading the text from the document, but if your scan process includes image cleanup, you should configure it to provide the cleanest images you can.