Create searchable PDFs
About this guide
Aluma makes it easy to OCR large sets of documents. Built from the ground up for the cloud, Aluma can process many documents at the same time, so you get results quickly.
In this example we'll OCR a small set of scanned documents and create searchable PDFs with embedded OCR text.
Working through this guide should take about 3 minutes.
Before you begin
Before you start you must have:
- Installed the Aluma CLI and logged in to connect it to your account
- Installed the example documents
If you have not completed these steps, follow the Getting started guide and then return to this one.
Create searchable PDFs
The documents we're going to create searchable PDFs from are the scanned UK invoices in the /invoices/uk examples directory.
To create searchable PDFs, we use the CLI's read command and specify the files we want to process. We'll also specify that:
- we want a PDF as output, using the -f pdf parameter (we could also get just the raw text by omitting this)
- we want to create a new PDF for each file, using the -m switch
- we want to show a progress bar, using the -p switch
Enter this command. It will take a few seconds to run:
aluma read examples/invoices/uk/*.* -f pdf -m -p
Now for each original file in the examples/invoices/uk directory, like 001.tif, you will have a file named 001.ocr.pdf, which is a searchable PDF with embedded OCR text.
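If you want to confirm that every input produced an output, a small shell check along these lines may help. It is only a sketch: it assumes the naming pattern described above (001.tif becomes 001.ocr.pdf) and that the PDFs are written next to the originals. The function name missing_ocr_outputs is made up for this illustration and is not part of the Aluma CLI.

```shell
#!/bin/sh
# Print every input file that has no matching <name>.ocr.pdf beside it.
# Assumption: outputs follow the 001.tif -> 001.ocr.pdf pattern and are
# written to the same directory as the inputs.
missing_ocr_outputs() {
    dir=$1
    for f in "$dir"/*; do
        [ -f "$f" ] || continue          # skip subdirectories and unmatched globs
        case "$f" in
            *.ocr.pdf) continue ;;       # skip the OCR outputs themselves
        esac
        name=${f##*/}                    # file name without the directory part
        stem=${name%.*}                  # strip the last extension: 001.tif -> 001
        [ -f "$dir/$stem.ocr.pdf" ] || printf '%s\n' "$f"
    done
}

# Prints nothing when every file has a searchable PDF.
missing_ocr_outputs examples/invoices/uk
```

Any path that is printed is an original with no searchable PDF yet, so it is a candidate for running through the read command again.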
Optimising the accuracy of OCR text
To get the most accurate OCR results, scan images at 300 dpi or higher. Going above 300 dpi rarely adds value, so do this only if your documents contain very small text.
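If you are not sure what resolution an existing image was scanned at, you can estimate it from the pixel width and the physical page width, since dpi is simply pixels divided by inches. The sketch below does that arithmetic in plain shell. The function name estimate_dpi and the A4 figures are illustrative assumptions, not part of Aluma.

```shell
#!/bin/sh
# Estimate scan resolution: dpi = pixels / inches, with 25.4 mm per inch.
# Integer arithmetic only; the result is rounded to the nearest whole dpi.
estimate_dpi() {
    pixels=$1      # image width in pixels
    width_mm=$2    # physical page width in millimetres
    echo $(( (pixels * 254 + width_mm * 5) / (width_mm * 10) ))
}

# An A4 page (210 mm wide) scanned to 2480 pixels across: prints 300
estimate_dpi 2480 210
```

A result well below 300 suggests the document is worth rescanning at a higher resolution before OCR.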
Aluma performs some image processing (noise, line and box removal) before reading the text from the document. If your scan process includes image cleanup, configure it to produce the cleanest images you can.