Reading documents (OCR)
About this guide
In this example we'll OCR a small set of scanned documents and create searchable PDFs with embedded OCR text.
Working through this guide should take about 3 minutes.
The CLI makes it easy to OCR files using Aluma and either create searchable PDFs or just retrieve the raw OCR text. This article explains how to use the read command to OCR files and obtain results in the format you want.
Supported file types
You can OCR single or multi-page PDFs containing scanned images. If you OCR a multi-page PDF, every page will be processed and included in the output.
To get the most accurate OCR results you should:
- Scan at 300dpi or greater
- Enable image cleanup options on your scan software, in particular deskew, noise removal and border removal
You can also OCR digitally-created PDFs, in which case the content is converted to an image and then processed.
Note that if you are using Aluma classification or extraction, OCR is performed automatically for image files, and content is taken directly from digital PDFs and other digital documents. You do not need to use the read command in these situations.
Read documents with different (non-English) languages
If your documents contain text in a language that is not English, you can specify the appropriate language(s) so you get optimal results by creating and using a read-profile. See Read documents with different languages for instructions on how to do this.
Create a searchable PDF from a scanned document
To OCR files, use the aluma read command. This takes two parameters: the file (or files) to process and the format you want the results in (PDF, plain text, or a wvdoc).
To OCR a single file and create a searchable PDF, with the OCR text embedded in it, use the following command:
aluma read scan001.pdf -f pdf -m
The -f pdf parameter specifies that the CLI should generate a searchable PDF.
The -m parameter specifies that the output file should be created in the same directory as the input file, with the same filename but with a .ocr.pdf extension. In this example the file created would be scan001.ocr.pdf.
Alternatively, if you are processing a single file, you can specify the output file using the -o parameter like this:
aluma read scan001.pdf -f pdf -o ocr001.pdf
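The -m naming rule described above can be sketched as a small shell function, which is handy when a script needs to know where the output will appear. The function name expected_output is hypothetical and not part of the CLI; it only mirrors the rule as this guide states it (same directory, same filename, .ocr.pdf or .ocr.txt extension).

```shell
# Hypothetical helper (not part of the aluma CLI): derive the output path
# that -m is described as producing for a given input file and format.
expected_output() {
  local input format dir name
  input=$1     # e.g. scan/scan001.pdf
  format=$2    # pdf or txt
  case $input in
    */*) dir=${input%/*}/ ;;   # keep the input's directory
    *)   dir= ;;               # bare filename: no directory prefix
  esac
  name=${input##*/}            # filename without directories
  printf '%s%s.ocr.%s\n' "$dir" "${name%.*}" "$format"
}

expected_output scan001.pdf pdf   # prints scan001.ocr.pdf
```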
Get the OCR text from a scanned document
In some situations, such as passing OCR text to a search engine or full text index, you may prefer to obtain the raw OCR text for the document.
To OCR a single file and create a file containing the text, use the following command:
aluma read scan001.pdf -f txt -m
The -f txt parameter specifies that the CLI should generate text output.
The -m parameter specifies that the output file should be created in the same directory as the input file, with the same filename but with a .ocr.txt extension. In this example the file created would be scan001.ocr.txt.
Alternatively, if you are processing a single file, you can specify the output file using the -o parameter like this:
aluma read scan001.pdf -f txt -o ocr001.txt
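If you want the result of a single-file run to land in a separate directory straight away, the -o parameter can be wrapped in a small helper. The following is a minimal sketch rather than a recommended workflow: ocr_into_dir is a hypothetical function name, and the only CLI features it relies on are aluma read, -f and -o as shown in this guide.

```shell
# Hypothetical helper: OCR one file and write the result into a chosen
# directory, using only the -f and -o parameters described in this guide.
ocr_into_dir() {
  local input outdir format name output
  input=$1     # e.g. scan/scan001.pdf
  outdir=$2    # e.g. ocr
  format=$3    # pdf or txt
  name=${input##*/}                        # filename without directories
  output=$outdir/${name%.*}.ocr.$format    # e.g. ocr/scan001.ocr.txt
  mkdir -p "$outdir" || return 1
  aluma read "$input" -f "$format" -o "$output" || return 1
  printf '%s\n' "$output"                  # report where the result went
}
```

For example, ocr_into_dir scan/scan001.pdf ocr txt would run aluma read scan/scan001.pdf -f txt -o ocr/scan001.ocr.txt. Because it handles one file per call, it is best suited to occasional files rather than large batches.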
Output OCR text to the console
If you want to give the OCR text for a file a quick visual check, you can output it directly to the console by omitting the -m and -o parameters:
aluma read scan001.pdf
OCR multiple files
The CLI makes it easy to OCR large sets of files, and will send multiple files to the Aluma service at the same time so you get results quickly. To OCR multiple files, you can specify a file pattern that matches multiple files or directories.
For example, the following command will OCR all files with the extension .pdf in subdirectories of the scan directory and create a corresponding .ocr.pdf output file for each.
aluma read scan/**/*.pdf -f pdf -m -p
The -p (or --progress) parameter displays a progress bar so you can see how many files have been processed and how many are pending.
See Selecting files to process for examples of how to use file patterns to select multiple files.
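After a large batch run it can be useful to confirm that every input produced an output. The sketch below is one possible check for macOS or Linux shells, assuming the .ocr.pdf naming rule described in this guide; missing_outputs is a hypothetical helper and not part of the CLI.

```shell
# Hypothetical helper: list input PDFs under a directory that have no
# matching .ocr.pdf file next to them (assumes the -m naming rule).
missing_outputs() {
  # Skip files that are themselves OCR outputs, then test each remaining
  # input for a sibling file ending in .ocr.pdf.
  find "$1" -type f -name '*.pdf' ! -name '*.ocr.pdf' -exec sh -c '
    for input do
      [ -e "${input%.pdf}.ocr.pdf" ] || printf "%s\n" "$input"
    done' sh {} +
}
```

Running missing_outputs scan prints one path per input that still lacks an output, and prints nothing when the batch is complete.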
Moving output files to a different directory
When you process multiple files you must use the -m option, which creates each output file in the same directory as its original input file. If you want to move the output files to a separate directory, you can do this with one of the following commands (on Windows).
If all your files are in a single directory (scan in this example), you can move them to the ocr directory like this:
mv scan/*.ocr.pdf ocr
If your files are split over multiple sub-directories, the following PowerShell command will copy all *.ocr.pdf files in all sub-directories to another directory, preserving the directory structure:
Copy-Item ./scan ./ocr -filter *.ocr.pdf -Recurse
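The commands above target Windows. On macOS or Linux, a rough equivalent that moves the files while preserving the directory structure might look like the following sketch. The function name move_ocr_outputs is hypothetical, and the loop assumes filenames do not contain newline characters.

```shell
# Hypothetical helper: move every *.ocr.pdf under a source directory into
# a destination directory, recreating the same sub-directory layout.
move_ocr_outputs() {
  local src dest rel
  src=$1     # e.g. scan
  dest=$2    # e.g. ocr
  # List outputs relative to the source directory, one path per line.
  (cd "$src" && find . -type f -name '*.ocr.pdf') |
    while IFS= read -r rel; do
      rel=${rel#./}                          # drop the leading ./
      mkdir -p "$dest/$(dirname "$rel")"     # recreate the sub-directory
      mv "$src/$rel" "$dest/$rel"
    done
}
```

For example, move_ocr_outputs scan ocr would move scan/a/scan001.ocr.pdf to ocr/a/scan001.ocr.pdf and leave the original scan001.pdf in place.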