The read command performs OCR on one or more files and writes the results to the console, to a single file, or to one output file per input file. Results can be written as raw text, searchable PDF, or wvdoc.


Wvdoc file format

The wvdoc format is the internal format used by Aluma. It includes character-level information and may be useful in support cases, but most users are unlikely to need it.

Basic usage

To read a document and write the results to the console, just specify the file to read:

aluma read 0001.tif

The default output format is raw text. This is easy to read, but doesn't include any positional data about the characters.

To write the output to a file, add the -m switch. This creates a file containing the text results in the same directory as the document.
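
As a minimal sketch, the -m switch can be wrapped in a small shell helper for use in scripts. The function name read_to_file is hypothetical and not part of Aluma; only the aluma read command and the -m switch come from the text above.

```shell
# Hypothetical helper (not part of the Aluma CLI): OCR one document and
# save the text results next to it instead of printing them.
read_to_file() {
  # Fail early with a clear message if the document does not exist.
  if [ ! -f "$1" ]; then
    echo "No such file: $1" >&2
    return 1
  fi
  # -m writes the results to a file in the same directory as the document.
  aluma read "$1" -m
}
```

Calling read_to_file 0001.tif then runs aluma read 0001.tif -m, provided 0001.tif exists.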

To read multiple files, use a file pattern instead of a filename. See "Selecting files to process" for examples of how to use file patterns to select multiple files.

When you read multiple files, you should also use the -m switch. Otherwise the text for all of the documents is written to the console, which is probably not what you want.
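
For instance, assuming the documents are TIFF files in a single directory, a sketch of reading them all with one results file per document might look like this. The function name read_all_tifs is hypothetical, and here the pattern is expanded by the shell before aluma sees it.

```shell
# Hypothetical helper: OCR every .tif file in a directory (default: the
# current one), writing one results file per document rather than
# sending all of the text to the console.
read_all_tifs() {
  aluma read "${1:-.}"/*.tif -m
}
```

Running read_all_tifs with no argument processes the .tif files in the current directory.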

Creating searchable PDFs

Read results can also be saved as searchable PDF files.

To specify PDF output, use the --format pdf parameter or its shorter form, -f pdf. In the example below, the -m switch creates a .ocr.pdf file for each document.

aluma read *.tif -f pdf -m

Using a read-profile

To specify a read-profile containing a language or languages to use for the read, use the -r switch with the name of the profile:

aluma read 001.tif -r read-profile-name
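
The switches described on this page can presumably be combined, although the text above does not confirm this. As a sketch under that assumption, the following helper reads a set of files with a named read-profile and saves each result as a searchable PDF. The function name read_pdf_with_profile is hypothetical.

```shell
# Hypothetical helper: create searchable PDFs using a named read-profile.
# Assumes -r, -f pdf and -m may be given in the same command.
read_pdf_with_profile() {
  profile="$1"
  shift
  # The remaining arguments are the files (or file patterns) to read.
  aluma read "$@" -r "$profile" -f pdf -m
}
```

The first argument is the profile name and the rest are the documents, so read_pdf_with_profile read-profile-name *.tif would read every TIFF file in the current directory.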