read
The read command performs OCR on one or more files and writes the results to the console, to a single file, or to one output file per file processed. Results can be written in raw text, searchable PDF and wvdoc formats.
Wvdoc file format
The wvdoc format is the internal format used by Aluma. It includes character-level information, and may be useful in support cases. It is unlikely to be useful to most users.
Basic usage
To read a document and write the results to the console, just specify the file to read:
aluma read 0001.tif
The default output format is raw text. This is easy to read, but doesn't include any positional data about the characters.
To write the output to a file, add the -m switch. This creates a file containing the text results in the same directory as the document file.
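As a minimal sketch of the -m switch, the snippet below wraps the single-file case in a small shell function. The ALUMA variable and the read_to_file function are hypothetical helpers introduced here for illustration, not part of Aluma. They only wrap the aluma read invocation described above, so the command can be swapped for echo in a dry run.

```shell
# ALUMA is a hypothetical override; it defaults to the real aluma CLI.
ALUMA="${ALUMA:-aluma}"

# Read one document and write the text results to a file beside it (-m).
# read_to_file is an illustrative name, not an Aluma command.
read_to_file() {
  "$ALUMA" read "$1" -m
}
```

Calling read_to_file 0001.tif is then equivalent to running aluma read 0001.tif -m.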
To read multiple files you can use a file pattern instead of a filename. See Selecting files to process for examples of how to use file patterns to select multiple files.
If you read multiple files you should also use the -m switch, as otherwise the text for all of the documents will be written to the console, which is probably not what you want.
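The multi-file case can be sketched in the same way. The ALUMA variable and the read_many function are hypothetical names used only for this illustration, and the sketch assumes that aluma read accepts several file arguments or a file pattern together with -m.

```shell
# ALUMA is a hypothetical override; it defaults to the real aluma CLI.
ALUMA="${ALUMA:-aluma}"

# Read every file or pattern passed in, writing one text file per document (-m).
# read_many is an illustrative name, not an Aluma command.
read_many() {
  "$ALUMA" read "$@" -m
}
```

For example, read_many *.tif passes the pattern to aluma read and appends -m, so each document gets its own output file instead of all of the text going to the console.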
Creating searchable PDFs
Read results can also be saved as searchable PDF files.
To specify PDF format output, use the --format pdf parameter or the shorter -f pdf version. In the following example, the -m switch creates a .ocr.pdf file for each document.
aluma read *.tif -f pdf -m
Using a read-profile
To specify a read-profile containing a language or languages to use for the read, use the -r switch with the name of the profile:
aluma read 001.tif -r read-profile-name