extract
The extract command extracts data from one or more files using an extractor and writes the results to the console, to a single file, or to one output file per input file. Results can be written in table, CSV, or JSON format.
Basic usage
To extract data from files and write the results to the console, specify the file or files to extract and the name of the extractor to use:
aluma extract extractor-name *.tif
See Selecting files to process for examples of how to use file patterns to select multiple files.
The default output format is the table format. It is easy to read, but less well suited than CSV or JSON to scripting or integration with other applications.
File          Name
doc-file.tif  NA MADSEN, BRENDA J. CATE, Fannie Mae, Freddie Mac, Flint, MI, R. Part
Writing results in CSV format
Extraction results can be written in CSV format, which makes it easy to pass the output to other commands and tools for further processing.
To specify CSV format output, use the --format csv parameter or the shorter -f csv version:
aluma extract extractor-name *.tif -f csv
Filename,Name
/path/to/documents/doc-file.tif,"NA MADSEN|BRENDA J. CATE|Fannie Mae|Freddie Mac|Flint, MI|R. Part"
CSV format output includes the following fields, in this order: the full path to the file, then all non-rejected fields in the extraction results. If a field has multiple results, they are pipe (|) delimited.
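As a rough illustration of consuming this output, the sketch below parses the sample CSV shown above with Python's standard csv module and splits multi-value fields on the pipe character. The function name parse_extract_csv is hypothetical, and the code assumes the first column is headed Filename as in the sample and that field values never contain a literal pipe.

```python
import csv
import io


def parse_extract_csv(text):
    """Parse CSV text into a list of dicts.

    Assumption: the first column is headed "Filename" (as in the sample
    output) and every other column may hold pipe-delimited results.
    """
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        parsed = {"Filename": row["Filename"]}
        for field, value in row.items():
            if field == "Filename":
                continue
            # An empty cell means no results for that field.
            parsed[field] = value.split("|") if value else []
        rows.append(parsed)
    return rows


# Sample taken from the CSV output shown above.
sample = (
    "Filename,Name\n"
    '/path/to/documents/doc-file.tif,'
    '"NA MADSEN|BRENDA J. CATE|Fannie Mae|Freddie Mac|Flint, MI|R. Part"\n'
)

rows = parse_extract_csv(sample)
for value in rows[0]["Name"]:
    print(value)
```

Because the CSV quoting keeps the comma in "Flint, MI" inside a single cell, splitting on the pipe alone is enough to recover the individual results.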
Writing results in JSON format
Extraction results can be written in a JSON format that contains all available extraction data. This format is intended for commands and tools that need to process the output and need access to the more advanced properties.
To specify JSON format output, use the --format json parameter or the shorter -f json version:
aluma extract extractor-name *.tif -f json
The JSON is an array of results, except when using the --multiple-files/-m parameter to write one result file per input file, in which case each file contains a single result:
[{
  "filename": "C:\\examples\\00001a.pdf",
  "classification_results": {
    "document_type": "Expenses",
    "is_confident": true,
    "relative_confidence": 1.6189158,
    "document_type_scores": [
      {
        "document_type": "Expenses",
        "score": 49.467617
      },
      {
        "document_type": "Invoice",
        "score": 34.63108
      }
    ]
  }
},
{
  "filename": "C:\\examples\\00001b.pdf",
  ...
}]
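Since the output is either an array or a single object depending on whether --multiple-files/-m is used, a consumer may want to normalise both shapes before processing. The following is a minimal sketch under that assumption; load_results and summarize are hypothetical helper names, and the property names are only those visible in the example above, not a complete schema.

```python
import json


def load_results(text):
    """Return a list of results whether the JSON is an array or one object."""
    data = json.loads(text)
    return data if isinstance(data, list) else [data]


def summarize(result):
    """Pick out a few properties seen in the example output above.

    Assumption: these keys may be absent, so missing values become None.
    """
    details = result.get("classification_results", {})
    return (
        result.get("filename"),
        details.get("document_type"),
        details.get("is_confident"),
    )


# Shortened version of the example output; the raw string keeps the
# JSON-escaped backslashes intact.
sample = r'''
[{
  "filename": "C:\\examples\\00001a.pdf",
  "classification_results": {
    "document_type": "Expenses",
    "is_confident": true,
    "relative_confidence": 1.6189158
  }
}]
'''

for result in load_results(sample):
    print(summarize(result))
```

Wrapping a single object in a list means the same loop works for both the combined output and the per-file output.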
Talk to us if you are using JSON output
We recommend that you chat to us if you think you need to use the advanced properties in the JSON output, so we can help make sure your configuration is optimised and you are using the properties in the correct way. Just reach out to us at [email protected].
Using a read-profile to extract from non-English documents
To specify a read-profile containing a language or languages to use if the document must be read before extraction, use the -r switch with the name of the profile:
aluma extract extractor-name 001.tif -r read-profile-name
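When the command is driven from a script, it can help to assemble the argument list in one place. The sketch below only builds the list and does not run anything; build_extract_args is a hypothetical helper, and it uses only the -f and -r switches described on this page, placed after the file arguments as in the examples above.

```python
def build_extract_args(extractor, files, output_format=None, read_profile=None):
    """Build an argument list for the extract command.

    Assumption: only the csv and json values documented on this page are
    accepted for -f; omitting it leaves the default table format.
    """
    args = ["aluma", "extract", extractor, *files]
    if output_format is not None:
        if output_format not in ("csv", "json"):
            raise ValueError(f"unsupported format: {output_format}")
        args += ["-f", output_format]
    if read_profile is not None:
        args += ["-r", read_profile]
    return args


args = build_extract_args(
    "extractor-name", ["001.tif"], read_profile="read-profile-name"
)
print(" ".join(args))
```

The resulting list could be passed to subprocess.run; note that a pattern such as *.tif is not expanded by a shell in that case, so how it is interpreted depends on the tool itself.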