extract

The extract command extracts data from one or more files using an extractor and writes the results to the console, to a single file, or to one output file per input file. Results can be written in table, CSV, or JSON format.

Basic usage

To extract data from files and write the results to the console, specify the file or files to extract and the name of the extractor to use:

aluma extract extractor-name *.tif

See Selecting files to process for examples of how to use file patterns to select multiple files.

The default output format is the table format. It is easy to read, but less well suited to scripting or integration with other applications than CSV or JSON.

File                                Name
doc-file.tif                        NA MADSEN, BRENDA J. CATE,     
                                    Fannie Mae, Freddie Mac, Flint,
                                    MI, R. Part

Writing results in CSV format

Extraction results can be written in CSV format, which makes it easy to pass the output to other commands and tools for further processing.

To specify CSV format output, use the --format csv parameter or the shorter -f csv version.

aluma extract extractor-name *.tif -f csv
Filename,Name
/path/to/documents/doc-file.tif,"NA MADSEN|BRENDA J. CATE|Fannie Mae|Freddie Mac|Flint, MI|R. Part"

CSV format output contains the full path to each file, followed by all non-rejected fields in the extraction results. If a field has multiple results, they are pipe (|) delimited.
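Because multiple results for a field share one CSV cell, a consumer usually needs a second split after the normal CSV parse. The following is a minimal Python sketch of that step, not part of the aluma tooling. The function name parse_extract_csv is hypothetical, and the sketch assumes that the first column is always named Filename (as in the sample above) and that individual values never contain a pipe character themselves.

```python
import csv
import io


def parse_extract_csv(text):
    """Parse `aluma extract ... -f csv` output into a list of dicts.

    Assumption: the path column is named "Filename", as in the sample output.
    Every other column is split on "|" into a list of values.
    """
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        parsed = {"Filename": record["Filename"]}
        for field, value in record.items():
            if field is None or field == "Filename":
                continue
            # Multiple results for one field are pipe-delimited.
            parsed[field] = value.split("|") if value else []
        rows.append(parsed)
    return rows


# Sample taken from the CSV output shown above.
sample = (
    "Filename,Name\n"
    '/path/to/documents/doc-file.tif,'
    '"NA MADSEN|BRENDA J. CATE|Fannie Mae|Freddie Mac|Flint, MI|R. Part"\n'
)

rows = parse_extract_csv(sample)
for name in rows[0]["Name"]:
    print(name)
```

Note that the comma in "Flint, MI" survives because the CSV parser handles the quoted cell before the pipe split is applied.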

Writing results in JSON format

Extraction results can be written in a JSON format that contains all available extraction data. This format is intended for other commands and tools that need to process the output and need access to the more advanced properties.

To specify JSON format output, use the --format json parameter or the shorter -f json version.

aluma extract extractor-name *.tif -f json

The JSON is an array of results. The exception is when the --multiple-files/-m parameter is used to write one result file per input file, in which case the JSON in each file is a single result:

[{
  "filename": "C:\\examples\\00001a.pdf",
  "classification_results": {
    "document_type": "Expenses",
    "is_confident": true,
    "relative_confidence": 1.6189158,
    "document_type_scores": [
      {
        "document_type": "Expenses",
        "score": 49.467617
      },
      {
        "document_type": "Invoice",
        "score": 34.63108
      }
    ]
  }
},
{
  "filename": "C:\\examples\\00001b.pdf",
  ...
}]
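Since the top-level JSON value is an array in one mode and a single object in the other, a script that handles both can normalise the output to a list first. The following is a minimal Python sketch of that idea. The function name load_results is hypothetical, and only the filename property shown in the example above is relied on.

```python
import json


def load_results(text):
    """Return extraction results as a list, whichever output mode produced them.

    An array (the default) is returned as-is; a single result object
    (as written per file with --multiple-files/-m) is wrapped in a list.
    """
    data = json.loads(text)
    return data if isinstance(data, list) else [data]


# Trimmed-down inputs that keep only the "filename" property.
array_output = r'[{"filename": "C:\\examples\\00001a.pdf"}, {"filename": "C:\\examples\\00001b.pdf"}]'
single_output = r'{"filename": "C:\\examples\\00001a.pdf"}'

for result in load_results(array_output) + load_results(single_output):
    print(result["filename"])
```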

📘

Talk to us if you are using JSON output

We recommend that you chat to us if you think you need to use the advanced properties in the JSON output, so we can help make sure your configuration is optimised and you are using the properties in the correct way. Just reach out to us at [email protected].

Using a read-profile to extract from non-English documents

If the document must be read before extraction, you can specify a read-profile containing the language or languages to use. Use the -r switch with the name of the profile:

aluma extract extractor-name 001.tif -r read-profile-name
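When the command is driven from a script, it can help to assemble the arguments in one place. The following is a minimal Python sketch that only builds the argument list from the switches described on this page (-f and -r). The function name build_extract_command is hypothetical, and combining -f and -r in a single invocation is an assumption rather than something this page confirms.

```python
import shlex


def build_extract_command(extractor, files, output_format=None, read_profile=None):
    """Build an argument list for `aluma extract`.

    Only the -f and -r switches described on this page are used.
    Assumption: the two switches can be combined in one invocation.
    """
    args = ["aluma", "extract", extractor, *files]
    if output_format is not None:
        args += ["-f", output_format]
    if read_profile is not None:
        args += ["-r", read_profile]
    return args


command = build_extract_command(
    "extractor-name",
    ["001.tif"],
    output_format="json",
    read_profile="read-profile-name",
)
print(shlex.join(command))
# prints: aluma extract extractor-name 001.tif -f json -r read-profile-name
```

Assuming aluma is installed and on the PATH, the resulting list could be passed to subprocess.run(command, capture_output=True, text=True) and the captured standard output parsed according to the chosen format.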

See also