classify

The classify command classifies one or more files using a classifier and writes the results to the console, to a single file, or to one output file per input file. Results can be written in table, CSV, or JSON format.

Basic usage

To classify files using a classifier and write the results to the console, specify the file or files to classify and the name of the classifier:

aluma classify classifier-name *.pdf

See Selecting files to process for examples of how to use file patterns to select multiple files.

The default output format is the table format. It is easy to read but includes only the most commonly used output fields: the "relative confidence" and the individual document type scores are omitted.

FILE                                 DOCUMENT TYPE                    CONFIDENT
00001a.pdf                           Expenses                         true
00001b.pdf                           Invoice                          true

Writing results in CSV format

Classification results can be written in CSV format. This format makes it easy to pass the output to other commands and tools that need to process it.

To specify CSV format output, use the --format csv parameter or the shorter -f csv version.

aluma classify classifier-name *.pdf -f csv
C:\examples\00001a.pdf,Expenses,true,1.619
C:\examples\00001b.pdf,Invoice,true,3.159

CSV format output includes the following fields (in this order):

  • Full file path
  • Document type
  • Confident (true or false)
  • Relative confidence

You can pipe CSV format output to the PowerShell ConvertFrom-Csv cmdlet to select specific results from the output of the classify command. The following example selects the results where the document type is "Invoice".

aluma classify myclassifier *.* -f csv | `
   ConvertFrom-Csv -Header "File", "Type", "Confident", "Confidence" | `
   where { $_.Type -eq "Invoice" }
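The same filtering can be done outside PowerShell. The following Python snippet is a minimal sketch, assuming the CSV output has no header row and uses the field order listed above. The names ClassifyRow and parse_classify_csv are hypothetical helpers, not part of aluma, and the sample text is modeled on the example output rather than captured from a real run.

```python
import csv
import io
from typing import NamedTuple


class ClassifyRow(NamedTuple):
    """One line of CSV output (hypothetical helper type, not part of aluma)."""
    file: str
    document_type: str
    confident: bool
    relative_confidence: float


def parse_classify_csv(text: str) -> list[ClassifyRow]:
    """Parse headerless CSV text into typed rows, assuming the documented field order."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        if not record:  # skip blank lines
            continue
        path, doc_type, confident, confidence = record
        rows.append(
            ClassifyRow(
                file=path,
                document_type=doc_type,
                confident=confident.strip().lower() == "true",
                relative_confidence=float(confidence),
            )
        )
    return rows


# Sample text modeled on the example output above (an assumption, not real output).
sample = (
    "C:\\examples\\00001a.pdf,Expenses,true,1.619\n"
    "C:\\examples\\00001b.pdf,Invoice,true,3.159\n"
)

# Keep only the rows classified as "Invoice".
invoices = [row for row in parse_classify_csv(sample) if row.document_type == "Invoice"]
for row in invoices:
    print(row.file, row.relative_confidence)
```

In practice the text passed to parse_classify_csv would come from the captured output of the classify command rather than from a hard-coded string.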

Writing results in JSON format

Classification results can be written in a JSON format that contains all available output fields. This format is intended for commands and tools that process the output and need access to the more advanced output fields.

To specify JSON format output, use the --format json parameter or the shorter -f json version.

aluma classify classifier-name *.pdf -f json

The JSON output is an array of results. The exception is when the --multiple-files/-m parameter is used to write one result file per input file, in which case each file contains a single result rather than an array:

[{
  "filename": "C:\\examples\\00001a.pdf",
  "classification_results": {
    "document_type": "Expenses",
    "is_confident": true,
    "relative_confidence": 1.6189158,
    "document_type_scores": [
      {
        "document_type": "Expenses",
        "score": 49.467617
      },
      {
        "document_type": "Invoice",
        "score": 34.63108
      }
    ]
  }
},
{                                                                                  
  "filename": "C:\\examples\\00001b.pdf",
  ...
}]
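As an illustration of consuming this output, the following Python snippet is a minimal sketch that accepts either shape (an array of results or a single result) and ranks the document type scores for each file. It assumes only the field names shown in the example above. The functions load_results and top_scores are hypothetical helpers, not part of aluma, and the sample contains just the first result from the example.

```python
import json


def load_results(text: str) -> list[dict]:
    """Return a list of results whether the JSON is an array or a single result."""
    data = json.loads(text)
    return data if isinstance(data, list) else [data]


def top_scores(result: dict, n: int = 2) -> list[tuple[str, float]]:
    """Return the n highest-scoring document types for one result."""
    scores = result["classification_results"]["document_type_scores"]
    ranked = sorted(scores, key=lambda s: s["score"], reverse=True)
    return [(s["document_type"], s["score"]) for s in ranked[:n]]


# Sample modeled on the first result in the example above.
# The raw string keeps the JSON backslash escapes intact.
sample = r'''
[{
  "filename": "C:\\examples\\00001a.pdf",
  "classification_results": {
    "document_type": "Expenses",
    "is_confident": true,
    "relative_confidence": 1.6189158,
    "document_type_scores": [
      {"document_type": "Expenses", "score": 49.467617},
      {"document_type": "Invoice", "score": 34.63108}
    ]
  }
}]
'''

for result in load_results(sample):
    classification = result["classification_results"]
    print(result["filename"], classification["document_type"], top_scores(result))
```

Normalising both shapes to a list means the same processing code can handle console output and the per-file results written with --multiple-files/-m.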

Using a read-profile to classify non-English documents

If a document must be read before it can be classified, you can specify a read-profile containing the language or languages to use. To do so, use the -r switch with the name of the profile:

aluma classify classifier-name 001.tif -r read-profile-name

See also