This article is an advanced explanation of the different elements of a classification result, how they are calculated, and how to use them.
When you classify a document you get a result that includes three core properties:
- The document type that Aluma believes the document to be
- A boolean is confident flag that indicates whether Aluma is confident that the document type is correct
- A numerical relative confidence score that indicates how confident Aluma is that the document type is correct
Additionally, if you are using the Aluma CLI JSON output format or calling the API directly, you get an array of scores, one for each document type that the classifier was trained on.
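To make the shape of such a result concrete, here is a minimal sketch in Python that parses an illustrative result. The field names and values are assumptions for the example, not the exact Aluma API property names:

```python
import json

# Hypothetical classification result in the shape described above.
# Field names and values are illustrative, not exact Aluma API output.
result_json = """
{
  "document_type": "Invoice",
  "is_confident": true,
  "relative_confidence": 1.27,
  "document_type_scores": [
    {"document_type": "Invoice", "score": 61.3},
    {"document_type": "Credit Note", "score": 24.9},
    {"document_type": "Purchase Order", "score": 8.1}
  ]
}
"""

result = json.loads(result_json)
print(result["document_type"], result["is_confident"])

# Scores arrive in descending order, one entry per trained document type.
for entry in result["document_type_scores"]:
    print(f'{entry["document_type"]}: {entry["score"]}')
```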
Document type scores are an initial indicator of how likely it is that the document being classified belongs to each particular document type. They are numbers between 0 and 100, presented in descending order.
The largest document type score determines the "best guess" of the classifier, which is returned as the document type result property.
The individual scores are used as inputs to the confidence algorithm, whose result is returned in the is confident flag. This flag indicates whether the result should be trusted as "correct" by any subsequent process.
The flag is calculated as follows:
is confident = (confidence >= threshold[best document type])
The is confident flag is true if the confidence meets or exceeds a threshold. Thresholds are decided per document type, using a process called ‘auto-tuning’. Auto-tuning uses several statistical techniques to set thresholds that should enable the classifier to consistently achieve an accuracy rate at or above a predefined target.
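A minimal sketch of this thresholding in Python, assuming the per-document-type thresholds from auto-tuning are already known (the threshold values and the `confidence` input here are made up for illustration):

```python
# Illustrative per-document-type thresholds, as auto-tuning might produce.
# These values are made up for the example.
thresholds = {"Invoice": 48.0, "Credit Note": 55.0}

def is_confident(scores: dict, confidence: float) -> bool:
    """Return True when the confidence meets or exceeds the threshold
    for the best-scoring (i.e. predicted) document type."""
    best_type = max(scores, key=scores.get)
    return confidence >= thresholds[best_type]

print(is_confident({"Invoice": 61.3, "Credit Note": 24.9}, confidence=50.0))  # True
```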
Here, accuracy is the proportion of confident classifications for which the result is correct. The default target accuracy is 95% (but this may be set manually).
For a target accuracy of 95%, if is confident is true and the training documents are a representative sample of the documents you will classify, then you can expect the classifier to be wrong no more than 5% of the time.
The relative confidence property is a numerical score that indicates how confident Aluma is that the document type is correct.
The score is calculated as follows:
relative confidence = confidence / threshold[best document type]
When relative confidence is greater than or equal to 1.0, is confident will be set to true.
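A small sketch of the relationship between the two formulas (the threshold value here is illustrative, not a real auto-tuned value):

```python
def relative_confidence(confidence: float, threshold: float) -> float:
    # relative confidence = confidence / threshold[best document type]
    return confidence / threshold

# With an illustrative threshold of 48.0 for the best document type:
rc = relative_confidence(confidence=60.0, threshold=48.0)
print(rc)         # 1.25
print(rc >= 1.0)  # True: is confident would be set for this result
```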
In most cases, the document type and is confident properties are the only ones you should use in production.
If is confident is false, you should not trust document type. Depending on your scenario, you may want to route the document to a person for review, or handle it in some other way.
In general, unconfident classifications indicate that the content of the document is substantially different from that of any document in your sample set. You may therefore also wish to capture these documents, analyse them, and, where appropriate, add some to your samples and retrain your classifier.
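Putting this guidance together, a downstream process might branch on the flag like this (field names are illustrative, not exact API property names):

```python
def route(result: dict) -> str:
    """Trust confident classifications; send everything else to a
    person for review (or another fallback of your choosing)."""
    if result.get("is_confident"):
        return f"process-as:{result['document_type']}"
    return "manual-review"

print(route({"document_type": "Invoice", "is_confident": True}))
print(route({"document_type": "Invoice", "is_confident": False}))
```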