Document Classification
Classifiers are resources that understand the content of different document types and are used to classify (determine the document type of) new documents.
For example, a classifier for basic small-business documents might be able to classify document types like these:
- Expense reports
- Bank statements
- Utility bills
Classifiers use advanced machine-learning algorithms, specifically developed and optimised for document classification, and extensively proven in a wide range of demanding classification scenarios.
Before you create a classifier
Before you can create a classifier you need a set of sample documents. These samples are a representative selection of documents of each type that you want to be able to classify. Crucially, the samples are labelled with their document type, e.g. "these are samples of expenses, these are samples of statements" and so on.
A classifier is then created from (or more specifically "trained on") the sample documents. During training the classifier analyses the content of each sample document, identifies the content that distinguishes each document type, and auto-tunes itself for optimum performance.
The unique machine-learning algorithms used by Aluma classifiers enable robust training on small numbers of samples.
Classifiers can also handle variants of a document that you regard as the same document type from a business perspective, even when those variants share little or no content with each other.
You can find an explanation of how many samples you will need and how to organise them ready for classifier creation in the Preparing sample documents article.
Creating a classifier
Once you have your sample documents there are three ways to create a classifier:
- Use the CLI's create classifier command
- Make a request to the Add samples from ZIP file API endpoint
- Make multiple requests to the Add single sample file API endpoint, one for each sample
Using a classifier
Once you have a classifier, you use it to classify new documents. When you classify a document you get a result that includes three core pieces of data:
- The document type that Aluma believes the document to be
- A boolean is confident flag that indicates whether Aluma is confident that the document type is correct
- A numerical relative confidence score that gives an indication of how confident Aluma is that the document type is correct
In most cases, the document type and the is confident flag are the only values you should use in production.
If is confident is false, you should not trust the document type. Depending on your scenario, you may want to route the document to a person for review, or handle it differently in some other way.
In general, unconfident classifications indicate that the content of the document is substantially different from any documents contained in your sample set. Therefore you may also wish to capture these documents, analyse them and if appropriate add some to your samples and retrain your classifier.
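The handling described above can be sketched in a few lines of Python. This is only an illustration: the `ClassificationResult` class, its field names, and the queue names are hypothetical and are not taken from the Aluma API, so map them onto whatever your client actually returns.

```python
from dataclasses import dataclass


@dataclass
class ClassificationResult:
    # Hypothetical container for the three core pieces of data in a result.
    document_type: str
    is_confident: bool
    relative_confidence: float


def route(result: ClassificationResult) -> str:
    """Decide where a classified document should go next."""
    if result.is_confident:
        # Trust the document type only when the result is confident.
        return f"auto:{result.document_type}"
    # Unconfident results go to a person; these documents are also
    # candidates for adding to the sample set before retraining.
    return "manual-review"
```

Note that the sketch branches only on the is confident flag and does not apply its own threshold to the relative confidence score.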
You can read more about how scores and confidence are calculated in this article.
Understanding classifier accuracy
One of the unique capabilities of Aluma is that during the training of a classifier, the classifier is automatically tuned to consistently achieve or exceed a target accuracy.
Classification accuracy is the proportion of confident classifications for which the result is correct.
The default target accuracy is 95%. This means that if a classification result is confident, and the training documents are a representative sample of the documents you will classify, you can expect the classifier to be wrong no more than 5% of the time.
In general, there is a trade-off between target accuracy and the proportion of documents that are confidently classified. Increasing target accuracy will result in fewer documents being confidently classified. The extent of the trade-off depends on the complexity of the document set.
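To make the two quantities in this trade-off concrete, here is a minimal sketch that computes them from a set of results you have checked by hand. The function name and the tuple layout are assumptions made for this example; they are not part of Aluma.

```python
def accuracy_and_confident_rate(results):
    """Compute accuracy and the proportion of confident classifications.

    `results` is an iterable of (is_confident, predicted_type, true_type)
    tuples, a hypothetical format for manually verified results.
    """
    results = list(results)
    confident = [r for r in results if r[0]]
    correct = sum(1 for _, predicted, actual in confident if predicted == actual)
    # Accuracy counts only confident classifications, as defined above.
    accuracy = correct / len(confident) if confident else None
    # Proportion of all documents that were confidently classified.
    confident_rate = len(confident) / len(results) if results else None
    return accuracy, confident_rate


checked = [
    (True, "Utility bills", "Utility bills"),
    (True, "Bank statements", "Bank statements"),
    (True, "Expense reports", "Bank statements"),
    (False, "Expense reports", "Utility bills"),
]
accuracy, confident_rate = accuracy_and_confident_rate(checked)
```

In this made-up data, three of the four documents are confidently classified and two of those three are correct. The unconfident result does not count against accuracy, even though its document type is wrong.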
Optimising classifier performance
If you want to extract every ounce of performance from your classifier, there are a number of techniques available to you.
These techniques can increase the proportion of confident classifications for any given target accuracy and generally aim to identify and resolve one of these possible issues with your document samples:
- There aren't enough samples for one or more document types
- Some samples are labelled with incorrect document types
We have a variety of tools available to help with this exercise and would be delighted to help you understand what the best approach is for your particular documents and support you with ensuring you have an optimally-tuned classifier. Please contact us at [email protected].
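As a first check for the "not enough samples" issue listed above, you could count your labelled samples per document type before training. The sketch below is only an illustration: the function name and the minimum of 3 are arbitrary choices for the example, not Aluma recommendations. The Preparing sample documents article explains how many samples you actually need.

```python
from collections import Counter


def under_sampled_types(labels, minimum):
    """Return the document types that have fewer than `minimum` samples.

    `labels` holds one document-type label per sample document.
    Types with no samples at all cannot be detected from the labels alone.
    """
    counts = Counter(labels)
    return sorted(doc_type for doc_type, n in counts.items() if n < minimum)


sample_labels = [
    "Expense reports", "Expense reports", "Expense reports",
    "Bank statements", "Bank statements",
    "Utility bills",
]
needs_more = under_sampled_types(sample_labels, minimum=3)
```

With this made-up data, `needs_more` contains "Bank statements" and "Utility bills".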
Classifying documents in different (non-English) languages
Aluma document classification can be used with all documents, whichever language (or languages) they contain.
If your documents are image-based, and therefore need to be read by Aluma before document classification, you should specify the appropriate language(s) by creating and using a read-profile, so that you get optimal results. See Read documents with different languages for instructions on how to do this.