Classify documents
About this guide
Aluma makes it easy to reliably classify scanned or digital documents based on their content, even where the content is highly variable. Advanced machine learning technology means all you need to get going is a few samples of each document type.
In this guide we'll build and test a classifier that can identify a set of example US mortgage documents.
Working through this guide should take about 5 minutes.
Before you begin
Before you start you must have:
- Installed the Aluma CLI and logged in to connect it to your account
- Installed the example documents
If you have not done these steps, follow the Getting started guide and then return to this one.
Create a classifier
The US mortgage documents we're going to classify can be found in the `/mortgages` examples directory.
The mortgage examples are split into two sub-directories:
- build is a set of samples organised by document type, which have already been read (OCRed).
- test is a set of test documents.
Let's create a classifier called mortgage-classifier using a ZIP file of the samples. To do so, type the following command:
aluma create classifier mortgage-classifier examples/mortgages/build/samples.zip
Sample documents must already have been read
All sample documents used for building a classifier must already have content. For image-based documents, that means they must have been read (OCRed) and included as either text files or PDFs with content. Classifier creation is fastest with text file samples, and for large sets of samples text files are much smaller and easier to work with. If your documents are images or PDFs without content, you can easily generate text files or PDFs with content using the CLI.
Classify documents
Now that we have a classifier, let's test it on our test files.
To classify files, we provide the name of the classifier and either a single filename or a file pattern that matches multiple files.
The test directory contains a mix of scanned documents: some have already been OCRed (PDFs) and some have not (TIFs). The service will automatically OCR documents when necessary.
You can classify all the files in the test directory and write the results to the console with the following command:
aluma classify mortgage-classifier examples/mortgages/test/*.*
The classify command streams results to the console as each file is classified. The files are processed in parallel, so the order of the results may differ. You will see output like this:
FILE                             DOCUMENT TYPE                CONFIDENT
Notice of Lien.pdf               Notice of Lien               true
Assignment of Deed of Trust.pdf  Assignment of Deed of Trust  true
Notice of Default.pdf            Notice of Default            true
Deed of Trust.pdf                Deed of Trust                true
Correspondence.pdf               Correspondence               true
Notice of Default.tif            Notice of Default            true
Assignment of Deed of Trust.tif  Assignment of Deed of Trust  true
Notice of Lien.tif               Notice of Lien               true
Correspondence.tif               Correspondence               true
Deed of Trust.tif                Deed of Trust                true
Here we can see that using the classifier we built from our sample documents, the service has correctly identified the document type of each of the new documents we processed.
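Because the files are processed in parallel, the rows can arrive in a different order each time. If you want a stable order for comparing runs, one option is to sort the rows after the header. The sketch below is not part of the Aluma CLI: `sort_results` is a hypothetical helper name, and it assumes the table is written to standard output with the header on the first line.

```shell
# Hypothetical helper (not part of the Aluma CLI):
# keep the header line first, then sort the remaining rows.
sort_results() {
  IFS= read -r header || return 0   # first line is assumed to be the header
  printf '%s\n' "$header"
  sort                              # rows start with the file name, so this sorts by file
}

# Assumed usage (needs the Aluma CLI and the example documents):
# aluma classify mortgage-classifier examples/mortgages/test/*.* | sort_results
```

Sorting waits for all results before printing the rows, so it trades the streaming behaviour for a predictable order.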
The output includes the classifier's determination of the document's type and whether it is confident in its answer.
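When classifying many files, it can be useful to list only the results the classifier was not confident about. The following is a minimal sketch, not an Aluma feature: `unconfident_results` is a hypothetical helper name, and it assumes the table goes to standard output, the first line is the header, and the CONFIDENT value is the last whitespace-separated field on each row.

```shell
# Hypothetical helper (not part of the Aluma CLI):
# print only the rows whose last column is not "true".
unconfident_results() {
  # NR > 1 skips the header; NF > 0 skips blank lines;
  # $NF is the last field, assumed to be the CONFIDENT column.
  awk 'NR > 1 && NF > 0 && $NF != "true"'
}

# Assumed usage (needs the Aluma CLI and the example documents):
# aluma classify mortgage-classifier examples/mortgages/test/*.* | unconfident_results
```

With the example output shown above, every row is confident, so this filter would print nothing.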
You can learn more about classification in these articles:
- Classification overview - introduces classification concepts and capabilities
- Preparing sample documents - explains how many sample documents you need in different scenarios and how to prepare them effectively
- Classification results - explains how to interpret and use classification results