About this guide
Aluma makes it easy to reliably classify scanned or digital documents based on their content, even where the content is highly variable. Advanced machine learning technology means all you need to get going is a few samples of each document type.
In this guide we'll build and test a classifier that identifies a set of example US mortgage documents.
Working through this guide should take about 5 minutes.
Before you begin
Before you start you must have:
- Installed the Aluma CLI and logged in to connect it to your account
- Installed the example documents
If you have not done these steps, follow the Getting started guide and then return to this one.
Create a classifier
The US mortgage documents we're going to classify can be found in the `/mortgages` examples directory.
The mortgage examples are split into two sub-directories:
- `build` is a set of samples organised by document type, which have already been read (OCRed).
- `test` is a set of test documents.
Let's create a classifier called `mortgage-classifier` using a ZIP file of the samples. To do so, type the following command:
aluma create classifier mortgage-classifier examples/mortgages/build/samples.zip
Sample documents must already have been read
All sample documents used for building a classifier must already have content. For image-based documents that means they must have been read (OCRed) and included as either text files or PDFs with content. Classifier creation is fastest with text file samples, and for large sets of samples text files are much smaller and easier to work with. If your documents are images or PDFs without content, you can easily generate text files or PDFs with content using the CLI.
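If you are assembling your own samples, the ZIP file can be built with a short script. The following is a minimal sketch, not part of the Aluma tooling: it assumes the samples sit in a directory with one sub-directory per document type (mirroring the "organised by document type" layout described above) and that samples are `.txt` or `.pdf` files. The function name `build_samples_zip` and the suffix list are hypothetical, and the script cannot tell whether a PDF actually has content.

```python
import zipfile
from pathlib import Path

# Assumption: samples with content are supplied as text files or PDFs.
SAMPLE_SUFFIXES = {".txt", ".pdf"}


def build_samples_zip(samples_dir: str, zip_path: str) -> list[str]:
    """Zip the samples under samples_dir, preserving sub-directory names.

    Assumes one sub-directory per document type. Returns the archive
    member names that were written, in sorted order.
    """
    root = Path(samples_dir)
    written = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(root.rglob("*")):
            # Skip directories and anything that is not a text file or PDF.
            if path.is_file() and path.suffix.lower() in SAMPLE_SUFFIXES:
                name = path.relative_to(root).as_posix()
                archive.write(path, arcname=name)
                written.append(name)
    return written
```

The returned list makes it easy to confirm which files went into the archive before passing it to `aluma create classifier`.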
Now that we have a classifier, let's test it on our test files.
To classify files, we provide the name of the classifier and either a single filename or a file pattern that matches multiple files.
The test directory contains a mix of scanned documents: some have already been OCRed (PDFs) and some have not (TIFs). The service automatically OCRs documents when necessary.
You can classify all the files in the test directory and write the results to the console with the following command:
aluma classify mortgage-classifier examples/mortgages/test/*.*
The `classify` command streams results to the console as each file is classified. The files are processed in parallel, so the order of the results may vary. You will see output like this:
FILE                             DOCUMENT TYPE                CONFIDENT
Notice of Lien.pdf               Notice of Lien               true
Assignment of Deed of Trust.pdf  Assignment of Deed of Trust  true
Notice of Default.pdf            Notice of Default            true
Deed of Trust.pdf                Deed of Trust                true
Correspondence.pdf               Correspondence               true
Notice of Default.tif            Notice of Default            true
Assignment of Deed of Trust.tif  Assignment of Deed of Trust  true
Notice of Lien.tif               Notice of Lien               true
Correspondence.tif               Correspondence               true
Deed of Trust.tif                Deed of Trust                true
Here we can see that using the classifier we built from our sample documents, the service has correctly identified the document type of each of the new documents we processed.
The output includes the classifier's determination of the document's type and whether it is confident in its answer.
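If you capture the console output in a script, the three columns can be turned into records and filtered on the confidence flag. This is a minimal sketch under stated assumptions: it assumes the columns are separated by a tab or by two or more spaces and that the first line is the header, which may not match the CLI's exact formatting. The names `Result`, `parse_classify_output` and `needs_review` are hypothetical helpers, not part of the Aluma CLI.

```python
import re
from typing import NamedTuple


class Result(NamedTuple):
    file: str
    document_type: str
    confident: bool


def parse_classify_output(text: str) -> list[Result]:
    """Parse column-aligned classify output into Result records.

    Assumption: columns are separated by a tab or by two or more spaces,
    and the first line is the FILE / DOCUMENT TYPE / CONFIDENT header.
    """
    results = []
    for line in text.strip().splitlines()[1:]:  # skip the header row
        parts = re.split(r"\s*\t\s*| {2,}", line.strip())
        if len(parts) != 3:
            continue  # ignore lines that do not look like a result row
        file, document_type, confident = parts
        results.append(Result(file, document_type, confident.lower() == "true"))
    return results


def needs_review(results: list[Result]) -> list[Result]:
    """Return the results the classifier was not confident about."""
    return [r for r in results if not r.confident]


# Two rows taken from the example output shown in this guide.
sample = """\
FILE                             DOCUMENT TYPE                CONFIDENT
Notice of Lien.pdf               Notice of Lien               true
Deed of Trust.tif                Deed of Trust                true
"""

results = parse_classify_output(sample)
to_review = needs_review(results)
```

Splitting on runs of whitespace rather than single spaces matters here because both file names and document types contain spaces.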
You can learn more about classification in these articles:
- Classification overview - introduces classification concepts and capabilities
- Preparing sample documents - explains how many sample documents you need in different scenarios and how to prepare them effectively
- Classification results - explains how to interpret and use classification results