Classify documents

About this guide

Aluma makes it easy to reliably classify scanned or digital documents based on their content, even where the content is highly variable. Advanced machine learning technology means all you need to get going is a few samples of each document type.

In this guide, we'll build and test a classifier that can identify a set of example US mortgage documents.

Working through this guide should take about 5 minutes.

Before you begin

Before you start you must have:

  • Installed the Aluma CLI and logged in to connect it to your account
  • Installed the example documents

If you have not completed these steps, follow the Getting started guide and then return to this one.

Create a classifier

The US mortgage documents we're going to classify can be found in the examples/mortgages directory.

The mortgage examples are split into two sub-directories:

  • build contains a set of samples organised by document type, which have already been read (OCRed).
  • test contains a set of test documents.

Let's create a classifier called mortgage-classifier using a ZIP file of the samples. To do so, type the following command:

aluma create classifier mortgage-classifier examples/mortgages/build/samples.zip

🚧

Sample documents must already have been read

All sample documents used for building a classifier must already have content. For image-based documents, that means they must have been read (OCRed) and included as either text files or PDFs with content. Classifier creation is fastest with text file samples, and for large sets of samples, text files are much smaller and easier to work with. If your documents are images or PDFs without content, you can use the CLI to generate text files or PDFs with content.
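
Before zipping your own samples, a quick pre-flight check can catch files that may still need reading. The sketch below is not part of the Aluma CLI: check_samples is a hypothetical helper, and the layout it expects (one sub-directory per document type) is an assumption based on the description of the build directory above. It looks only at file extensions, so it cannot tell whether a PDF actually has content.

```shell
# Hypothetical helper, not an Aluma command.
# Assumed layout (not confirmed by this guide):
#   <samples_dir>/<document type>/<sample file>
# Prints a per-type count of .txt/.pdf samples and warns about other files.
# Returns 1 if any file with another extension is found, otherwise 0.
check_samples() {
  samples_dir=$1
  status=0
  for type_dir in "$samples_dir"/*/; do
    [ -d "$type_dir" ] || continue
    doc_type=$(basename "$type_dir")
    # Files that could be used as samples (text files or PDFs).
    usable=$(find "$type_dir" -type f \( -iname '*.txt' -o -iname '*.pdf' \) | wc -l | tr -d ' ')
    # Anything else (for example TIFs) probably needs reading (OCR) first.
    other=$(find "$type_dir" -type f ! -iname '*.txt' ! -iname '*.pdf' | wc -l | tr -d ' ')
    printf '%s: %s usable sample(s)\n' "$doc_type" "$usable"
    if [ "$other" -gt 0 ]; then
      printf '  warning: %s file(s) are not .txt or .pdf\n' "$other"
      status=1
    fi
  done
  return "$status"
}

# Example (the path is illustrative):
#   check_samples path/to/my-samples
```

A non-zero return value only signals that some files have an unexpected extension; it says nothing about the quality of the samples themselves.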

Classify documents

Now that we have a classifier, let's test it on our test files.

To classify files, we provide the name of the classifier and either a single filename or a file pattern that matches multiple files.

The test directory contains a mix of scanned documents: some have already been OCRed (PDFs) and some have not (TIFs). The service automatically OCRs documents when necessary.

You can classify all the files in the test directory and write the results to the console with the following command:

aluma classify mortgage-classifier examples/mortgages/test/*.*

The classify command streams results to the console as each file is classified. Because the files are processed in parallel, the order of the results may vary. You will see output like this:

FILE                                 DOCUMENT TYPE                    CONFIDENT
Notice of Lien.pdf                   Notice of Lien                   true
Assignment of Deed of Trust.pdf      Assignment of Deed of Trust      true
Notice of Default.pdf                Notice of Default                true
Deed of Trust.pdf                    Deed of Trust                    true
Correspondence.pdf                   Correspondence                   true
Notice of Default.tif                Notice of Default                true
Assignment of Deed of Trust.tif      Assignment of Deed of Trust      true
Notice of Lien.tif                   Notice of Lien                   true
Correspondence.tif                   Correspondence                   true
Deed of Trust.tif                    Deed of Trust                    true

Here we can see that, using the classifier we built from our sample documents, the service has correctly identified the document type of each new document we processed.

For each file, the output shows the document type the classifier assigned and whether the classifier is confident in its answer.
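
When classifying many files, it can help to list only the results that need a second look. The sketch below assumes the console output keeps the three-column layout shown above, with columns separated by two or more spaces. needs_review is a hypothetical helper, not an Aluma command, and whether the layout is identical when the output is piped is an assumption you should verify.

```shell
# Hypothetical helper, not an Aluma command.
# Reads classify results on stdin and prints "file -> document type"
# for every row whose CONFIDENT column is anything other than "true".
# Assumes columns are separated by two or more spaces, as in the
# example output above (file names may contain single spaces).
needs_review() {
  awk -F'  +' '
    { sub(/[ \t\r]+$/, "") }   # trim trailing whitespace (re-splits fields)
    $1 == "FILE" { next }      # skip the header row
    NF >= 3 && $3 != "true" { print $1 " -> " $2 }
  '
}

# Example, assuming piped output matches the console layout:
#   aluma classify mortgage-classifier examples/mortgages/test/*.* | needs_review
```

For the example run shown above, where every result is confident, this prints nothing.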

You can learn more about classification in these articles: