Automatic classification is the process of assigning a document a category based on its content or appearance. Classifiers learn the distinguishing features of different document types (categories), and can then be used to classify new documents.

There are two steps involved in preparing a classifier for use:

  1. Create a new classifier
  2. Add sample documents for each of your document types to the classifier

When you add the samples to the classifier, it will automatically learn the distinguishing features of each document type, and auto-tune itself for maximum accuracy. Once this is done, you can use the classifier to classify other documents.

Step 1: Create the classifier

To create a classifier, make a request to the Create Classifier endpoint, specifying a name for the classifier.
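
As a rough illustration of this step, the sketch below builds (but does not send) a request that would create a classifier. The base URL, the `/classifiers/{name}` path, the `POST` method and the bearer-token header are all placeholder assumptions, not taken from this guide. Check the Create Classifier endpoint reference for the real values.

```python
import urllib.parse
import urllib.request

# ASSUMPTION: placeholder base URL; replace with the real API host.
BASE_URL = "https://api.example.com"


def build_create_classifier_request(name: str, token: str) -> urllib.request.Request:
    """Build, without sending, a request that creates a classifier called `name`."""
    # ASSUMPTION: the path layout and HTTP method are illustrative only.
    url = f"{BASE_URL}/classifiers/{urllib.parse.quote(name, safe='')}"
    return urllib.request.Request(
        url,
        method="POST",
        # ASSUMPTION: bearer-token authentication.
        headers={"Authorization": f"Bearer {token}"},
    )


request = build_create_classifier_request("my-classifier", "YOUR_TOKEN")
print(request.get_method(), request.full_url)
```

Passing the returned object to `urllib.request.urlopen` would send it. Building the request separately makes it easy to inspect before any network call is made.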

🚧

Remember to add samples

If you try to classify a document using a classifier before adding samples to it, you will get classification results where the document type and document type scores are null.

Step 2: Add samples to the classifier

For classifiers to learn the distinguishing features of each type of document, they need to have multiple samples of those document types added to them. These samples should be real documents that are representative of those you want to classify in production.

The easiest way to add samples to a classifier is to create a ZIP-format file containing the samples, and make a request to the Add samples from ZIP file endpoint.
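
A minimal sketch of preparing such a ZIP file in memory is shown below. The one-folder-per-document-type layout is an assumption made for illustration. The layout the endpoint actually expects is described in the Add samples from ZIP file reference.

```python
import io
import zipfile


def build_samples_zip(samples: dict[str, dict[str, bytes]]) -> bytes:
    """Pack sample files into an in-memory ZIP archive.

    `samples` maps a document type to {file name: file contents}.
    ASSUMPTION: one folder per document type is an illustrative layout only.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for document_type, files in samples.items():
            for file_name, contents in files.items():
                archive.writestr(f"{document_type}/{file_name}", contents)
    # The archive is finalized when the `with` block exits; the buffer stays readable.
    return buffer.getvalue()


zip_bytes = build_samples_zip(
    {
        "Agreements": {"agreement-01.pdf": b"sample bytes"},
        "Expenses": {"expense-01.pdf": b"sample bytes"},
    }
)
print(len(zip_bytes), "bytes ready to upload")
```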

How many samples do I need?

The number of samples you need to add to a classifier for optimal performance depends on how variable the content of each document type is. Document types with highly variable content need more samples, so that the classifier can learn about that variety and what distinguishes each type from the others.

The table below gives some general guidelines, where:

  • Minimum = The approximate number required to get sensible results at least some of the time
  • Sufficient = The approximate number required to get good results most of the time
  • Optimal = The approximate number required to get the best results

Type of content            Minimum    Sufficient    Optimal
Mostly fixed content       1          5             10+
Some variable content      5          20            50+
Highly variable content    20         50            100+
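
One way to apply these guidelines before uploading samples is a small check like the sketch below. The thresholds are copied from the table. The function name and the tier labels are hypothetical and not part of the API.

```python
# Thresholds from the guideline table: (minimum, sufficient, optimal).
GUIDELINES = {
    "mostly fixed": (1, 5, 10),
    "some variable": (5, 20, 50),
    "highly variable": (20, 50, 100),
}


def sample_tier(content_type: str, sample_count: int) -> str:
    """Return the highest guideline tier that `sample_count` reaches."""
    minimum, sufficient, optimal = GUIDELINES[content_type]
    if sample_count >= optimal:
        return "optimal"
    if sample_count >= sufficient:
        return "sufficient"
    if sample_count >= minimum:
        return "minimum"
    return "below minimum"


print(sample_tier("some variable", 20))
```

The guidelines are approximate, so a check like this is best treated as a quick sanity test, not a guarantee of accuracy.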

We are always very happy to help you get the best possible performance from a classifier, so just get in touch and we can make sure you have an optimal classifier, or help you tune it.

Using the classifier

Now that you've added samples to the classifier, you are ready to use it. To do this:

  1. Create a new document by making a request to the Create document endpoint, passing the file containing your document contents (PDF, DOCX, etc.)
  2. Call the Classify document endpoint with that document's Document ID and the Classifier name
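
The two steps above can be sketched as a single function. To keep the example independent of any HTTP library, it takes a hypothetical `send` callable that performs the request and returns the decoded JSON. The endpoint paths and the `id` field of the Create document response are assumptions made for illustration, so check the endpoint references for the real ones.

```python
from typing import Callable, Optional

# Hypothetical transport: (method, path, body) -> decoded JSON response.
Send = Callable[[str, str, Optional[bytes]], dict]


def classify_file(send: Send, file_bytes: bytes, classifier_name: str) -> dict:
    """Create a document from `file_bytes`, then classify it."""
    # Step 1: create the document.
    # ASSUMPTION: the path and the "id" response field are placeholders.
    document = send("POST", "/documents", file_bytes)
    document_id = document["id"]

    # Step 2: classify the document with the named classifier.
    # ASSUMPTION: the path layout is a placeholder.
    response = send("POST", f"/documents/{document_id}/classify/{classifier_name}", None)
    return response["classification_results"]


def fake_send(method: str, path: str, body: Optional[bytes]) -> dict:
    """Stand-in transport so the sketch runs without network access."""
    if path == "/documents":
        return {"id": "doc-123"}
    return {
        "_id": "doc-123",
        "classification_results": {
            "document_type": "Agreements",
            "is_confident": True,
            "document_type_scores": [],
        },
    }


results = classify_file(fake_send, b"file contents", "my-classifier")
print(results["document_type"])
```

In real use, `fake_send` would be replaced by a function that adds authentication and issues the HTTP request.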

The response from the Classify document endpoint looks like this:

{
  "_id": "WlrfT0J1dEChGZ5wXVTPzQ",
  "classification_results": {
    "document_type": "Agreements",
    "is_confident": true,
    "document_type_scores": [
      {
        "document_type": "Agreements",
        "score": 70.0982361
      },
      {
        "document_type": "Expenses",
        "score": 29.9017639
      }
    ]
  }
}

The document_type property is the 'best guess' document type, or null if the classifier has not had sufficient samples added.

The is_confident property is a flag indicating whether the classifier is confident in its classification.

The document_type_scores collection includes a score (out of 100) for every document type that the classifier knows about. You can think of this score as "how likely it is that the document being classified is this document type".
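
A minimal sketch of reading these three properties from a response is shown below. The function name and the `needs_review` flag are hypothetical. Sending unconfident or null results to manual review is one possible policy, not something the API prescribes.

```python
import json
from typing import Any


def summarize_classification(response: dict[str, Any]) -> dict[str, Any]:
    """Extract the best guess, a review flag, and the scores ranked high to low."""
    results = response["classification_results"]
    document_type = results["document_type"]
    # Scores are null when the classifier has no samples, so fall back to an empty list.
    scores = results["document_type_scores"] or []
    ranked = sorted(
        ((entry["document_type"], entry["score"]) for entry in scores),
        key=lambda pair: pair[1],
        reverse=True,
    )
    # ASSUMPTION: flagging null or unconfident results for review is an illustrative policy.
    needs_review = document_type is None or not results["is_confident"]
    return {
        "document_type": document_type,
        "needs_review": needs_review,
        "ranked_scores": ranked,
    }


sample = json.loads("""
{
  "_id": "WlrfT0J1dEChGZ5wXVTPzQ",
  "classification_results": {
    "document_type": "Agreements",
    "is_confident": true,
    "document_type_scores": [
      {"document_type": "Agreements", "score": 70.0982361},
      {"document_type": "Expenses", "score": 29.9017639}
    ]
  }
}
""")
print(summarize_classification(sample))
```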

You can read more about confidence and scoring here.