Preparing sample documents

Before you can create a classifier you need a set of sample documents. These samples are a representative selection of documents of each type that you want to be able to classify.

A classifier is then created from (or more specifically "trained on") the sample documents. During training the classifier analyses the content of each sample document, identifies the content that distinguishes each document type, and auto-tunes itself for optimum performance.

It is important to prepare your sample documents correctly, and have enough samples of each document type, in order to create a classifier that performs well.

This article explains how many samples you need and how to prepare them for creation of a classifier.

How many samples do I need?

The number of samples you need to add to a classifier for optimal performance depends on how much the content varies within each document type. Document types with highly variable content need more samples, so that the classifier can learn about that variety and about what distinguishes them from other types.

The table below gives some general guidelines, where:

  • Minimum = The approximate number required to get sensible results at least some of the time
  • Sufficient = The approximate number required to get good results most of the time
  • Optimal = The approximate number required to get the best results

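| Type of content         | Minimum | Sufficient | Optimal |
|-------------------------|---------|------------|---------|
| Mostly fixed content    | 1       | 5          | 10+     |
| Some variable content   | 5       | 20         | 50+     |
| Highly variable content | 20      | 50         | 100+    |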

Supported file types

Files of the following types can be used as samples:

  • PDFs that contain electronic content
  • Text files
  • Microsoft Office Word, Excel or PowerPoint documents

🚧

All sample documents used for building a classifier must already have content. For image-based documents that means they must have been read (OCRed) and included as either text files or PDFs with content. Classifier creation is fastest with text file samples, and for large sets of samples text files are much smaller and easier to work with. If your documents are images or PDFs without content then you can generate some text files or PDFs with content easily with the CLI.

Structuring your samples

It is important that you structure your sample set correctly, so samples are added to the classifier with the appropriate document types.

There must be at least two document types, and at least two samples per document type, before you can make a request to create a classifier.

The samples should be organised into folders where the folder name is the document type. For example:

* samples
  * Agreements
    * office-rental-agreement.pdf
    * leighton-acquisition.pdf
  * Expenses
    * emerson-tunt-nov-2016.pdf
    * jessie-palmer-jun-2017.pdf

No files should be present in the root folder.
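The rules above can be checked before you upload anything. The following is a minimal sketch, not part of the product: `validate_sample_set` is a hypothetical helper name, and it only covers the flat layout shown above, where every immediate subfolder of the root is one document type. It also counts hidden files (such as `.DS_Store`) as files, so remove those first.

```python
from pathlib import Path


def validate_sample_set(root):
    """Return a list of problems found in a flat sample set folder.

    Hypothetical helper: checks the three rules stated in this article
    (no files in the root, at least two document types, at least two
    samples per document type).
    """
    root = Path(root)
    problems = []

    # Files directly in the root folder have no document type.
    stray = sorted(p.name for p in root.iterdir() if p.is_file())
    if stray:
        problems.append(f"files in root folder: {', '.join(stray)}")

    # Each immediate subfolder is treated as one document type.
    type_dirs = sorted(p for p in root.iterdir() if p.is_dir())
    if len(type_dirs) < 2:
        problems.append(
            f"need at least 2 document types, found {len(type_dirs)}"
        )

    for type_dir in type_dirs:
        count = sum(1 for p in type_dir.iterdir() if p.is_file())
        if count < 2:
            problems.append(
                f"document type '{type_dir.name}' has {count} sample(s), "
                "need at least 2"
            )

    return problems
```

An empty list means the folder passes all three checks. For example, `validate_sample_set("samples")` could be called on the folder from the example above.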

Multiple levels of document types

It is possible to have a hierarchy of document types, although the classifier does not attach any special meaning to sub-types or use this information in any way. You will get identical results from a flat structure as from the equivalent hierarchical structure.

When you have more than one level of document type, the document type result from a classification will be of the form parent-type/child-type, with a / character separating each level.

To create a sample set with a hierarchy like this, simply create additional subfolders for sub-types.

In the example folder structure below, four document types will be added to the classifier: "Agreements", "Expenses", "HR/CVs", and "HR/Contracts".

* Agreements
  * office-rental-agreement.pdf
  * leighton-acquisition.pdf
* Expenses
  * emerson-tunt-nov-2016.pdf
  * jessie-palmer-jun-2017.pdf
* HR
  * CVs
    * jo-bloggs-cv.pdf
    * charlie-briar-cv.pdf
  * Contracts
    * emerson-tunt-employment-contract.pdf
    * ashley-lake-employment-contract.pdf
    * jessie-palmer-employment-contract.pdf
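To preview which document type names a folder structure like this would produce, you can walk it locally. This is a minimal sketch under one assumption inferred from the example above: a folder counts as a document type only if it directly contains sample files (so "HR" itself is not a type). `document_types` is a hypothetical helper name, not part of any API.

```python
from pathlib import Path


def document_types(root):
    """Map each document type name to the number of sample files in it.

    Assumption: a folder is a document type only if it directly contains
    files. Files in the root folder itself are ignored.
    """
    root = Path(root)
    counts = {}
    for path in root.rglob("*"):
        if path.is_file() and path.parent != root:
            # as_posix() joins the folder levels with "/" on every platform,
            # matching the parent-type/child-type form described above.
            name = path.parent.relative_to(root).as_posix()
            counts[name] = counts.get(name, 0) + 1
    return counts
```

Run against the example structure, this returns the four names listed above as keys, each with its sample count as the value.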

Creating a samples ZIP file

Once you have your sample documents there are three ways to create a classifier:

The first two of these require you to create a ZIP file of your sample set and pass that file as an input.

📘

You should make sure that the folders for each top-level document type are in the root of the ZIP file. With some ZIP tools, selecting the parent folder to ZIP will result in an extra folder level in the root of the ZIP file.
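One way to avoid the extra folder level is to build the ZIP with a script that stores each path relative to the samples folder. The sketch below uses Python's standard `zipfile` module; `zip_samples` is a hypothetical helper name, and the output path should be outside the samples folder so the ZIP does not try to include itself.

```python
import zipfile
from pathlib import Path


def zip_samples(samples_dir, zip_path):
    """Zip the contents of samples_dir so its subfolders sit at the ZIP root."""
    samples_dir = Path(samples_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(samples_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to samples_dir, for example
                # "Agreements/a.pdf" rather than "samples/Agreements/a.pdf".
                zf.write(path, path.relative_to(samples_dir).as_posix())
```

For example, `zip_samples("samples", "samples.zip")` produces entries that begin with the document type folder names rather than with `samples/`.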

Maximum file sizes

The maximum size of a samples ZIP file is 30MB. The maximum size of any individual file in the ZIP is 15MB (uncompressed). If you are working with very large (multi-megabyte) files, you can either split your samples across multiple ZIP files or add each sample one at a time using the Add single sample file API endpoint.
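These limits can be checked locally before uploading. The following is a minimal sketch using Python's standard `zipfile` module. `check_zip_sizes` is a hypothetical helper name, and it assumes 1 MB means 1024 × 1024 bytes; the service may count megabytes differently, so treat results close to the limit with caution.

```python
import os
import zipfile

MB = 1024 * 1024  # assumption: the limits may instead use 1 MB = 1,000,000 bytes
MAX_ZIP_BYTES = 30 * MB   # limit on the ZIP file itself
MAX_FILE_BYTES = 15 * MB  # limit on each file, uncompressed


def check_zip_sizes(zip_path, max_zip=MAX_ZIP_BYTES, max_file=MAX_FILE_BYTES):
    """Return a list of size-limit problems for a samples ZIP file."""
    problems = []

    zip_size = os.path.getsize(zip_path)
    if zip_size > max_zip:
        problems.append(f"ZIP is {zip_size} bytes, limit is {max_zip}")

    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            # file_size is the uncompressed size of the entry.
            if info.file_size > max_file:
                problems.append(
                    f"{info.filename} is {info.file_size} bytes uncompressed, "
                    f"limit is {max_file}"
                )

    return problems
```

An empty list means the ZIP is within both limits under the assumption above.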

Identifying samples or document types within a large unstructured set

Ideally you will already have an understanding of which document types you want a classifier to identify, as this will usually be determined by a subsequent business process.

Sometimes, however, you may be dealing with a document set you are not familiar with, or one where the document types are not so obvious. Alternatively, you may understand the document types but find it difficult to locate samples of each one within the set.

We have an offline tool that helps with both of these problems. It analyses a document set, identifies "clusters" of similar documents, lets you label samples, and updates the clusters in real time as you do so. This lets you very quickly create a labelled sample set that you know will perform well.

If you would like the tool, please contact us at [email protected] and we'll help you get going.