Before you can create a classifier you need a set of sample documents. These samples are a representative selection of documents of each type that you want to be able to classify.
A classifier is then created from (or more specifically "trained on") the sample documents. During training the classifier analyses the content of each sample document, identifies the content that distinguishes each document type, and auto-tunes itself for optimum performance.
It is important to prepare your sample documents correctly, and have enough samples of each document type, in order to create a classifier that performs well.
This article explains how many samples you need and how to prepare them for creation of a classifier.
The number of samples you need to add to a classifier for optimal performance depends on how variable the content is within each document type. Document types with highly variable content need more samples, so that the classifier can learn about that variety and about what distinguishes them from other types.
The table below gives some general guidelines, where:
- Minimum = The approximate number required to get sensible results at least some of the time
- Sufficient = The approximate number required to get good results most of the time
- Optimal = The approximate number required to get the best results
| Type of content | Minimum | Sufficient | Optimal |
|---|---|---|---|
| Mostly fixed content | 1 | 5 | 10+ |
| Some variable content | 5 | 20 | 50+ |
| Highly variable content | 20 | 50 | 100+ |
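As a rough illustration, the guideline table above can be expressed as a lookup that rates how many samples you have for a document type. This is only a sketch: the function name `rate_sample_count` and the dictionary keys are hypothetical, and the thresholds are simply the approximate figures from the table.

```python
# Approximate thresholds copied from the guideline table above.
# The keys and the function name are illustrative, not part of any API.
GUIDELINES = {
    "mostly fixed": {"minimum": 1, "sufficient": 5, "optimal": 10},
    "some variable": {"minimum": 5, "sufficient": 20, "optimal": 50},
    "highly variable": {"minimum": 20, "sufficient": 50, "optimal": 100},
}


def rate_sample_count(content_type: str, count: int) -> str:
    """Return the highest guideline level reached by `count` samples."""
    levels = GUIDELINES[content_type]
    rating = "below minimum"
    # Levels are checked in ascending order, so the last match wins.
    for name in ("minimum", "sufficient", "optimal"):
        if count >= levels[name]:
            rating = name
    return rating


print(rate_sample_count("some variable", 20))
```

Because the table values are approximate, a rating like this is a planning aid rather than a guarantee of classifier quality.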
Files of the following types can be used as samples:
- PDFs that contain electronic content
- Text files
- Microsoft Office Word, Excel or PowerPoint documents
All sample documents used for building a classifier must already have content. For image-based documents, that means they must have been read (OCRed) and included as either text files or PDFs with content. Classifier creation is fastest with text file samples, and for large sets of samples text files are much smaller and easier to work with. If your documents are images or PDFs without content, you can use the CLI to generate text files or PDFs with content.
It is important that you structure your sample set correctly, so samples are added to the classifier with the appropriate document types.
A sample set must contain at least two document types, with at least two samples per document type, before it can be used to create a classifier.
The samples should be organised into folders where the folder name is the document type. For example:
* samples
  * Agreements
    * office-rental-agreement.pdf
    * leighton-acquisition.pdf
  * Expenses
    * emerson-tunt-nov-2016.pdf
    * jessie-palmer-jun-2017.pdf
No files should be present in the root folder.
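The structural rules above can be checked locally before you build a classifier. The following is a minimal sketch for a flat sample set (one folder level per document type); the function name `check_sample_set` is hypothetical and the script is not part of the product's CLI or API.

```python
from pathlib import Path


def check_sample_set(root):
    """Return a list of problems found in a flat sample folder (empty if none)."""
    root = Path(root)
    problems = []

    # Files directly in the root folder have no document type.
    loose = sorted(p.name for p in root.iterdir() if p.is_file())
    if loose:
        problems.append(f"files in root folder: {', '.join(loose)}")

    # Each top-level folder is treated as one document type.
    type_dirs = sorted(p for p in root.iterdir() if p.is_dir())
    if len(type_dirs) < 2:
        problems.append("fewer than two document types")

    # Each document type needs at least two sample files.
    for folder in type_dirs:
        count = sum(1 for f in folder.iterdir() if f.is_file())
        if count < 2:
            problems.append(
                f"document type '{folder.name}' has fewer than two samples"
            )
    return problems
```

Calling `check_sample_set("samples")` on the example folder above would return an empty list, since both document types contain two samples and the root holds no files.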
It is possible to have a hierarchy of document types, although the classifier does not attach any special meaning to sub-types or use this information in any way. You will get identical results from a flat structure as from the equivalent hierarchical structure.
When you have more than one level of document type, the document type result from a classification will be of the form "parent-type/child-type", with a "/" character separating each level.
To create a sample set with a hierarchy like this, simply create additional subfolders for sub-types.
In the example folder structure below, four document types will be added to the classifier: "Agreements", "Expenses", "HR/CVs", and "HR/Contracts".
* Agreements
  * office-rental-agreement.pdf
  * leighton-acquisition.pdf
* Expenses
  * emerson-tunt-nov-2016.pdf
  * jessie-palmer-jun-2017.pdf
* HR
  * CVs
    * jo-bloggs-cv.pdf
    * charlie-briar-cv.pdf
  * Contracts
    * emerson-tunt-employment-contract.pdf
    * ashley-lake-employment-contract.pdf
    * jessie-palmer-employment-contract.pdf
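To preview which document types a folder hierarchy will produce, you can derive the labels from the folder paths yourself. This is a sketch under the assumption that a label is simply the sample's folder path relative to the root, joined with "/"; the function name `document_types` is hypothetical.

```python
from pathlib import Path


def document_types(root):
    """Map each document type label to the names of its sample files."""
    root = Path(root)
    types = {}
    for path in sorted(root.rglob("*")):
        # Skip folders, and skip files in the root (they have no type).
        if path.is_file() and path.parent != root:
            # The label is the folder path relative to the root,
            # with "/" separating each level, e.g. "HR/CVs".
            label = path.parent.relative_to(root).as_posix()
            types.setdefault(label, []).append(path.name)
    return types
```

For the example structure above, the keys of the returned dictionary would be "Agreements", "Expenses", "HR/CVs", and "HR/Contracts".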
Once you have your sample documents there are three ways to create a classifier:
- Using the CLI's create classifier command
- Making a request to the Add samples from ZIP file API endpoint
- Making multiple requests to the Add single sample file API, once for each sample
The first two of these require you to create a ZIP file of your sample set and pass that file as an input.
You should make sure that the folders for each top-level document type are in the root of the ZIP file. With some ZIP tools, selecting the parent folder to ZIP will result in an extra folder level in the root of the ZIP file.
The maximum size of a samples ZIP file is 30MB. The maximum size of any individual file in the ZIP is 15MB (uncompressed). If you are working with very large (multi-megabyte) files, you can either split your samples across multiple ZIP files or add each sample one at a time using the Add single sample file API endpoint.
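One way to avoid the extra folder level and to catch oversized files early is to build the ZIP with a short script. The sketch below uses Python's standard `zipfile` module; the function name `zip_sample_set` is hypothetical, and it assumes that 1MB means 1024 × 1024 bytes, which the limits above do not specify.

```python
import zipfile
from pathlib import Path

# Limits from the text above; the byte interpretation of "MB" is an assumption.
MAX_ZIP_BYTES = 30 * 1024 * 1024
MAX_FILE_BYTES = 15 * 1024 * 1024


def zip_sample_set(root, zip_path):
    """Zip a sample folder so that document type folders sit in the ZIP root."""
    root, zip_path = Path(root), Path(zip_path)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(root.rglob("*")):
            # Skip folders, and skip the archive itself if it lives under root.
            if not path.is_file() or path.resolve() == zip_path.resolve():
                continue
            if path.stat().st_size > MAX_FILE_BYTES:
                raise ValueError(f"{path} exceeds the per-file size limit")
            # Store paths relative to root, so "Agreements/..." is top level
            # rather than "samples/Agreements/...".
            zf.write(path, path.relative_to(root).as_posix())
    if zip_path.stat().st_size > MAX_ZIP_BYTES:
        raise ValueError(f"{zip_path} exceeds the ZIP size limit")
    return zip_path
```

If the size check fails, the sample set can be split across several ZIP files, as described above.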
Ideally you will already have an understanding of which document types you want a classifier to identify, as this will usually be determined by a subsequent business process.
Sometimes however you may be dealing with a document set you are not familiar with, or where the document types are not so obvious. Alternatively, you may understand the document types but be finding it difficult to locate samples of each one within the set.
We have an offline tool that helps with both of these problems. It analyses a document set, identifies "clusters" of similar documents, lets you label samples, and updates the clusters in real time. This lets you quickly create a labelled sample set that you know will perform well.
If you would like the tool, please contact us at [email protected] and we'll help you get going.