The API is straightforward to use, and you can process documents successfully with just a few simple requests. Before you use the API in a production service, however, there are several best practices you should follow so that your integration is resilient and efficient.
Access tokens expire periodically and a new token must be obtained in order to continue making requests to the other API endpoints.
Your code should check the expiration time of the token (the exp JWT property) and request a new token just before this time. You should not request a new token more frequently than this, or you may be rate-limited.
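The token handling described above could be sketched as follows. This is a minimal illustration, not the API's own client: `fetch_token` is a hypothetical callable standing in for your call to the token endpoint, and the 60-second refresh margin is an assumed value rather than a documented requirement.

```python
import base64
import json
import time

REFRESH_MARGIN_SECONDS = 60  # assumed safety margin, not an API requirement


def token_expiry(jwt_token: str) -> int:
    """Return the `exp` claim (seconds since the Unix epoch) of a JWT.

    The payload is read without verifying the signature, which is enough for
    deciding when to refresh a token that the API itself issued to you.
    """
    payload_b64 = jwt_token.split(".")[1]
    # JWT segments are base64url-encoded with the padding stripped; restore it.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    payload = json.loads(base64.urlsafe_b64decode(payload_b64))
    return int(payload["exp"])


class TokenCache:
    """Caches a token and only fetches a new one shortly before it expires."""

    def __init__(self, fetch_token, margin=REFRESH_MARGIN_SECONDS, clock=time.time):
        self._fetch_token = fetch_token  # hypothetical: callable returning a JWT string
        self._margin = margin
        self._clock = clock
        self._token = None
        self._expires_at = 0

    def get(self) -> str:
        # Reuse the cached token until it is within `margin` seconds of expiry.
        if self._token is None or self._clock() >= self._expires_at - self._margin:
            self._token = self._fetch_token()
            self._expires_at = token_expiry(self._token)
        return self._token
```

Calling `get()` before every request keeps the number of token requests to roughly one per token lifetime, which helps you stay clear of the rate limit.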
To be resilient to transient errors, you should wrap all requests to the API in a retry policy. We recommend a limited exponential back-off policy, a circuit breaker, or a combination of the two. A circuit breaker is a useful pattern that allows you to degrade your application's service while the remote dependency is unavailable, whereas an infinite retry policy will instead cause your application to hang.
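As an illustration of the circuit breaker idea, here is a deliberately small sketch. The class name, the failure threshold of 5 and the 30-second reset timeout are all assumptions chosen for the example; it is not thread-safe, and in practice you would probably use an established resilience library and count only failures that indicate the API is unavailable.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is refused because the circuit is open."""


class CircuitBreaker:
    """Fails fast after repeated failures instead of calling the API again."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self._failure_threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, request):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_timeout:
                raise CircuitOpenError("API marked unavailable; failing fast")
            # Timeout elapsed: half-open, so let one trial request through.
        try:
            result = request()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self._failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, your application can catch `CircuitOpenError` and serve a degraded response immediately rather than waiting on a dependency that is known to be failing.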
Your retry policy should retry requests only in the following cases:
- Any request that fails because of a network error
- Any request that returns a response with HTTP status code >= 500 (server errors) or 408 (request timeout)
Be sure to include jitter in your exponential back-off policy, to avoid sending all your retries at once, and place a limit on the number of retries you attempt.
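A retry policy following these rules might look like the sketch below. `ApiResponseError` is a hypothetical exception type standing in for however your HTTP client reports a non-success status code, and the base delay, cap and retry limit are assumed values to tune for your own service. The jitter used here is "full jitter", where each delay is drawn uniformly between zero and the exponential bound.

```python
import random
import time


class ApiResponseError(Exception):
    """Hypothetical error type carrying the HTTP status code of a failed call."""

    def __init__(self, status_code: int):
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


def is_retryable(error: Exception) -> bool:
    """Retry network errors, 5xx server errors and 408 request timeouts only."""
    if isinstance(error, ApiResponseError):
        return error.status_code >= 500 or error.status_code == 408
    # Assumption: the HTTP client surfaces network failures as OSError
    # subclasses (ConnectionError, TimeoutError, ...).
    return isinstance(error, OSError)


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Exponential back-off with full jitter: uniform in [0, min(cap, base * 2^attempt)]."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


def call_with_retries(request, max_retries: int = 5, sleep=time.sleep):
    """Call `request()` and retry retryable failures at most `max_retries` times."""
    for attempt in range(max_retries + 1):
        try:
            return request()
        except Exception as error:
            # Give up once the retry budget is spent or the error is permanent.
            if attempt == max_retries or not is_retryable(error):
                raise
            sleep(backoff_delay(attempt))
```

Other client errors, such as a 403 caused by hitting the document limit, are raised immediately, because repeating the same request would not change the outcome.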
The document resource is the foundation of the API. In order to process your files (PDFs, TIFFs etc.), you will:
- Create a new document resource using the Create document (upload) or Create document (import) endpoints
- OCR, classify, extract data from, or redact the document using the appropriate endpoints and the document resource
- Delete the document resource
Your API client has a maximum number of document resources that can exist at any time. This number is deliberately kept low to encourage you to keep your documents within the service for as short a time as possible.
You should ensure that you always delete the document resource once you've processed your document, even in error cases. Otherwise you will hit your document limit and the Create document request will return a 403 response code.
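One way to guarantee deletion even in error cases is to tie the lifetime of the document resource to a `try`/`finally` block. The sketch below assumes a hypothetical `client` object of your own with `create_document(file_bytes)` and `delete_document(document_id)` methods wrapping the corresponding endpoints; those names are illustrative and not part of the API.

```python
from contextlib import contextmanager


@contextmanager
def uploaded_document(client, file_bytes):
    """Create a document resource and guarantee it is deleted afterwards.

    `client` is a hypothetical wrapper around the API exposing
    `create_document(file_bytes) -> document_id` and `delete_document(document_id)`.
    """
    document_id = client.create_document(file_bytes)
    try:
        yield document_id
    finally:
        # Runs on success and on error, so the document never counts
        # against the client's document limit for longer than necessary.
        client.delete_document(document_id)
```

You would then do all processing inside a `with uploaded_document(client, file_bytes) as document_id:` block, and the delete request is sent when the block exits, whether it finishes normally or raises.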
If you have many documents and want to process these as fast as possible, you can do this by creating and using multiple document resources at the same time.
You should process a maximum of 10 documents at a time, even if your API client permits you to create more documents than this. If your volumes are sufficiently large that it would be valuable for you to parallelise to a greater degree, please talk to us.
When processing documents in parallel, the elapsed time taken to process each individual document may be slightly increased, even though overall time to process all documents will be decreased.
A simple approach to parallelisation is to take a batch of documents (say 10), process them all, and wait until all are complete before processing another batch. However, each document will take a different amount of time to complete processing, so workers that finish early sit idle until the slowest document in the batch is done. You will therefore maximise throughput if you ensure that as soon as any document has completed, the next one is started.
One common pattern used to accomplish this is:
- Push all documents onto a thread-safe queue
- Create a pool of workers on different threads (as many as the degree of parallelisation you want)
- Each worker pulls the next document from the queue, processes it, does whatever it needs to with results (perhaps pushing to a results queue) and then pulls another document
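The queue-and-workers pattern above could be sketched like this. `process_document` is a hypothetical callable of your own that performs the whole create, process and delete cycle for one document; the default of 10 workers mirrors the recommended maximum degree of parallelism.

```python
import queue
import threading


def process_all(documents, process_document, worker_count=10):
    """Process `documents` with a fixed pool of worker threads.

    `process_document` is a hypothetical callable that performs the full
    create / process / delete cycle for one document and returns its result.
    Returns a list of (document, result, error) tuples in completion order.
    """
    work = queue.Queue()
    results = queue.Queue()
    for document in documents:
        work.put(document)

    def worker():
        while True:
            try:
                document = work.get_nowait()
            except queue.Empty:
                return  # nothing left to do
            try:
                results.put((document, process_document(document), None))
            except Exception as error:  # keep the worker alive for the next document
                results.put((document, None, error))

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    # All workers have finished, so the results queue is no longer changing.
    return [results.get() for _ in range(results.qsize())]
```

Because each worker pulls its next document as soon as it finishes the previous one, a slow document only occupies one worker instead of holding up a whole batch.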
Utilising the Bulkhead pattern will allow you to constrain the parallelism whilst buffering HTTP requests that can't currently be serviced.
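A bulkhead can be approximated with a bounded thread pool plus a cap on how many calls may wait. The sketch below is one possible shape, not a reference implementation: the `Bulkhead` and `BulkheadFullError` names and the default limits are assumptions made for the example, and mature resilience libraries offer the same pattern with more features.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class BulkheadFullError(Exception):
    """Raised when both the execution slots and the waiting buffer are full."""


class Bulkhead:
    """Runs at most `max_parallel` calls at once and buffers up to `max_queued` more."""

    def __init__(self, max_parallel=10, max_queued=100):
        self._executor = ThreadPoolExecutor(max_workers=max_parallel)
        # One permit per call that is either running or waiting in the buffer.
        self._permits = threading.BoundedSemaphore(max_parallel + max_queued)

    def submit(self, fn, *args, **kwargs):
        """Schedule `fn` and return a Future, or raise if the bulkhead is full."""
        if not self._permits.acquire(blocking=False):
            raise BulkheadFullError("bulkhead is full; shed load or try again later")
        try:
            future = self._executor.submit(fn, *args, **kwargs)
        except BaseException:
            self._permits.release()
            raise
        # Free the permit once the call has finished, whatever the outcome.
        future.add_done_callback(lambda _: self._permits.release())
        return future

    def shutdown(self):
        """Wait for all running and buffered calls to finish."""
        self._executor.shutdown(wait=True)
```

Requests submitted while all slots are busy wait in the executor's internal queue, and once the buffer is also full the caller gets an immediate error instead of an unbounded backlog.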