Export data
In this guide we'll demonstrate how to use the API to export data from tasks once all automation and manual validation are complete.
If you want to export data to a folder on a Windows or Linux file system then we recommend using the Aluma File System Agent application instead of using the API. The File System Agent can monitor for tasks ready for export and export data to files in a folder automatically.
The data to export for each document in a task is defined in the settings of the project used to process the task. Data can be exported in a variety of standard formats such as CSV and JSON, or in a custom format. Searchable PDF files can also be exported.
Overview
Your export code should follow this high-level approach (sketched in code after the list):
- Get the tasks that are waiting for export.
- For each task:
  - Start the processing of the task by making a Start Client Action request, which returns an Export Specification in the response.
  - Get the data from the Export Specification and write it to your system.
  - Either complete the processing of the task by making a Complete Client Action request, or make a Cancel or Fail Client Action request if there was an error.
- Wait for an appropriate duration before repeating.
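The following is a minimal sketch of that loop in Python. The helper functions (`list_export_tasks`, `start_export`, `write_export_data`, `complete_export`, `cancel_export`) are hypothetical names that the later sections of this guide flesh out, and the fixed five-second wait is only a placeholder for the back-off described under "Wait and repeat".

```python
import time

def run_export_loop() -> None:
    """Repeatedly export any waiting tasks, pausing when there is nothing to do."""
    while True:
        tasks = list_export_tasks()              # List Tasks request (see below)
        for task in tasks:
            try:
                spec = start_export(task["id"])  # Start Client Action -> Export Specification
                write_export_data(spec)          # write the data to your system
                complete_export(task["id"])      # Complete Client Action
            except Exception:
                # Cancel re-queues the task; for fatal errors make a Fail Client Action instead.
                cancel_export(task["id"])
        if not tasks:
            time.sleep(5)                        # placeholder wait; see "Wait and repeat"
```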
Get the tasks that are waiting for export
To get the set of tasks that are waiting for export, make a List Tasks request with an appropriate filter specified using query parameters:
GET /tasks?state=pending:start_client_action&client_action=export
A successful request will return a `200 OK` response that contains one object for each task currently waiting for export:
{
"tasks": [
{
"id": "SWdh86taQhKwwpY0HUKD7Q",
"name": "00001.pdf",
"file_collection_id": "hZ7GkxSHSRmk9ibpio_yOQ",
"project_id": "ZbfzIZorTFOV60GX58YUuQ",
"project_name": "ACME Medical Records",
"created_at": "2024-03-25T16:37:08Z",
"documents_count": 5,
"state": "pending:start_client_action",
"client_action": "export"
},
{
"id": "SWdh86taQhKwwpY0HUKD7Q",
"name": "00002.pdf",
"file_collection_id": "z1AG4ddYQkqIUkYMdJM_nA",
"project_id": "ZbfzIZorTFOV60GX58YUuQ",
"project_name": "ACME Medical Records",
"created_at": "2024-03-25T16:38:66Z",
"documents_count": 22,
"state": "pending:start_client_action",
"client_action": "export"
}
]
}
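As an illustration, a List Tasks request could be made like this in Python with the `requests` library; the base URL and bearer-token header are placeholder assumptions, so substitute the endpoint and authentication scheme your account actually uses.

```python
import requests

BASE_URL = "https://api.example.com"                    # placeholder: your API base URL
HEADERS = {"Authorization": "Bearer YOUR_API_TOKEN"}    # placeholder credentials

def list_export_tasks() -> list[dict]:
    """Return the tasks that are currently waiting for export."""
    response = requests.get(
        f"{BASE_URL}/tasks",
        params={"state": "pending:start_client_action", "client_action": "export"},
        headers=HEADERS,
        timeout=30,
    )
    response.raise_for_status()                          # expect 200 OK
    return response.json()["tasks"]
```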
Start the export
To start the processing of a task, make a Start Client Action request on the task using its `id` and with the `export` action specified as a query parameter:
PUT /tasks/[id]/start_client_action?action=export
A successful request will return a `202 Accepted` response that contains details of all the files that should be exported. We refer to this as an Export Specification:
{
"documents": [
{
"files": [
{
"folder": "",
"filename": "00001.csv",
"source": "content",
"content": "PatientNumber,PatientName\n,X34123,John C Franklin",
"url": null
},
{
"folder": "invoices",
"filename": "00001.pdf",
"source": "url",
"content": null,
"url": "/file_collections/hZ7GkxSHSRmk9ibpio_yOQ/files/V--btp-iRNyudP4miYrShg"
}
]
}
]
}
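A sketch of the Start Client Action request, with the same placeholder base URL and credentials as before, might look like this; the `202 Accepted` response body is the Export Specification.

```python
import requests

BASE_URL = "https://api.example.com"                    # placeholders as in the earlier sketch
HEADERS = {"Authorization": "Bearer YOUR_API_TOKEN"}

def start_export(task_id: str) -> dict:
    """Start the export client action and return the Export Specification."""
    response = requests.put(
        f"{BASE_URL}/tasks/{task_id}/start_client_action",
        params={"action": "export"},
        headers=HEADERS,
        timeout=30,
    )
    response.raise_for_status()                          # expect 202 Accepted
    return response.json()                               # the Export Specification
```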
Get the data and write it to your system
The Export Specification contains a `documents` array with one element for each document in the task. There will always be at least one document, and there may be more if your project is configured to allow this.
Each document object contains a `files` property, which is an array of one or more objects, each describing a file to be exported. For example, your project may be configured to export a CSV file containing data and a PDF file containing the pages of the automatically-separated document.
If you are exporting to a database or document management system, each of these files may correspond to a different type of destination, such as a database table, and you will need to interpret each file object accordingly, for example by using its `folder` or `filename` properties.
Each file object has these properties:
| Property | Description |
|---|---|
| `folder` | The relative path to a folder (i.e. subfolder) in which to create the file. This is defined in the project settings and may be derived from extracted data. It may be an empty string, and it may include multiple parts separated with a forward slash, e.g. `invoices/2024/Dell`. |
| `filename` | The filename that the file should be given, including the file extension. This is defined in the project settings and may be derived from extracted data. |
| `source` | Where to find the data that should be written to the file. The value of this property will be either `content` or `url`. |
| `content` | The data that should be written to the file, if `source` is `content`. Newlines are represented by `\n`; when writing files on Windows you may wish to replace these with `\r\n`. |
| `url` | A URL from which binary content (such as PDF content) should be downloaded and written to the file, if `source` is `url`. The URL is relative to the main API URL. |
You should iterate through each file in each document, check the `source` property, either take the data from the `content` property or download it from the `url` (with a GET request), and write it to your destination using the `folder` and `filename` properties.
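For example, if your destination is a local folder, that loop might look like the sketch below; the destination root and the placeholder base URL and credentials are assumptions, and exporting to a database or document management system would replace the file-writing step.

```python
import os
import requests

BASE_URL = "https://api.example.com"                    # placeholders as in the earlier sketch
HEADERS = {"Authorization": "Bearer YOUR_API_TOKEN"}
EXPORT_ROOT = "/data/export"                            # placeholder destination folder

def write_export_data(spec: dict) -> None:
    """Write every file described by an Export Specification to the destination folder."""
    for document in spec["documents"]:
        for file in document["files"]:
            target_dir = os.path.join(EXPORT_ROOT, file["folder"])
            os.makedirs(target_dir, exist_ok=True)
            target_path = os.path.join(target_dir, file["filename"])

            if file["source"] == "content":
                # Text content; on Windows you may wish to replace "\n" with "\r\n" first.
                data = file["content"].encode("utf-8")
            else:
                # source == "url": download binary content from a URL relative to the main API URL.
                response = requests.get(BASE_URL + file["url"], headers=HEADERS, timeout=60)
                response.raise_for_status()
                data = response.content

            with open(target_path, "wb") as output:
                output.write(data)
```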
Complete the export
To complete the processing of a task when you have exported all the data, make a Complete Client Action request on the task using its `id` and with the `export` action specified as a query parameter:
PUT /tasks/[id]/complete_client_action?action=export
A successful request will return a `202 Accepted` response.
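A matching sketch of the Complete Client Action request, with the same placeholder base URL and credentials:

```python
import requests

BASE_URL = "https://api.example.com"                    # placeholders as in the earlier sketch
HEADERS = {"Authorization": "Bearer YOUR_API_TOKEN"}

def complete_export(task_id: str) -> None:
    """Mark the export client action as complete once all the data has been written."""
    response = requests.put(
        f"{BASE_URL}/tasks/{task_id}/complete_client_action",
        params={"action": "export"},
        headers=HEADERS,
        timeout=30,
    )
    response.raise_for_status()                          # expect 202 Accepted
```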
Note that once you have successfully made a Start Client Action request on a task, it is important that you subsequently make either a Complete Client Action, a Cancel Client Action or a Fail Client Action request on the task. If you do not, the API will detect that none of these requests was received and will re-queue the task for export.
Handle any error during the processing of a task
Transient errors
If a transient (temporary) error occurs during the processing of a task:
- Make a Cancel Client Action request on the task. The task will immediately be re-queued for export.
- Ensure that you do not continuously reprocess the task, but instead wait for an appropriate amount of time before trying again. (You should also consider implementing a maximum number of retries, but this is not mandatory).
To make a Cancel Client Action request:
PUT /tasks/[id]/cancel_client_action?action=export
A simple way to ensure that you do not continuously reprocess the task is to use a cache that allows you to specify an expiry time for each item added to it.
When an error occurs, add an item to the cache with the task ID as the key and the retry time as the expiry time. Before processing a task, check whether there is an item in the cache with that task's ID. If there is then ignore the task and do not process it yet.
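One way to combine the Cancel Client Action request with such a cache is sketched below; the in-memory dictionary and the five-minute retry delay are illustrative assumptions rather than part of the API, and a shared cache would be needed if you run more than one export process.

```python
import time
import requests

BASE_URL = "https://api.example.com"                    # placeholders as in the earlier sketch
HEADERS = {"Authorization": "Bearer YOUR_API_TOKEN"}
RETRY_DELAY_SECONDS = 300                               # illustrative retry delay

_retry_after: dict[str, float] = {}                     # task ID -> time before which to ignore it

def cancel_export(task_id: str) -> None:
    """Cancel the export client action (re-queueing the task) and delay the next retry."""
    response = requests.put(
        f"{BASE_URL}/tasks/{task_id}/cancel_client_action",
        params={"action": "export"},
        headers=HEADERS,
        timeout=30,
    )
    response.raise_for_status()
    _retry_after[task_id] = time.monotonic() + RETRY_DELAY_SECONDS

def should_skip(task_id: str) -> bool:
    """Return True if this task recently failed and should not be retried yet."""
    expiry = _retry_after.get(task_id)
    if expiry is None:
        return False
    if time.monotonic() >= expiry:
        del _retry_after[task_id]                        # the cache entry has expired
        return False
    return True
```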
To handle the possibility that a task is partially processed before hitting an error, make sure your data export code is idempotent (safe to run more than once with the same result). For example, if you are writing data to a database table, perform an "upsert" rather than an insert.
Fatal errors
If a fatal error occurs during the processing of a task, make a Fail Client Action request on the task. The task will be put into the `failed` state and no further processing will take place.
To make a Fail Client Action request:
PUT /tasks/[id]/fail_client_action?action=export
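And a corresponding sketch of the Fail Client Action request, with the same placeholder assumptions:

```python
import requests

BASE_URL = "https://api.example.com"                    # placeholders as in the earlier sketch
HEADERS = {"Authorization": "Bearer YOUR_API_TOKEN"}

def fail_export(task_id: str) -> None:
    """Mark the export client action as failed; the task will not be processed again."""
    response = requests.put(
        f"{BASE_URL}/tasks/{task_id}/fail_client_action",
        params={"action": "export"},
        headers=HEADERS,
        timeout=30,
    )
    response.raise_for_status()                          # expect 202 Accepted
```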
Wait and repeat
To avoid making unnecessary List Tasks requests, and to avoid being rate-limited, you should wait for an appropriate duration after processing all queued tasks before getting a new set of tasks to process. A good approach is:
- If any tasks were processed in the previous cycle (not counting any ignored because of previous errors), then get a new set of tasks immediately.
- If no tasks were processed in the previous cycle, wait for a few seconds before getting a new set of tasks. For an optimal implementation, start with a short duration and back off (wait longer) each time there are no new tasks to process, up to a maximum duration, as sketched below.
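A minimal sketch of that back-off calculation, with illustrative minimum and maximum wait durations, could replace the fixed wait in the loop sketched in the Overview:

```python
MIN_WAIT_SECONDS = 2.0                                  # illustrative lower bound
MAX_WAIT_SECONDS = 60.0                                 # illustrative upper bound

def next_wait(previous_wait: float, processed_any: bool) -> float:
    """Return how long to wait before the next List Tasks request."""
    if processed_any:
        return 0.0                                      # work was done: poll again immediately
    if previous_wait <= 0.0:
        return MIN_WAIT_SECONDS                         # first idle cycle: start with a short wait
    return min(previous_wait * 2, MAX_WAIT_SECONDS)     # back off, up to the maximum
```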