Extract data from the specified document using an extractor.
Overview
This endpoint extracts data from the specified document using an extractor. By default it returns details of the extracted data, but it can also be used to obtain a response that can be passed directly to the Get redacted PDF endpoint to get a PDF with all extracted data redacted.
On-demand reading
Note that documents created from image files (TIFF, JPEG, JPEG2000) and PDFs that contain only images are automatically read (OCRed) before extraction is performed. For small documents this will usually be very quick, but for very large documents you should expect response time to be longer.
Extracting invoice data
To extract invoice data from UK invoices, you can use the built-in extractor named aluma.invoices.gb. For more information, see Extracting invoice data.
Extracted data results
If the Accept header is not set or is application/vnd.waives.resultformats.extractdata+json, then the response contains details of the data extracted from the document.
The field_results section of the response contains the data extracted from the document. This is an array containing one element for each field in the extractor configuration. Each field looks like this:
{
  "field_name": "Amount",
  "result": {
    "text": "$5.50",
    "value": null,
    "rejected": false,
    "reject_reason": "None",
    "areas": [
      {
        "top": 558.7115,
        "left": 276.48,
        "bottom": 571.1989,
        "right": 298.58,
        "page_number": 1
      }
    ],
    "proximity_score": 100.0,
    "match_score": 100.0,
    "text_score": 100.0
  },
  "alternatives": null,
  "tabular_results": null,
  "rejected": false,
  "reject_reason": "None"
}
The properties of the field are:
field_name: The name of the field
result: The primary result for the field (null for a table field)
rejected: A flag indicating whether the field results should be considered potentially invalid
reject_reason: The reason for rejection of the field
alternatives: Secondary (alternative) results for the field
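The field properties above can be consumed along these lines. This is a minimal sketch, assuming the response body has already been parsed into a Python dict; the helper name split_fields, the second sample field "Total" and its reject reason are hypothetical and only illustrate the shape described above.

```python
def split_fields(extraction):
    """Split field_results into accepted texts and rejection reasons."""
    accepted, rejected = {}, {}
    for field in extraction.get("field_results") or []:
        name = field["field_name"]
        if field.get("rejected"):
            # The field should be considered potentially invalid.
            rejected[name] = field.get("reject_reason")
        else:
            result = field.get("result")  # null (None) for a table field
            accepted[name] = result.get("text") if result else None
    return accepted, rejected


# Sample shaped like the documented response; "Total" and its
# reject_reason value are made up for illustration.
sample = {
    "field_results": [
        {
            "field_name": "Amount",
            "result": {"text": "$5.50", "value": None},
            "alternatives": None,
            "rejected": False,
            "reject_reason": "None",
        },
        {
            "field_name": "Total",
            "result": {"text": "5,50", "value": None},
            "alternatives": None,
            "rejected": True,
            "reject_reason": "ExampleReason",
        },
    ]
}

accepted, rejected = split_fields(sample)
```

Keeping rejected fields separate makes it easy to route them to manual review while accepted values flow straight through.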
The primary result and any alternative results are structured like this:
{
  "text": "$5.50",
  "value": null,
  "rejected": false,
  "reject_reason": "None",
  "areas": [
    {
      "top": 558.7115,
      "left": 276.48,
      "bottom": 571.1989,
      "right": 298.58,
      "page_number": 1
    }
  ],
  "proximity_score": 100.0,
  "match_score": 100.0,
  "text_score": 100.0
}
The properties of a result are:
text: The text of the result
value: The value as a non-text type (e.g. Decimal or DateTime), if available
rejected: A flag indicating whether the result should be considered potentially invalid
reject_reason: The reason for rejection of the result
areas: A list of areas from which the result originated
proximity_score: A score indicating how well any proximity rules in the configuration for this field have been met (how close this result is, or isn't, to particular content nearby)
match_score: A score indicating how well the text matched the search criteria
text_score: A score indicating the OCR confidence assigned to the actual text that was extracted
The area co-ordinates are relative to the top left of the page and are in points (1/72 inch). The page number is one-based (i.e. the first page of a document is page 1).
Score values range from 0 to 100, where 100 is a perfect score.
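Because areas are given in points and page numbers are one-based, drawing a result on a rendered page image needs a small conversion. The sketch below assumes the page was rendered at a known DPI; the helper names area_to_pixels and lowest_score, and the 300 DPI default, are illustrative choices rather than part of the API.

```python
POINTS_PER_INCH = 72.0


def area_to_pixels(area, dpi=300):
    """Convert an area in points to pixel coordinates at the given DPI."""
    scale = dpi / POINTS_PER_INCH
    return {
        # Convert the one-based page number to a zero-based index.
        "page_index": area["page_number"] - 1,
        "left": round(area["left"] * scale),
        "top": round(area["top"] * scale),
        "right": round(area["right"] * scale),
        "bottom": round(area["bottom"] * scale),
    }


def lowest_score(result):
    """Return the weakest of the three 0-100 scores, ignoring missing ones."""
    keys = ("proximity_score", "match_score", "text_score")
    scores = [result[k] for k in keys if result.get(k) is not None]
    return min(scores) if scores else None


result = {
    "text": "$5.50",
    "areas": [
        {"top": 558.7115, "left": 276.48, "bottom": 571.1989,
         "right": 298.58, "page_number": 1}
    ],
    "proximity_score": 100.0,
    "match_score": 100.0,
    "text_score": 100.0,
}

box = area_to_pixels(result["areas"][0])
weakest = lowest_score(result)
```

Taking the lowest of the three scores is one conservative way to decide whether a result needs review; other policies (for example, checking only text_score) may suit a given workflow better.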
Getting a response in redaction request format
This endpoint can also be used to obtain a response that can be passed directly to the Get redacted PDF endpoint to get a PDF with all extracted data redacted.
If the Accept header is application/vnd.waives.requestformats.redact+json, then the response you receive will be a redaction request that will redact all data extracted from the document. You can either send this directly in a request to the Get redacted PDF endpoint or edit it first.
One redaction mark is created for every non-empty result and alternative result for every field.
Each redaction mark is labelled with the extraction field it came from to help you if you want to edit it, for example by removing marks for specific fields.
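The choice between the two response formats comes down to the Accept header. The following is a minimal sketch of building request headers for either case; the function name build_headers is hypothetical, and the use of a Bearer scheme in the Authorization header is an assumption, since this page only states that an access token is required.

```python
# Media types as given in this page.
EXTRACT_DATA = "application/vnd.waives.resultformats.extractdata+json"
REDACTION_REQUEST = "application/vnd.waives.requestformats.redact+json"


def build_headers(access_token, as_redaction_request=False):
    """Build headers that select the desired response format."""
    accept = REDACTION_REQUEST if as_redaction_request else EXTRACT_DATA
    return {
        # Assumption: the access token is sent as a Bearer token.
        "Authorization": f"Bearer {access_token}",
        "Accept": accept,
    }


data_headers = build_headers("my-token")
redact_headers = build_headers("my-token", as_redaction_request=True)
```

These dicts can be passed to whichever HTTP client you use when calling the endpoint.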
RESPONSES
200 Data was extracted from the document and results are in the response
400 No document ID or no extractor name was specified
401 There is no Authorization header or the access token is invalid
404 The specified document does not exist or the specified extractor does not exist
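A client might translate these status codes into errors along the following lines. This is only a sketch: the ExtractionError class and check_status helper are hypothetical names, and the messages simply restate the response descriptions listed above.

```python
STATUS_MEANINGS = {
    400: "No document ID or no extractor name was specified",
    401: "There is no Authorization header or the access token is invalid",
    404: "The specified document or the specified extractor does not exist",
}


class ExtractionError(Exception):
    """Raised when the extract endpoint returns a non-200 status."""

    def __init__(self, status_code, message):
        super().__init__(f"{status_code}: {message}")
        self.status_code = status_code


def check_status(status_code):
    """Return normally on 200; raise ExtractionError otherwise."""
    if status_code == 200:
        return
    message = STATUS_MEANINGS.get(status_code, "Unexpected response status")
    raise ExtractionError(status_code, message)
```

Calling check_status with the status code of a response before parsing its body keeps the error handling in one place.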