Extract document data

Overview

This endpoint extracts data from the specified document using an extractor. By default it returns details of the extracted data. It can also return a redaction request that can be passed directly to the Get redacted PDF endpoint to produce a PDF with all extracted data redacted.

On-demand reading

Documents created from image files (TIFF, JPEG, JPEG2000) and PDFs that contain only images are automatically read (OCRed) before extraction is performed. For small documents this is usually very quick, but for very large documents you should expect a longer response time.

Extracting invoice data

To extract invoice data from UK invoices you can use the built-in extractor named aluma.invoices.gb. For more information, see Extracting invoice data.

Extracted data results

If the Accept header is not set or is application/vnd.waives.resultformats.extractdata+json, then the response contains details of the data extracted from the document.

The field_results section of the response contains the data extracted from the document. This is an array containing one element for each field in the extractor configuration. Each field looks like this:

{
  "field_name": "Amount",
  "result": {
    "text": "$5.50",
    "value": null,
    "rejected": false,
    "reject_reason": "None",
    "areas": [
      {
        "top": 558.7115,
        "left": 276.48,
        "bottom": 571.1989,
        "right": 298.58,
        "page_number": 1
      }
    ],
    "proximity_score": 100.0,
    "match_score": 100.0,
    "text_score": 100.0
  },
  "alternatives": null,
  "tabular_results": null,
  "rejected": false,
  "reject_reason": "None"
}

The properties of the field are:

field_name: The name of the field
result: The primary result for the field (null for a table field)
rejected: A flag indicating whether the field results should be considered potentially invalid
reject_reason: The reason for rejection of the field
alternatives: Secondary (alternative) results for the field
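
As an illustration of how these properties might be consumed, the following sketch collects the primary result text of every field that has not been rejected. It assumes the parsed JSON response body is a dictionary with field_results as a top-level key; the function name accepted_field_texts and the sample data are hypothetical and are not part of the API.

```python
from typing import Any


def accepted_field_texts(response: dict[str, Any]) -> dict[str, str]:
    """Map field_name to primary result text for usable fields.

    Skips fields flagged as rejected and fields whose primary
    result is null (for example, table fields).
    """
    texts: dict[str, str] = {}
    for field in response.get("field_results") or []:
        result = field.get("result")
        if field.get("rejected") or result is None:
            continue
        texts[field["field_name"]] = result["text"]
    return texts


# Hypothetical sample, trimmed to the properties used above.
sample = {
    "field_results": [
        {"field_name": "Amount", "rejected": False,
         "result": {"text": "$5.50"}},
        {"field_name": "Date", "rejected": True,
         "result": {"text": "??"}},
        {"field_name": "Lines", "rejected": False, "result": None},
    ]
}

print(accepted_field_texts(sample))
```

Rejected fields are dropped here for simplicity; a real integration might instead route them to manual review using reject_reason.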

The primary result and any alternative results are structured like this:

{
  "text": "$5.50",
  "value": null,
  "rejected": false,
  "reject_reason": "None",
  "areas": [
    {
      "top": 558.7115,
      "left": 276.48,
      "bottom": 571.1989,
      "right": 298.58,
      "page_number": 1
    }
  ],
  "proximity_score": 100.0,
  "match_score": 100.0,
  "text_score": 100.0
}

The properties of a result are:

text: The text of the result
value: The value as a non-text type (e.g. Decimal or DateTime), if available
rejected: A flag indicating whether the result should be considered potentially invalid
reject_reason: The reason for rejection of the result
areas: A list of areas from which the result originated
proximity_score: A score indicating how well any proximity rules in the configuration for this field have been met (how close this result is, or isn't, to particular content nearby)
match_score: A score indicating how well the text matched the search criteria
text_score: A score indicating the OCR confidence assigned to the actual text that was extracted

The area co-ordinates are relative to the top left of the page and are in points (1/72 inch). The page number is one-based (i.e. the first page of a document is page 1).
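
If you need to draw an area on a rendered page image, the point co-ordinates have to be scaled to pixels. The sketch below shows one way to do that, assuming the page image was rendered at a known resolution; the function name area_to_pixels, the default of 300 DPI and the page_index key are illustrative assumptions, not part of the API.

```python
def area_to_pixels(area: dict, dpi: float = 300.0) -> dict:
    """Convert an area from points (1/72 inch) to pixel co-ordinates.

    The origin stays at the top left of the page. The one-based
    page_number is also converted to a zero-based page_index.
    """
    scale = dpi / 72.0  # pixels per point at the given resolution
    return {
        "top": round(area["top"] * scale),
        "left": round(area["left"] * scale),
        "bottom": round(area["bottom"] * scale),
        "right": round(area["right"] * scale),
        "page_index": area["page_number"] - 1,
    }


# Hypothetical area: one inch from the left, half an inch from the top.
example_area = {"top": 36.0, "left": 72.0, "bottom": 54.0,
                "right": 144.0, "page_number": 1}
print(area_to_pixels(example_area))
```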

Score values range from 0 to 100, where 100 is a perfect score.
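
One possible use of the scores is to flag results for review when any score falls below a cut-off. The sketch below assumes a threshold of 80, which is an arbitrary illustrative value and not a recommendation from the API; the function name weak_scores is also hypothetical.

```python
SCORE_KEYS = ("proximity_score", "match_score", "text_score")


def weak_scores(result: dict, threshold: float = 80.0) -> list[str]:
    """Return the names of the score properties below the threshold.

    Scores that are missing or null are ignored rather than
    treated as failures.
    """
    return [
        key for key in SCORE_KEYS
        if result.get(key) is not None and result[key] < threshold
    ]


# Hypothetical result with a weaker match score.
example_result = {"text": "$5.50", "proximity_score": 100.0,
                  "match_score": 75.0, "text_score": 90.0}
print(weak_scores(example_result))
```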

Getting a response in redaction request format

This endpoint can also be used to obtain a response that can be passed directly to the Get redacted PDF endpoint to get a PDF with all extracted data redacted.

If the Accept header is application/vnd.waives.requestformats.redact+json, then the response is a redaction request that will redact all data extracted from the document. You can either send it unchanged in a request to the Get redacted PDF endpoint or edit it first.
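
The choice between the two response formats comes down to the Accept header. The sketch below builds the request headers for either case using the two media types named in this document. It assumes the access token is sent as a Bearer token in the Authorization header, which is not stated here; the function name build_headers is hypothetical.

```python
EXTRACT_DATA_FORMAT = "application/vnd.waives.resultformats.extractdata+json"
REDACTION_REQUEST_FORMAT = "application/vnd.waives.requestformats.redact+json"


def build_headers(access_token: str,
                  as_redaction_request: bool = False) -> dict[str, str]:
    """Build request headers for the extract endpoint.

    Assumption: the API expects 'Authorization: Bearer <token>'.
    """
    accept = (REDACTION_REQUEST_FORMAT if as_redaction_request
              else EXTRACT_DATA_FORMAT)
    return {
        "Authorization": f"Bearer {access_token}",
        "Accept": accept,
    }


# Placeholder token for illustration only.
print(build_headers("example-token", as_redaction_request=True))
```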

One redaction mark is created for every non-empty result and alternative result for every field.

Each redaction mark is labelled with the extraction field it came from to help you if you want to edit it, for example by removing marks for specific fields.

RESPONSES

200 Data was extracted from the document and results are in the response
400 No document ID is specified or no extractor name is specified
401 There is no Authorization header or the access token is invalid
404 The specified document does not exist or the specified extractor does not exist
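
A client might translate these status codes into readable messages for logging or error handling. The following is a minimal sketch under that assumption; the names STATUS_MEANINGS and describe_status are hypothetical, and the fallback message for unlisted codes is an addition for illustration.

```python
STATUS_MEANINGS = {
    200: "Data was extracted from the document",
    400: "No document ID or no extractor name is specified",
    401: "Authorization header is missing or the access token is invalid",
    404: "The specified document or extractor does not exist",
}


def describe_status(status_code: int) -> str:
    """Return a readable description for a response status code."""
    return STATUS_MEANINGS.get(
        status_code, f"Unexpected status code {status_code}"
    )


print(describe_status(404))
```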