Basic concepts

Custom extractors are created in Extraction Builder by combining extraction components to locate the data you need and output it to fields in the results.

There are quite a few components that you can use, and each component has a variety of settings to control its behaviour and handle more advanced scenarios. A few basic concepts will help you to understand how everything fits together.

Locating candidate results

These components allow you to locate candidate results in the text:

  • Searches locate text with a fixed or variable format such as a reference number, postal code or anything that can be matched using a regular expression.

  • Static Areas allow you to restrict a search to a specific fixed area, and Area Modifiers can move the entire search area by a specific amount.

  • Proximity Rules and Area Constraints allow you to find data by its position relative to other text or features of the document.

  • Dictionaries look for a fixed set of possible values in the text, or in a subset of the text limited by a search. Matchers do the same but against records in a database.

  • The Mark Sense component enables you to determine whether a mark is present in a specific area.

  • The Table component captures tabular data.
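
To make the difference between a Search and a Dictionary more concrete, here is a minimal Python sketch of the two ideas. It does not use Extraction Builder itself: the sample text, the INV-12345 reference format, the city list and the function names are all assumptions made for illustration.

```python
import re

# Hypothetical sample text, invented for this illustration.
text = "Invoice ref INV-20417 shipped to London SW1A 1AA on 3 March."

# Search-style lookup: a variable value with a known format,
# here an assumed reference number of "INV-" followed by five digits.
REFERENCE = re.compile(r"\bINV-\d{5}\b")

# Dictionary-style lookup: a fixed set of possible values (assumed list).
CITIES = {"London", "Paris", "Berlin"}


def find_references(s):
    """Return every substring that matches the reference format."""
    return [m.group(0) for m in REFERENCE.finditer(s)]


def find_dictionary_values(s, values):
    """Return the dictionary values that occur as whole words in s."""
    return [v for v in sorted(values)
            if re.search(rf"\b{re.escape(v)}\b", s)]


references = find_references(text)
cities = find_dictionary_values(text, CITIES)
```

In this sketch the regular expression plays the role of a Search, and the fixed set of city names plays the role of a Dictionary. Each call returns a list of candidate results that later steps could evaluate, format and validate.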

Evaluating different results

Where there is some variation in where the data appears in the document, or in how it is presented or identified, you may need to use multiple techniques to locate the data. For these cases there is a set of components that allow you to fall back to alternative logic, choose between possible results, order alternative results, merge results and more.
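
The fallback idea can be sketched in a few lines of Python. This is only an illustration of the general technique, not how the product implements it: the patterns, the label "Invoice No" and the function name first_match are assumptions.

```python
import re

# Ordered list of assumed strategies: the most reliable one comes first.
PATTERNS = [
    r"Invoice\s+No\.?\s*:\s*(\S+)",  # preferred: value next to a label
    r"\b(INV-\d{5})\b",              # fallback: bare reference format
]


def first_match(text, patterns):
    """Try each pattern in order and return the first captured value.

    Returns None when no strategy produces a result.
    """
    for pattern in patterns:
        m = re.search(pattern, text)
        if m:
            return m.group(1)
    return None
```

For example, first_match("Invoice No: A123", PATTERNS) uses the labelled pattern, while a document that only contains "see INV-00042 attached" is handled by the second pattern.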

Filtering, validating and formatting data

You can apply one of the Formatter components to any result you locate so that it adheres to a specific data type or text structure, using substitutions, search-and-replace operations or data conversions.

You can also use a Validator component to ensure that all results contain only data that is valid for your downstream process.
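
As a rough illustration of formatting followed by validation, the Python sketch below converts date strings to a single layout and then discards anything that could not be converted. The accepted input layouts, the ISO 8601 output and the function names are assumptions chosen for the example, not product behaviour.

```python
from datetime import datetime

# Assumed input layouts, tried in order.
LAYOUTS = ("%d/%m/%Y", "%d-%m-%Y", "%Y-%m-%d")


def format_date(raw):
    """Formatter step: convert a date string to ISO 8601, or None if unparseable."""
    for layout in LAYOUTS:
        try:
            return datetime.strptime(raw.strip(), layout).date().isoformat()
        except ValueError:
            continue
    return None


def keep_valid(raw_values):
    """Validator step: keep only the values that formatted successfully."""
    formatted = (format_date(v) for v in raw_values)
    return [v for v in formatted if v is not None]
```

Here keep_valid(["03/04/2021", "31/02/2021"]) keeps only the first value, because 31 February is not a real date and so never reaches the downstream process.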

Outputting results

Fields are the final containers of the extracted data. These are what you will receive back in your extraction results for each document.