Searches

The Search component is the item you will use most often inside the Extraction Builder, and it is the main way of using a regular expression to locate data by its format. A Search can be a self-contained extraction technique if the data has a sufficiently distinctive format, but in most cases it is only a starting point and the results will need to be refined further.

A Search looks inside the document text content for matches of a regular expression, subject to some additional properties, and holds one or more results for further processing.

By default, a match can occur anywhere within the full document text content, but the Search can be adjusted so that a match is only returned if the entire regular expression can be satisfied within a more restrictive section of text:

  • A single page - this allows data to flow between paragraphs but not over page boundaries
  • A paragraph (a logical grouping of connected text lines) - unrelated text in separate paragraphs will not be extracted as a single match. For dates, this is probably the best balance between caution and flexibility, which is why it is the setting used by the Aluma date extraction module.
  • A text line - with this option set, data that flows across text lines will no longer be returned
  • A cell (a group of words between tab breaks) - this is the most restrictive option. With it set, data containing a large tab spacing will no longer be returned correctly, nor will any data that flows across text lines
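
The effect of these scopes can be illustrated with a minimal Python sketch. It is not the Extraction Builder's implementation: it assumes a plain-text model in which a form feed separates pages, a blank line separates paragraphs, a newline separates text lines and a tab separates cells. The names search, SPLITTERS, DATE and sample are hypothetical and exist only for this illustration.

```python
import re

# Hypothetical scope names mirroring the options above (not Aluma identifiers).
# Each scope maps to the separators that a single match may NOT cross.
# Assumed text model: \f = page break, blank line = paragraph break,
# \n = line break, \t = cell (tab) break.
SPLITTERS = {
    "document": None,
    "page": r"\f",
    "paragraph": r"\f|\n[ \t]*\n",
    "line": r"\f|\n",
    "cell": r"\f|\n|\t",
}

def search(text, pattern, scope="document"):
    """Return every match of pattern that fits entirely inside one unit of scope."""
    splitter = SPLITTERS[scope]
    units = [text] if splitter is None else re.split(splitter, text)
    results = []
    for unit in units:
        # The whole regular expression must be satisfied within this unit.
        results.extend(m.group(0) for m in re.finditer(pattern, unit))
    return results

# A simple day-month-year pattern; \s+ allows the date to span whitespace,
# including line, tab and page breaks.
DATE = r"\d{1,2}\s+[A-Z][a-z]+\s+\d{4}"

# First date flows across a text line; second date contains a tab break.
sample = "Due 12\nMarch 2024\n\nPaid 3\tApril 2023"

for scope in ("document", "paragraph", "line", "cell"):
    print(scope, search(sample, DATE, scope))
```

In this sketch both dates are found at document and paragraph scope, only the second date survives at line scope because the first one flows across a text line, and neither is found at cell scope because the second one contains a tab break.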

The Search results are a group of Alternatives, which by default are ordered left to right, top to bottom, as they appear in the document.
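
As a rough illustration of this default ordering, the sketch below sorts results into reading order. It assumes that each result carries a page number, a line number and a horizontal offset, which is an assumption made for the example and not a description of how Aluma stores Alternatives. The Alternative class and the reading_order function are hypothetical names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Alternative:
    """Hypothetical stand-in for one Search result and its position."""
    text: str
    page: int    # page number, starting at 1 (assumed)
    line: int    # text line index on the page, counted from the top (assumed)
    left: float  # horizontal offset of the match from the left edge (assumed)

def reading_order(alternatives):
    """Sort top to bottom (page, then line), then left to right within a line."""
    return sorted(alternatives, key=lambda a: (a.page, a.line, a.left))

found = [
    Alternative("31 Jan 2024", page=1, line=8, left=40.0),
    Alternative("15 Feb 2024", page=1, line=2, left=300.0),
    Alternative("01 Jan 2024", page=1, line=2, left=55.0),
]

ordered = [a.text for a in reading_order(found)]
print(ordered)
```

The two results on line 2 come first, leftmost first, followed by the result further down the page.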

These results can be passed directly to a field, passed to other components that reorder or combine the results, or used for another purpose (for example, when the search is finding nearby keywords rather than the actual target data).