Search for text relative to other text

This article explains how to use geometric and logical proximity rules to find some target text relative to a keyword. If you aren't familiar with proximity rules, you may find it helpful to read the Proximity Rules overview article.

As a reminder, the Search is the component that looks inside the document's text content for text matching a regular expression, based on the parameters you specify, and holds one or more results for further processing.

The Field is the final container for the value you are extracting – this will display in Document Studio’s testing interface, and in the output file or response from the extraction in the Aluma service.

[Figure: A 'value' Search connected to a Field]

Connecting a single Search to a Field via Provide Results/Parameters picks up every piece of text in the document that matches the regular expression, accumulates the matches into a list, and provides that list to the Field. However, a more common requirement is to narrow this down to one piece of data which lies close to a known heading or keyword.
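To make the "accumulate all matches" behaviour concrete, here is a minimal Python sketch. It is not Aluma's implementation; the function name find_all and the sample text are hypothetical and only illustrate why a single regular expression often returns more values than you want.

```python
import re


def find_all(pattern: str, text: str) -> list[str]:
    """Return every non-overlapping match of `pattern` in `text`, in document order.

    Hypothetical stand-in for a single Search feeding a Field directly.
    """
    return [match.group(0) for match in re.finditer(pattern, text)]


# Sample text containing three dates; only one of them is the invoice date.
text = "Invoice date: 01/02/2023  Due date: 15/02/2023  Paid: 20/02/2023"
dates = find_all(r"\d{2}/\d{2}/\d{4}", text)
print(dates)  # ['01/02/2023', '15/02/2023', '20/02/2023']
```

All three dates are returned, which is why a second, keyword-based search is usually needed to pick out the one you want.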

We do this by adding another Search component, one which uses a regular expression to find the keyword, and chaining it with our existing one. Instead of using Provide Results/Parameters as the connector, we will use a Proximity Rule.

Select either a Geometric Proximity Rule or a Logical Proximity Rule connector in the toolbox and drag from the 'keyword' Search to the 'value' Search.

[Figure: A 'keyword' Search connected to a 'value' Search via a proximity rule]

Geometric proximity rule

Often you will see important data in a fairly consistent, expected location with headings or labels indicating which data is which. This is true of fixed-form documents such as order forms, invoices, or legal records.

[Figure: The scenario where you would use a geometric proximity rule]

When you see data in blocks such as this, the correct extraction technique to use is the Geometric Proximity Rule.

The relative positions of the 'keyword' Search and 'value' Search items on the canvas are significant. Document Studio infers that the keyword text lies in the same direction (north, south, west, etc.) from the value text in the document as the 'keyword' Search does from the 'value' Search on the canvas, and it populates the Regions accordingly. You can adjust the Regions later, independently of the positions on the canvas.
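The idea behind a geometric region can be sketched in Python as follows. This is an illustration only, not how Aluma evaluates rules: the Box and Hit types, the function names, the reach parameter, and the coordinates are all assumptions. Here value_direction names where the value lies as seen from the keyword, and y coordinates increase down the page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    left: float
    top: float
    right: float
    bottom: float  # y grows downwards, so bottom > top


@dataclass(frozen=True)
class Hit:
    text: str
    box: Box


def region_beside(keyword: Box, value_direction: str, reach: float) -> Box:
    """Build a region extending `reach` units from the keyword box."""
    if value_direction == "east":
        return Box(keyword.right, keyword.top, keyword.right + reach, keyword.bottom)
    if value_direction == "west":
        return Box(keyword.left - reach, keyword.top, keyword.left, keyword.bottom)
    if value_direction == "south":
        return Box(keyword.left, keyword.bottom, keyword.right, keyword.bottom + reach)
    if value_direction == "north":
        return Box(keyword.left, keyword.top - reach, keyword.right, keyword.top)
    raise ValueError(f"unknown direction: {value_direction}")


def overlaps(a: Box, b: Box) -> bool:
    """True when the two boxes share some area."""
    return a.left < b.right and b.left < a.right and a.top < b.bottom and b.top < a.bottom


def values_near(keyword: Hit, values: list[Hit], value_direction: str, reach: float) -> list[Hit]:
    """Keep only the value hits that fall inside the region next to the keyword."""
    region = region_beside(keyword.box, value_direction, reach)
    return [value for value in values if overlaps(value.box, region)]


keyword = Hit("Invoice No:", Box(50, 100, 130, 112))
values = [
    Hit("INV-1001", Box(140, 100, 200, 112)),  # on the same line, to the right
    Hit("INV-2002", Box(140, 300, 200, 312)),  # much further down the page
]
print([hit.text for hit in values_near(keyword, values, "east", 100)])  # ['INV-1001']
```

Only the value sitting directly to the right of the keyword survives. Enlarging reach, or testing a second region, corresponds loosely to adjusting or adding Regions on the rule.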

Logical proximity rule

Sometimes the data you wish to extract is embedded in free-flowing text whose contents vary so much that it is impractical to predict where a value will be found geometrically. Rather than using fixed distances and directions to a known keyword, you need to specify looser, more logical requirements. For example, you might want a rule that expects a value before or after the keyword within the same sentence, perhaps within a specified number of words or characters.

[Figure: The scenario where you would use a logical proximity rule]

When you see inline data such as this, the correct extraction technique to use is the Logical Proximity Rule.

Placing the 'keyword' Search above or to the left of the 'value' Search signifies that the keyword is expected to come logically before the value in the document. In an unstructured document, for example, the keyword may appear earlier in a paragraph that flows onto the next line or page. The keyword might then be geometrically further to the right of, or even lower than, the value in the coordinate system, yet Aluma still treats it as coming before the value.
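A rough Python sketch of "logically after the keyword, within a number of words" is shown below. It works on reading order only and ignores page coordinates. The function name values_after_keyword, the max_words parameter, and the sample contract text are hypothetical, and this is not a description of how Aluma implements the rule.

```python
import re


def values_after_keyword(text: str, keyword_pattern: str, value_pattern: str, max_words: int) -> list[str]:
    """Return values that follow a keyword match within `max_words` words, in reading order."""
    value_re = re.compile(value_pattern)
    found = []
    for keyword in re.finditer(keyword_pattern, text):
        # Only consider values that start after the end of the keyword.
        for value in value_re.finditer(text, keyword.end()):
            gap = text[keyword.end():value.start()]
            # Count the words between keyword and value; line breaks do not matter.
            if len(re.findall(r"\w+", gap)) <= max_words:
                found.append(value.group(0))
    return found


text = (
    "The agreement commences on the effective date, which\n"
    "shall be 12 March 2021, and expires on 11 March 2024."
)
found = values_after_keyword(text, r"effective date", r"\d{1,2} \w+ \d{4}", max_words=5)
print(found)  # ['12 March 2021']
```

The first date starts on the next line, so on the page it sits to the left of and below the keyword, but it is only three words later in reading order and is therefore accepted. The second date is nine words away and is rejected.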

Adjusting the rules and searches

The properties to be set on the components are entirely dependent on the extraction scenario you are trying to cover. The default settings should work well if your value is near to the keyword in the relative direction laid out on the canvas.

However, if your value is still not found, consider the following:

  • Change the Region details to encompass a larger area, or add another Region if the value can lie in a different direction (for example, slightly north-west rather than always directly north).

  • Make sure the 'keyword' and 'value' searches are actually finding results. You can use the Show Extraction Details option within Document Studio, or temporarily connect each keyword search to its own field

More information on debugging is covered by a later article.