Data extraction

Aluma makes it easy to reliably extract data from any document, whether it’s a simple reference number or complex data whose location varies widely. It’s easy to get started with, yet powerful enough to handle the most challenging scenarios.

To extract data from your documents, you build an extractor that specifies how the data should be located (and optionally formatted and validated).

There are two different ways to create extractors:

  • Create an extractor by combining one or more modules from Aluma's library of data extraction modules
  • Create an extractor with custom behaviour from Aluma's comprehensive toolkit of data extraction techniques

You can mix and match these approaches if you wish, using off-the-shelf modules where possible and building a custom extractor only where no module meets your needs.

For convenience, some pre-configured extractors, such as invoice extractors, are available for you to use without any configuration.

Data extraction with extraction modules

Extraction modules are building blocks that contain the configuration for extracting common types of data. They can easily be combined into an extractor, which can then be used on your documents.

Aluma provides a library of ready-to-use modules: general-purpose modules for common data like dates and reference numbers, modules for extracting personal information, and modules for invoice data.

Modules sometimes have settings that control their behaviour and enable a basic level of customisation.
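To make the building-block idea concrete, here is a minimal conceptual sketch in plain Python. It does not use Aluma's API, CLI or module format; the names Module, date_module, reference_module and build_extractor, and the day_first and prefix settings, are all hypothetical and exist only to illustrate how modules with settings might be combined into one extractor.

```python
import re
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Optional


@dataclass
class Module:
    """Hypothetical building block: one field plus how to locate, format and validate it."""
    field: str
    pattern: str                                  # how the value is located (first capture group)
    formatter: Callable[[str], str] = str.strip   # optional formatting step
    validator: Callable[[str], bool] = bool       # optional validation step (non-empty by default)

    def run(self, text: str) -> Optional[str]:
        # Return the first located value that survives formatting and validation.
        for match in re.finditer(self.pattern, text):
            value = self.formatter(match.group(1))
            if self.validator(value):
                return value
        return None


def date_module(day_first: bool = True) -> Module:
    """Illustrative module with one setting that changes its behaviour."""
    fmt = "%d/%m/%Y" if day_first else "%m/%d/%Y"

    def to_iso(raw: str) -> str:
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            return ""  # empty string fails the default validator

    return Module("date", r"\b(\d{2}/\d{2}/\d{4})\b", to_iso)


def reference_module(prefix: str = "REF") -> Module:
    """Illustrative module for reference numbers such as INV-20431."""
    return Module("reference", rf"\b({re.escape(prefix)}-\d+)\b")


def build_extractor(*modules: Module) -> Callable[[str], dict]:
    """Combine modules into a single callable extractor."""
    def extract(text: str) -> dict:
        return {m.field: m.run(text) for m in modules}
    return extract


extractor = build_extractor(date_module(day_first=True), reference_module("INV"))
result = extractor("Invoice INV-20431 issued 03/02/2024")
print(result)
```

In this sketch, changing the day_first setting changes how the same text is interpreted, which is the kind of basic customisation that module settings are meant to provide.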

You can create an extractor from modules in several ways:

  • In the Aluma dashboard, although not all of the flexibility of modules is currently available here
  • Using the CLI
  • Using the API or client libraries

For more detail on building extractors this way, see Data extraction with modules. You may also find the Extract data using modules guide helpful if you want to get hands-on.

Custom data extraction

Extractors can also be created using the Extraction Builder and Document Studio desktop tools.

Extraction Builder uses a code-free flow diagram approach. Simple core concepts such as fields, searches, proximity rules, area constraints, evaluators and formatters can be flexibly combined to create powerful extraction logic, all within the user-friendly project environment provided by Microsoft Visual Studio.
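Extraction Builder itself is code-free, so the following is only a conceptual sketch in plain Python of how a search, an area constraint, a proximity rule and a formatter might combine to locate a field. Every name here (Word, search, in_area, nearest_right_of, format_amount) and the page coordinates are assumptions made for illustration, not part of Extraction Builder.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    text: str
    x: float  # left edge, arbitrary page units
    y: float  # top edge, arbitrary page units


def search(words, pattern):
    """Search: every word whose text fully matches the pattern."""
    return [w for w in words if re.fullmatch(pattern, w.text)]


def in_area(word, top, bottom):
    """Area constraint: keep only words inside a horizontal band of the page."""
    return top <= word.y <= bottom


def nearest_right_of(anchor, candidates, max_dy=5.0):
    """Proximity rule: closest candidate to the right of the anchor on roughly the same line."""
    same_line = [c for c in candidates
                 if c.x > anchor.x and abs(c.y - anchor.y) <= max_dy]
    return min(same_line, key=lambda c: c.x - anchor.x, default=None)


def format_amount(raw):
    """Formatter: strip a leading currency symbol and thousands separators."""
    return raw.lstrip("£$€").replace(",", "")


words = [
    Word("Subtotal:", 300, 700), Word("£1,000.00", 400, 700),
    Word("Total:", 300, 720), Word("£1,200.00", 400, 720),
    Word("£99.00", 400, 100),  # outside the area of interest
]

labels = search(words, r"Total:")
amounts = [w for w in search(words, r"[£$€]?[\d,]+\.\d{2}") if in_area(w, 600, 800)]
hit = nearest_right_of(labels[0], amounts) if labels else None
total = format_amount(hit.text) if hit else None
print(total)
```

The label search anchors the field, the area constraint discards amounts elsewhere on the page, and the proximity rule picks the amount on the same line as the label, so the value is found even if the label moves between documents.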

Document Studio provides an environment for local testing of the extractor on your documents.

For more detail on building extractors this way, see Custom data extraction. You may also find the Extract data with a custom extractor guide helpful if you want to get hands-on.

Extracting data from documents in different (non-English) languages

Aluma data extraction can be used with all documents, whichever language (or languages) they contain. Extraction modules and Extraction Builder components can include searches, formatters and validators that use characters, formats and validation rules relevant to any language.

If your documents are image-based, and therefore need to be read by Aluma before data extraction, you should specify the appropriate language(s) by creating and using a read-profile, so that you get optimal results. See Read documents with different languages for instructions on how to do this.