Extract data using modules

About this guide

Aluma makes it easy to extract data from any document, including images that first require OCR.

In this guide we'll extract a variety of data from some sample documents using an extractor built from Aluma's extraction modules.

Working through this guide should take about 10 minutes.

📘

Other ways to configure data extraction

You can also create custom extractors using the Extraction Builder Visual Studio plugin. This approach gives you access to the full toolkit of data extraction capabilities if you need something more customised than is available through extraction modules. The Extract data with a custom extractor guide provides an introduction to the Extraction Builder, and the Custom data extraction section of the docs provides more information on how to build custom extractors.

Before you begin

Before you start you must have:

  • Installed the Aluma CLI and logged in to connect it to your account
  • Installed the example documents

If you have not completed these steps, follow the Getting started guide and then return to this one.

Recap: Create an extractor directly from basic modules

In the Getting started guide you created an extractor that extracts the names and social security numbers from the Dive Record documents.

As a reminder, here's how we did that (you don't need to run these commands again now):

> aluma create extractor from-modules dive-record aluma.name aluma.us_social_security_number
Creating extractor 'dive-record'... [OK]

> aluma extract dive-record examples/dive-records/*.pdf

File                   Name                     SSN
dive-record-0001.pdf   Jenna Horton             487-99-4312
dive-record-0002.pdf   Jarod Dunlop             004-31-5237
dive-record-0003.pdf   Mrs Sara R Moose         734-63-2750
dive-record-0004.pdf   Chris M Spencer          690-07-9479
dive-record-0005.pdf   Margaret J Reinhart      225-19-0469

Some modules have parameters that allow you to configure their behaviour. Some parameters are required and must be specified in order to use the module; others are optional and can be specified to modify the module's behaviour.

In the next section we'll add a Date of Birth module to our extractor, which requires a locale parameter so it knows how to interpret short dates.
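To see why the locale matters, consider a short date such as 1/2/1980, which reads as 2 January in the US but as 1 February in the UK. The following is a minimal Python sketch of that ambiguity, not a description of how Aluma handles locales internally. The SHORT_DATE_FORMATS table and the parse_short_date helper are hypothetical names used only for illustration.

```python
from datetime import datetime

# Hypothetical mapping from locale to short-date pattern, for illustration
# only. This is an assumption, not Aluma's implementation.
SHORT_DATE_FORMATS = {
    "en-US": "%m/%d/%Y",  # month first
    "en-GB": "%d/%m/%Y",  # day first
}


def parse_short_date(text: str, locale: str) -> datetime:
    """Parse a short date string using the pattern for the given locale."""
    return datetime.strptime(text, SHORT_DATE_FORMATS[locale])


us = parse_short_date("1/2/1980", "en-US")
gb = parse_short_date("1/2/1980", "en-GB")

print(us.month, us.day)  # 1 2  (2 January)
print(gb.month, gb.day)  # 2 1  (1 February)
```

The same text yields two different dates, which is why the module needs to be told which convention the documents use.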

📘

What modules are available?

You can view the details of the modules that are available under the Extraction Modules category here in the documentation.

You can also use the Aluma CLI list modules command.

Create an extractor from an extractor template

In order to use modules with parameters, we must create the extractor from an extractor template file in which we can specify the parameter values (we call these "arguments").

We'll recreate our dive-record extractor using a template and add a module to also capture date of birth.

First, let's delete the existing extractor:

aluma delete extractor dive-record

Now let's create an extractor template containing name, social security number and date of birth modules and write this to a file:

aluma create extractor-template aluma.name aluma.us_social_security_number aluma.date_of_birth -o dive-record-template.json

Open the dive-record-template.json file in an editor. You will see that it looks like this:

{
  "modules": [
    {
      "id": "aluma.name"
    },
    {
      "id": "aluma.us_social_security_number"
    },
    {
      "id": "aluma.date_of_birth",
      "arguments": {
        "locale": ""
      }
    }
  ]
}

The template has one entry for each module we included. The Date of Birth module has a required parameter called locale, so the template includes an empty locale property where we must specify the value (the "argument") for this parameter.
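If a template contains many modules, it can be handy to check for arguments that are still empty before creating the extractor. The sketch below is one way to do that in Python. It assumes that unfilled arguments appear as empty strings, as in the generated template above, and find_empty_arguments is a hypothetical helper, not part of the Aluma CLI.

```python
import json

# Template content as generated in the step above.
TEMPLATE_JSON = """
{
  "modules": [
    {"id": "aluma.name"},
    {"id": "aluma.us_social_security_number"},
    {"id": "aluma.date_of_birth", "arguments": {"locale": ""}}
  ]
}
"""


def find_empty_arguments(template: dict) -> list[tuple[str, str]]:
    """Return (module id, argument name) pairs whose value is an empty string."""
    missing = []
    for module in template.get("modules", []):
        # Modules without parameters have no "arguments" property.
        for name, value in module.get("arguments", {}).items():
            if value == "":
                missing.append((module["id"], name))
    return missing


missing = find_empty_arguments(json.loads(TEMPLATE_JSON))
print(missing)  # [('aluma.date_of_birth', 'locale')]
```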

The Dive Record documents contain US-formatted dates, so we need to specify en-US for the locale argument. Edit the locale argument accordingly and save the file:

...
      "id": "aluma.date_of_birth",
      "arguments": {
        "locale": "en-US"
      }
...
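If you would rather script this edit than make it by hand, the following Python sketch sets an argument on a module within a template. The set_argument helper is a hypothetical name introduced here for illustration. Only the template structure comes from this guide.

```python
import json


def set_argument(template: dict, module_id: str, name: str, value: str) -> dict:
    """Set one argument on the module with the given id, editing in place."""
    for module in template["modules"]:
        if module["id"] == module_id:
            # Create the "arguments" property if the module does not have one.
            module.setdefault("arguments", {})[name] = value
            return template
    raise KeyError(f"module {module_id!r} not found in template")


# A cut-down template containing only the Date of Birth module.
template = json.loads(
    '{"modules": [{"id": "aluma.date_of_birth", "arguments": {"locale": ""}}]}'
)
set_argument(template, "aluma.date_of_birth", "locale", "en-US")
print(json.dumps(template, indent=2))
```

To apply this to the real file, load dive-record-template.json with json.load, call set_argument, and write the result back with json.dump.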

Now let's create a new extractor from the template file:

aluma create extractor from-template dive-record dive-record-template.json

Run data extraction on the sample documents again and now you will also get the dates of birth:

aluma extract dive-record examples/dive-records/*.pdf

File                   Name                     SSN             Date of Birth
dive-record-0001.pdf   Jenna Horton             487-99-4312     7/15/1982
dive-record-0002.pdf   Jarod Dunlop             004-31-5237     1/2/1980
dive-record-0003.pdf   Mrs Sara R Moose         734-63-2750     11/5/1979
dive-record-0004.pdf   Chris M Spencer          690-07-9479     12/11/1990
dive-record-0005.pdf   Margaret J Reinhart      225-19-0469     10/21/1995

Extract dive date and dive number

Now let's extend our extractor to capture the dive date and dive number from the documents.

To capture the dive date, we can use the Date module. This module can be used to capture dates and optionally restrict capture to dates near specific headings (in this case "Dive Date").

To capture the dive number, we can use the Reference Number module. This module can be used to capture any arbitrary number, code or other text with a format specified by a regular expression. As with the Date module, we can also restrict capture to numbers near specific headings (in this case "Dive Number").

We'll add these modules manually to the extractor template.

Open the extractor template file in an editor. Copy the definitions for the two new modules (aluma.date and aluma.reference_number) from the template below and paste them at the end of the modules list, or simply replace the whole file with the contents below:

{
  "modules": [
    {
      "id": "aluma.name"
    },
    {
      "id": "aluma.us_social_security_number"
    },
    {
      "id": "aluma.date_of_birth",
      "arguments": {
        "locale": "en-US"
      }
    },
    {
      "id": "aluma.date",
      "arguments": {
        "locale": "en-US",
        "keywords": "Dive Date"
      }
    },
    {
      "id": "aluma.reference_number",
      "arguments": {
        "format": "DX\\d{4}",
        "keywords": "Dive Number"
      }
    }
  ]
}
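As an alternative to pasting by hand, the two new module definitions could be appended with a short script. This is a sketch only: add_modules is a hypothetical helper, and the module ids and arguments are taken from the template above.

```python
import json

# The two module definitions added in this section. The regular expression is
# a Python raw string here; json.dumps adds the extra backslash on output.
NEW_MODULES = [
    {"id": "aluma.date",
     "arguments": {"locale": "en-US", "keywords": "Dive Date"}},
    {"id": "aluma.reference_number",
     "arguments": {"format": r"DX\d{4}", "keywords": "Dive Number"}},
]


def add_modules(template: dict, new_modules: list) -> dict:
    """Append modules whose id is not already in the template."""
    existing = {module["id"] for module in template["modules"]}
    for module in new_modules:
        if module["id"] not in existing:
            template["modules"].append(module)
            existing.add(module["id"])
    return template


template = {
    "modules": [
        {"id": "aluma.name"},
        {"id": "aluma.us_social_security_number"},
        {"id": "aluma.date_of_birth", "arguments": {"locale": "en-US"}},
    ]
}
add_modules(template, NEW_MODULES)
serialized = json.dumps(template, indent=2)
print(serialized)
```

Skipping ids that are already present is a design choice made for this sketch so that running it twice does not duplicate modules.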

🚧

Escaping backslashes in regular expressions

Note that when specifying regular expressions in a template, you must escape any backslashes with an additional backslash, e.g. \\d rather than \d.
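The effect of the escaping can be checked with any JSON parser. The Python sketch below shows that the escaped form decodes to the intended regular expression, while the single-backslash form is rejected as invalid JSON. It assumes that Python's re module interprets this simple pattern the same way Aluma does, which this guide does not confirm.

```python
import json
import re

# Python raw strings, so the backslashes reach the JSON parser unchanged.
escaped = r'{"format": "DX\\d{4}"}'    # as written in the template
unescaped = r'{"format": "DX\d{4}"}'   # single backslash: not valid JSON

pattern = json.loads(escaped)["format"]
print(pattern)  # DX\d{4}

# The decoded pattern matches a dive number from the sample documents.
matches = re.fullmatch(pattern, "DX9267") is not None
print(matches)  # True

try:
    json.loads(unescaped)
    unescaped_is_valid = True
except json.JSONDecodeError:
    unescaped_is_valid = False
print(unescaped_is_valid)  # False
```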

Now delete the old extractor:

aluma delete extractor dive-record

and recreate it from the template:

aluma create extractor from-template dive-record dive-record-template.json
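Because an extractor is deleted and recreated each time the template changes, you may want to script the two commands. The following is a minimal Python sketch that builds the command lines shown above. The recreate_commands and run_all helpers are hypothetical names, and run_all assumes the aluma CLI is installed and on your PATH.

```python
import subprocess


def recreate_commands(name: str, template_path: str) -> list[list[str]]:
    """Build the delete and create commands used in this guide."""
    return [
        ["aluma", "delete", "extractor", name],
        ["aluma", "create", "extractor", "from-template", name, template_path],
    ]


def run_all(commands: list[list[str]]) -> None:
    """Run each command in order, stopping at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)


commands = recreate_commands("dive-record", "dive-record-template.json")
for argv in commands:
    print(" ".join(argv))
```

Calling run_all(commands) would execute the two commands. It is left uncalled here so that the sketch can be run without the CLI installed.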

Run data extraction on the sample documents again:

aluma extract dive-record examples/dive-records/*.pdf

You should see these results (we've omitted some fields here for simplicity):

File                     Dive Date                Dive Number
dive-record-0001.pdf     9/6/2018                 DX9267
dive-record-0002.pdf     10/6/2018                DX9284
dive-record-0003.pdf     11/6/2018                DX9293
dive-record-0004.pdf     10/6/2018                DX9289
dive-record-0005.pdf     12/6/2018                DX9301