Extract data using modules
About this guide
Aluma makes it easy to extract data from any document, including images that first require OCR.
In this guide we'll extract a variety of data from some sample documents using an extractor built from Aluma's extraction modules.
Working through this guide should take about 10 minutes.
Other ways to configure data extraction
You can also create custom extractors using the Extraction Builder Visual Studio plugin. This approach gives you access to the full toolkit of data extraction capabilities if you need something more customised than extraction modules provide. The Extract data with a custom extractor guide provides an introduction to the Extraction Builder, and the Custom data extraction section of the docs provides more information on how to build custom extractors.
Before you begin
Before you start you must have:
- Installed the Aluma CLI and logged in to connect it to your account
- Installed the example documents
If you have not completed these steps, follow the Getting started guide and then return to this one.
Recap: Create an extractor directly from basic modules
In the Getting started guide you created an extractor that extracts the names and social security numbers from the Dive Record documents.

As a reminder, here's how we did that (you don't need to run these commands again now):
> aluma create extractor from-modules dive-record aluma.name aluma.us_social_security_number
Creating extractor 'dive-record'... [OK]
> aluma extract dive-record examples/dive-records/*.pdf
File                  Name                 SSN
dive-record-0001.pdf  Jenna Horton         487-99-4312
dive-record-0002.pdf  Jarod Dunlop         004-31-5237
dive-record-0003.pdf  Mrs Sara R Moose     734-63-2750
dive-record-0004.pdf  Chris M Spencer      690-07-9479
dive-record-0005.pdf  Margaret J Reinhart  225-19-0469
Some modules have parameters that allow you to configure their behaviour. Some parameters are required before the module can be used; others are optional and only modify the module's behaviour.
In the next section we'll add a Date of Birth module to our extractor, which requires a locale parameter so it knows how to interpret short dates.
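To see why a locale matters for short dates, here is a minimal Python sketch. It is not how Aluma implements locale handling; the `SHORT_DATE_FORMATS` mapping and the `parse_short_date` helper are hypothetical names used only for illustration. It shows the same short date being read month-first (en-US) or day-first (en-GB).

```python
from datetime import datetime

# Hypothetical mapping from locale to short-date layout, for illustration only.
# This is an assumption, not Aluma's actual implementation.
SHORT_DATE_FORMATS = {
    "en-US": "%m/%d/%Y",  # month first
    "en-GB": "%d/%m/%Y",  # day first
}


def parse_short_date(text: str, locale: str) -> datetime:
    """Parse a short date according to the layout assumed for the locale."""
    return datetime.strptime(text, SHORT_DATE_FORMATS[locale])


us = parse_short_date("1/2/1980", "en-US")
gb = parse_short_date("1/2/1980", "en-GB")

print(us.date().isoformat())  # 1980-01-02 (2 January)
print(gb.date().isoformat())  # 1980-02-01 (1 February)
```

Without a locale, a date such as 1/2/1980 is ambiguous, which is why the Date of Birth module needs the parameter.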
What modules are available?
You can view the details of the available modules under the Extraction Modules category in the documentation.
You can also use the Aluma CLI list modules command.
Create an extractor from an extractor template
In order to use modules with parameters, we must create the extractor from an extractor template file in which we can specify the parameter values (we call these "arguments").
We'll recreate our dive-record extractor using a template and add a module to also capture date of birth.
First, let's delete the existing extractor:
aluma delete extractor dive-record
Now let's create an extractor template containing name, social security number and date of birth modules and write this to a file:
aluma create extractor-template aluma.name aluma.us_social_security_number aluma.date_of_birth -o dive-record-template.json
Open the dive-record-template.json file in an editor. You will see that it looks like this:
{
  "modules": [
    {
      "id": "aluma.name"
    },
    {
      "id": "aluma.us_social_security_number"
    },
    {
      "id": "aluma.date_of_birth",
      "arguments": {
        "locale": ""
      }
    }
  ]
}
The template has one section for each module we specified. The Date of Birth module has a required parameter called locale, so a property has been created where we must specify the value ("argument") for this parameter.
The Dive Record documents contain US-formatted dates, so we need to specify en-US for the locale argument. Edit the locale argument accordingly and save the file:
...
  "id": "aluma.date_of_birth",
  "arguments": {
    "locale": "en-US"
  }
...
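Editing the file by hand is all this guide requires. If you would rather script the change, the following Python sketch shows one way to set an argument in a template. The `set_argument` helper is a hypothetical name, not part of the Aluma CLI, and the template is embedded as a string here so the example is self-contained; in practice you would read and write dive-record-template.json.

```python
import json

# Template contents as generated earlier in this guide. In practice you would
# read this from dive-record-template.json instead of embedding it.
template_text = """
{
  "modules": [
    {"id": "aluma.name"},
    {"id": "aluma.us_social_security_number"},
    {"id": "aluma.date_of_birth", "arguments": {"locale": ""}}
  ]
}
"""


def set_argument(template: dict, module_id: str, name: str, value: str) -> None:
    """Set one argument on the first module with the given id (hypothetical helper)."""
    for module in template["modules"]:
        if module["id"] == module_id:
            module.setdefault("arguments", {})[name] = value
            return
    raise KeyError(f"module {module_id!r} not found in template")


template = json.loads(template_text)
set_argument(template, "aluma.date_of_birth", "locale", "en-US")

# Serialise the updated template; write this string back to the file to save it.
updated_text = json.dumps(template, indent=2)
print(updated_text)
```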
Now let's create a new extractor from the template file:
aluma create extractor from-template dive-record dive-record-template.json
Run data extraction on the sample documents again and now you will also get the dates of birth:
aluma extract dive-record examples/dive-records/*.pdf
File                  Name                 SSN          Date of Birth
dive-record-0001.pdf  Jenna Horton         487-99-4312  7/15/1982
dive-record-0002.pdf  Jarod Dunlop         004-31-5237  1/2/1980
dive-record-0003.pdf  Mrs Sara R Moose     734-63-2750  11/5/1979
dive-record-0004.pdf  Chris M Spencer      690-07-9479  12/11/1990
dive-record-0005.pdf  Margaret J Reinhart  225-19-0469  10/21/1995
Extract dive date and dive number
Now let's extend our extractor to capture the dive date and dive number from the documents.

To capture the dive date, we can use the Date module. This module can be used to capture dates and optionally restrict capture to dates near specific headings (in this case "Dive Date").
To capture the dive number, we can use the Reference Number module. This module can be used to capture any arbitrary number, code or other text with a format specified by a regular expression. As with the Date module, we can also restrict capture to numbers near specific headings (in this case "Dive Number").
We'll add these modules manually to the extractor template.
Open the extractor template file in an editor. Copy the definitions for the two new modules (aluma.date and aluma.reference_number) from the template below and paste them at the end of the modules list, or simply replace the file's contents with the whole template:
{
  "modules": [
    {
      "id": "aluma.name"
    },
    {
      "id": "aluma.us_social_security_number"
    },
    {
      "id": "aluma.date_of_birth",
      "arguments": {
        "locale": "en-US"
      }
    },
    {
      "id": "aluma.date",
      "arguments": {
        "locale": "en-US",
        "keywords": "Dive Date"
      }
    },
    {
      "id": "aluma.reference_number",
      "arguments": {
        "format": "DX\\d{4}",
        "keywords": "Dive Number"
      }
    }
  ]
}
Escaping backslashes in regular expressions
Note that when specifying regular expressions in a template, you must escape any backslashes with an additional backslash, e.g. \\d rather than \d.
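The doubling comes from JSON string syntax, where a backslash is itself an escape character. As a minimal sketch, the Python snippet below serialises the Reference Number module definition with the standard `json` library, which applies the escaping for you. The `re` check uses Python's regex engine purely for illustration; it is an assumption that the pattern behaves the same way in Aluma's engine.

```python
import json
import re

# The regular expression itself contains a single backslash.
pattern = r"DX\d{4}"

module = {
    "id": "aluma.reference_number",
    "arguments": {"format": pattern, "keywords": "Dive Number"},
}

# json.dumps escapes the backslash, so the "format" line of the
# serialised text contains DX\\d{4}.
encoded = json.dumps(module, indent=2)
print(encoded)

# Reading the JSON back recovers the original single-backslash pattern.
decoded = json.loads(encoded)
print(decoded["arguments"]["format"] == pattern)

# Illustration only: check the pattern against a dive number using Python's re.
print(bool(re.fullmatch(pattern, "DX9267")))
```

If you write the template by hand, you apply the same escaping yourself, which is why the template contains \\d.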
Now delete the old extractor:
aluma delete extractor dive-record
and recreate it from the template:
aluma create extractor from-template dive-record dive-record-template.json
Now run extraction on the sample documents again:
aluma extract dive-record examples/dive-records/*.pdf
You should see these results (we've omitted some fields here for simplicity):
File                  Dive Date  Dive Number
dive-record-0001.pdf  9/6/2018   DX9267
dive-record-0002.pdf  10/6/2018  DX9284
dive-record-0003.pdf  11/6/2018  DX9293
dive-record-0004.pdf  10/6/2018  DX9289
dive-record-0005.pdf  12/6/2018  DX9301