Extract data with a custom extractor

About this guide

In this guide you'll install the Extraction Builder Visual Studio extension and use it to build a simple custom extractor that will capture the "Maximum Depth" field from the Dive Record documents used in the previous guides.

Custom extractors are useful when there isn't an extractor module that does exactly what you need. Extraction Builder gives you a toolkit of data extraction components that you can piece together, in simple or more complex ways, to capture just what you need.

Working through this guide should take about 30 minutes if you need to install Visual Studio first, or about 15 minutes if you already have it.

Figure: Extraction Builder in Visual Studio 2019

Before you begin - install Visual Studio 2019

Extraction Builder is a Visual Studio 2019 extension, so you'll need a copy of Visual Studio 2019. If you don't already have one you can install the free Visual Studio Community edition. It only takes a few minutes to install.

Select the .NET desktop development workload, and include the Entity Framework 6 Tools component (you can deselect the other optional components if you wish).

Figure: Install the free VS2019 Community Edition

Figure: Select .NET desktop development workload and Entity Framework 6 Tools component

Once you've installed Visual Studio, continue with the rest of this guide.

Install the Extraction Builder extension

  1. Close Visual Studio if it is open.
  2. Download the Aluma Extraction Builder installer.
  3. Double-click the installer (.vsix) file to run it and follow the steps presented to you.

Figure: Install the Extraction Builder extension

  4. Open Visual Studio. You may be prompted to sign in, but you don't need to - just click Not now, maybe later at the bottom of the dialog.

Figure: You do not need to sign in to Visual Studio

Create a custom extractor project

  1. Open Microsoft Visual Studio.
  2. Create a new project from the Empty Project (.NET Framework) project template (it doesn't matter whether you pick the C# or VB version) and call it dive-record-custom.
  3. Once the project has been created, go to the Solution Explorer window, right-click on the project (not the solution) and select Add | New Item...
  4. Add an Extraction item from the list (it will probably be at the bottom) and call it custom-extractor.fpxl. This is your empty extraction configuration.

Using Visual Studio 2017

Visual Studio 2017 doesn't have the "Empty Project" template, so use the "Microsoft C# | Console App" template as it is one of the simplest. In the Solution Explorer, you can delete the Program.cs and App.config files as these are not necessary.

Find your way round the Extraction Builder

The Extraction Builder has three panels which you'll see when an extraction configuration is open. In a vanilla Visual Studio install these will be laid out like the screenshot below, but you can move them around if you want to:

  • The Toolbox panel on the left contains all the components you can use to build custom extractor configurations.
  • The Canvas panel in the middle holds the components that you've added to your configuration.
  • The Properties panel in the bottom right is where you can edit the various properties associated with each component in your configuration.

You can also open the Extraction Explorer panel in the top right of the window. This shows a more structured view of the contents of your configuration and is especially useful in large configurations.

Figure: The Extraction Builder panels

Create an extraction configuration

Now we'll create a very simple configuration that captures the "Maximum Depth" field from the Dive Record example documents we used in previous guides.

The Maximum Depth field looks something like this: Max Depth: 12m. Let's start by simply searching for text which is one or more digits followed by m and output that as a field:

  1. Drag a Search component from the Toolbox onto the canvas.
  2. Change its name to Max Depth Value by clicking on its name and typing the new name.
  3. Change the Regular Expression property of the Search component to \d+m in the Properties panel. This expression matches any text which is one or more digits followed by an m.
  4. Drag a Field component from the Toolbox onto the canvas and change its name to Max Depth.
  5. Click the Provide Results/Parameter component in the Toolbox and drag from the Search component to the Field component.
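
To get a feel for what the \d+m expression matches before running it in Extraction Builder, you can try it in any regex tool. The sketch below uses Python's re module as a stand-in; it assumes that Extraction Builder's regex engine treats this simple pattern the same way, and the helper name find_values is made up for illustration.

```python
import re

# The same pattern as the Search component: one or more digits followed by "m".
PATTERN = re.compile(r"\d+m")


def find_values(text):
    """Return every piece of text that matches the pattern (hypothetical helper)."""
    return PATTERN.findall(text)


# Sample text in the form described in this guide.
matches = find_values("Max Depth: 12m")
print(matches)

# A space between the digits and the "m" prevents a match.
no_matches = find_values("Max Depth: 12 m")
print(no_matches)
```

The first call finds 12m; the second finds nothing, because the pattern requires the m to follow the digits directly.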

Save and test

Now we'll save our configuration, upload it to the Aluma platform and test it using the Aluma CLI. If you don't have the CLI installed, pause here and follow the instructions in Working with the Aluma CLI to install it and log in.

  1. Save the extraction configuration by pressing Ctrl+S or the Save icon on the toolbar. A "compiled" version of the configuration is automatically created for you, with the extension .fpxlc. This is the file we will use to create a new custom extractor in the Aluma platform.
  2. Open a terminal or command prompt and create a new extractor called dive-record-custom using this command (changing the path to match your project):
aluma upload extractor dive-record-custom c:\users\mark\source\repos\dive-record-custom\custom-extractor.fpxlc
  3. Test the new extractor on the dive record documents:
aluma extract dive-record-custom examples/dive-records/*.pdf

You should see extracted data results like this (the results may be in a different order):

File                  Max Depth
dive-record-0001.pdf  * 23m, 10m
dive-record-0002.pdf  * 56m, 12m
dive-record-0003.pdf  * 58m, 3m
dive-record-0004.pdf  * 32m, 5m
dive-record-0005.pdf  * 45m, 8m

There's a problem! We've extracted the correct values (10m, 12m etc.) but also the "minutes" values from the Dive Duration field, because \d+m matches those as well. Because multiple values were found for the field, it has also been rejected, as shown by the * before the values.
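
The double match is easy to reproduce outside the platform. In this sketch the values come from the first row of the results above, but the text layout is an assumption (the real Dive Record layout isn't reproduced here), and Python's re module stands in for Extraction Builder's regex engine.

```python
import re

# Assumed layout: a duration in minutes and a depth in metres on the same page.
page_text = """Dive Duration: 23m
Max Depth: 10m"""

# The pattern cannot tell minutes from metres, so both values are found.
values = re.findall(r"\d+m", page_text)
print(values)
```

Both 23m and 10m are returned, which is why the field ends up with two candidate values.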

Let's adjust our configuration to fix that...

Adjust configuration and retest

We'll adjust our configuration so that it finds the "Max Depth" heading and then looks next to this for a value. To do this we'll add another search component and chain it with our existing one.

  1. Drag another Search component from the Toolbox onto the canvas and name it Max Depth Heading.
  2. Change the Regular Expression property of this Search to Max Depth.
  3. Click the Logical Proximity Rule component in the Toolbox and drag from the Max Depth Heading Search to the Max Depth Value Search. This rule specifies that the value is expected to be logically "after" the heading in the document.
  4. Save the configuration (Ctrl+S).
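
The idea behind chaining the two searches can be sketched in plain Python: find the heading first, then look for a value only in the text that follows it. This is a rough analogue under assumptions, not how the Logical Proximity Rule is implemented; the function name and the sample text layout are made up for illustration.

```python
import re


def find_max_depth(text):
    """Return the first digits-plus-m value after the 'Max Depth' heading, or None."""
    heading = re.search(r"Max Depth", text)
    if heading is None:
        return None
    # Only search the text that comes after the heading.
    value = re.search(r"\d+m", text[heading.end():])
    return value.group(0) if value else None


# Assumed layout, using the values from dive-record-0001.pdf above.
page_text = "Dive Duration: 23m\nMax Depth: 10m"
result = find_max_depth(page_text)
print(result)
```

Here the 23m from the duration line is ignored because it appears before the heading, so only 10m is returned.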

Let's test our new configuration. First delete our previous Aluma extractor:

aluma delete extractor dive-record-custom

Now repeat what we did before - create an extractor from the configuration and extract the data from our documents:

aluma upload extractor dive-record-custom c:\users\mark\source\repos\dive-record-custom\custom-extractor.fpxlc
aluma extract dive-record-custom examples/dive-records/*.pdf

Success! You should see extracted data results like this (the results may be in a different order):

File                  Max Depth
dive-record-0001.pdf  10m
dive-record-0002.pdf  12m
dive-record-0003.pdf  3m
dive-record-0004.pdf  5m
dive-record-0005.pdf  8m
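
If you expect to adjust the configuration several times, the delete, upload and extract cycle can be scripted. The sketch below only wraps the three aluma commands shown in this guide; the function names are made up, and it assumes the aluma CLI is on your PATH and that you are already logged in. The document pattern is expanded in Python, as a POSIX shell would do, which assumes the extract command accepts a list of files.

```python
import glob
import subprocess

EXTRACTOR = "dive-record-custom"


def build_commands(config_path, pattern="examples/dive-records/*.pdf"):
    """Build the three aluma commands from this guide as argument lists."""
    # Expand the pattern; fall back to the literal pattern if nothing matches.
    documents = sorted(glob.glob(pattern)) or [pattern]
    return [
        ["aluma", "delete", "extractor", EXTRACTOR],
        ["aluma", "upload", "extractor", EXTRACTOR, config_path],
        ["aluma", "extract", EXTRACTOR] + documents,
    ]


def run_cycle(config_path):
    """Delete the old extractor, upload the new configuration and extract."""
    delete, upload, extract = build_commands(config_path)
    # The delete may fail the first time if the extractor doesn't exist yet.
    subprocess.run(delete, check=False)
    subprocess.run(upload, check=True)
    subprocess.run(extract, check=True)
```

After saving the configuration in Visual Studio, call run_cycle with the path to your own .fpxlc file.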