Extract data with a custom extractor
About this guide
In this guide you'll install the Extraction Builder Visual Studio extension and use it to build a simple custom extractor that will capture the "Maximum Depth" field from the Dive Record documents used in the previous guides.
Custom extractors are useful when there isn't an extractor module that does exactly what you need. With Extraction Builder you have access to a powerful toolkit of data extraction techniques and tools that you can easily piece together in simple or more complex ways to capture just what you need.
If you don't already have Visual Studio then working through this guide should take about 30 minutes. If you do have it then it should take you about 15 minutes.
Before you begin - install Visual Studio 2019
Extraction Builder is a Visual Studio 2019 extension, so you'll need a copy of Visual Studio 2019. If you don't already have one you can install the free Visual Studio Community edition. It only takes a few minutes to install.
Select the .NET desktop development workload, and include the Entity Framework 6 Tools component (you can deselect the other optional components if you wish).
Once you've installed Visual Studio, continue with the rest of this guide.
Install the Extraction Builder extension
- Close Visual Studio if it is open.
- Download the Aluma Extraction Builder installer.
- Double-click the installer (.vsix) file to run it and follow the steps presented to you.
- Open Visual Studio. You may be prompted to Sign in, but you don't need to - just click Not now, maybe later at the bottom of the dialog.
Create a custom extractor project
- Open Microsoft Visual Studio.
- Create a new project from the Empty Project (.NET Framework) project template (it doesn't matter whether you pick the C# or VB version) and call it
- Once the project has been created, go to the Solution Explorer window, right-click on the project (not the solution) and select Add | New Item...
- Add an Extraction item from the list (it will probably be at the bottom) and call it
custom-extractor.fpxl. This is your empty extraction configuration.
Using Visual Studio 2017
Visual Studio 2017 doesn't have the "Empty Project" template, so use the "Microsoft C# | Console App" template as it is one of the simplest. In the Solution Explorer, you can delete the Program.cs and App.config files as these are not necessary.
Find your way round the Extraction Builder
The Extraction Builder has three panels which you'll see when an extraction configuration is open. In a vanilla Visual Studio install these will be laid out like the screenshot below, but you can move them around if you want to:
- The Toolbox panel on the left contains all the components you can use to build custom extractor configurations.
- The Canvas panel in the middle holds the components that you've added to your configuration.
- The Properties panel in the bottom right is where you can edit the various properties associated with each component in your configuration.
You can also open the Extraction Explorer panel in the top right of the window. This shows a more structured view of the contents of your configuration and is especially useful in large configurations.
Create an extraction configuration
Now we'll create a very simple configuration that captures the "Maximum Depth" field from the Dive Record example documents we used in previous guides.
The Maximum Depth field looks something like this:
Max Depth: 12m
Let's start by simply searching for text that is one or more digits followed by m, and output that as a field:
- Drag a Search component from the Toolbox onto the canvas.
- Change its name to Max Depth Value by clicking on its name and typing the new name.
- Change the Regular Expression property of the Search component to \d+m in the Properties panel. This expression matches any text which is one or more digits followed by an m.
- Drag a Field component from the Toolbox onto the canvas and change its name to "Max Depth".
- Click the Provide Results/Parameter component in the Toolbox and drag from the Search component to the Field component.
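To see what the expression from the steps above matches, here is a minimal Python sketch. It is not part of Extraction Builder; it assumes the Search component's regular expression dialect treats this simple pattern the same way Python's re module does, and the sample text is the field example shown earlier in this guide.

```python
import re

# Same expression as the Search component's Regular Expression property.
pattern = re.compile(r"\d+m")

# Sample text taken from the field example in this guide.
line = "Max Depth: 12m"

# Find the first run of digits that is immediately followed by "m".
match = pattern.search(line)
value = match.group(0) if match else None
print(value)  # 12m
```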
Save and test
Now we'll save our configuration, upload it to the Aluma platform and test it using the Aluma CLI. If you don't have the CLI installed, pause here and follow the instructions in Working with the Aluma CLI to install it and log in.
- Save the extraction configuration by pressing Ctrl+S or the Save icon on the toolbar. A "compiled" version of the configuration is automatically created for you, with the extension
.fpxlc. This is the file we will use to create a new custom extractor in the Aluma platform.
- Open a terminal or command prompt and create a new extractor called dive-record-custom using this command (changing the path to point to your project):
aluma upload extractor dive-record-custom c:\users\mark\source\repos\dive-record-custom\custom-extractor.fpxlc
- Test the new extractor on the dive record documents:
aluma extract dive-record-custom examples/dive-records/*.pdf
You should see extracted data results like this (the results may be in a different order):
|dive-record-0001.pdf||* 23m, 10m|
|dive-record-0002.pdf||* 56m, 12m|
|dive-record-0003.pdf||* 58m, 3m|
|dive-record-0004.pdf||* 32m, 5m|
|dive-record-0005.pdf||* 45m, 8m|
There's a problem! We've extracted the correct values (12m etc.) but also the "minutes" from the Dive Duration field. Because multiple values were found for the field, it has also been rejected, as shown by the * before the values.
Let's adjust our configuration to fix that...
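The double match can be reproduced outside the platform with a short Python sketch. The page text below is a hypothetical stand-in for a dive record (the real document layout is an assumption), and the rejection check only mimics the "more than one value" behaviour described above; it is not the platform's actual logic.

```python
import re

pattern = re.compile(r"\d+m")

# Hypothetical page text: a duration in minutes and a depth in metres.
page_text = "Dive Duration: 56m\nMax Depth: 12m"

# Every "digits followed by m" run matches, whatever the unit means.
matches = pattern.findall(page_text)

# Mimic the behaviour described above: more than one value means rejection.
rejected = len(matches) > 1

print(matches, rejected)  # ['56m', '12m'] True
```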
Adjust configuration and retest
We'll adjust our configuration so that it finds the "Max Depth" heading and then looks next to this for a value. To do this we'll add another search component and chain it with our existing one.
- Drag another Search component from the Toolbox onto the canvas and name it
Max Depth Heading.
- Change the Regular Expression property of this Search to
- Click the Logical Proximity Rule component in the Toolbox and drag from the Max Depth Heading Search to the Max Depth Value Search. This rule specifies that the value is expected to be logically "after" the heading in the document.
- Save the configuration (Ctrl+S).
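The idea of chaining the two searches can be approximated in Python as "find the heading, then look for the value only after it". This is a rough sketch, not how the Logical Proximity Rule is implemented. The heading expression used here is a hypothetical placeholder rather than the one from the step above, and the sample text is invented for illustration.

```python
import re

# Hypothetical heading expression, chosen only for this illustration.
HEADING = re.compile(r"Max Depth")
# Same value expression as the Max Depth Value search.
VALUE = re.compile(r"\d+m")


def max_depth(text):
    """Return the first value that appears after the heading, or None."""
    heading = HEADING.search(text)
    if heading is None:
        return None
    # Start the value search where the heading ends, so earlier
    # matches (such as the dive duration) are ignored.
    value = VALUE.search(text, heading.end())
    return value.group(0) if value else None


print(max_depth("Dive Duration: 56m\nMax Depth: 12m"))  # 12m
```

Searching from the end of the heading is a plain text-order simplification; the real rule works on the logical position of items in the document, which may differ from character order.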
Let's test our new configuration. First delete our previous Aluma extractor:
aluma delete extractor dive-record-custom
Now repeat what we did before - create an extractor from the configuration and extract the data from our documents:
aluma upload extractor dive-record-custom c:\users\mark\source\repos\dive-record-custom\custom-extractor.fpxlc
aluma extract dive-record-custom examples/dive-records/*.pdf
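If you expect to repeat this delete, upload and extract cycle several times, it can be wrapped in a small script. The sketch below uses only the three aluma commands shown in this guide; the function names are hypothetical, and it assumes each command exits with a non-zero status on failure.

```python
import glob
import subprocess

EXTRACTOR = "dive-record-custom"


def build_commands(config_path, documents_pattern):
    """Build the delete, upload and extract commands used in this guide."""
    # Expand the wildcard here because no shell is involved; if nothing
    # matches, pass the pattern through unchanged.
    documents = sorted(glob.glob(documents_pattern)) or [documents_pattern]
    return [
        ["aluma", "delete", "extractor", EXTRACTOR],
        ["aluma", "upload", "extractor", EXTRACTOR, config_path],
        ["aluma", "extract", EXTRACTOR, *documents],
    ]


def retest(config_path, documents_pattern):
    """Run the three commands in order, stopping at the first failure."""
    for command in build_commands(config_path, documents_pattern):
        # Assumption: the CLI returns a non-zero exit code on error.
        subprocess.run(command, check=True)
```

Calling retest with the path to your compiled .fpxlc file and examples/dive-records/*.pdf runs the same sequence as the commands above.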
Success! You should see extracted data results like this (the results may be in a different order):