Test a custom extractor

About this guide

In this guide we'll install Document Studio, use it to test a custom extractor, and learn how to use it together with Extraction Builder to quickly make and test changes to your extractor.

In the Extract data with a custom extractor guide we used the Aluma CLI to test the extractor. Document Studio is a desktop tool that enables you to import a set of documents, run your extractor on them, view the results alongside the document images and debug your extractor.

Working through this guide should take about 20 minutes.

Before you begin

If you have not worked through the Extract data with a custom extractor guide you should do that first.

Document Studio is a tool that you'll install locally. Your computer must meet the following minimum specifications:

  • Windows 7 or Windows 10
  • Dual core CPU
  • 8 GB RAM

You must have Administrator permissions

To install Document Studio, you will need Windows Administrator permissions sufficient to install applications and services. If you don't have these, Windows will tell you when you try to install.

Install Document Studio

  1. Download the latest version of Document Studio and unzip the package.
  2. Windows 7 only: Browse to the "Prerequisites" folder and install Microsoft .NET Framework 4.6.2 and the Microsoft Visual C++ 2015 Redistributable.
  3. Browse to the "Licensing" folder.
  4. Run Install_AlumaDesktopLicenseManager.bat to install the local tools license service. If Windows Defender displays a "Windows protected your PC" message, click "More info" and then "Run anyway". On some computers the installation takes a few minutes to complete.

If the settings of your computer prevent the execution of batch files, you may need to open a Windows command prompt as Administrator and run AlumaDesktopLicenseManager.exe -i

  5. Run AlumaDesktop_Eval_60days_20k_credits.exe to install a 60 day/20,000 credits trial license for Document Studio. If Windows Defender displays a "Windows protected your PC" message, click "More info" and then "Run anyway". Then click the "Apply Update" button in the dialog that is displayed (the "License File" section will be empty).

  6. Browse to the "bin" folder and double-click DocumentStudio.exe to open Document Studio. You may wish to create a shortcut to this file and add it to your desktop, start menu or taskbar.

Import the example documents

Now that you've installed Document Studio, let's import the documents you used in the Extract data with a custom extractor guide:

  1. From the File menu, select Import Folder.
  2. Browse to wherever you saved the example documents and select the "dive-records" folder.
  3. When the Import Documents dialog is displayed, click Import without changing any settings.

The documents are imported into Document Studio and are listed in the documents panel by ID (import order). Clicking on a document's row in the panel displays the document in the viewer.

You can see the file path for a selected document either in the status bar at the bottom of the window, or by enabling the File column in the documents panel: right-click in the top-left corner of the grid to open the context menu, and select File in the list of visible columns.
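If you want to cross-check the documents panel against the folder on disk, a short script can list the files you would expect to see. The following is a minimal Python sketch and is not part of Document Studio. It assumes the folder is named "dive-records" and sits in the current directory, and it only looks at files directly inside that folder (whether Import Folder also reads sub-folders is not covered in this guide). The function name list_documents is made up for this example.

```python
from pathlib import Path


def list_documents(folder):
    """Return the files directly inside `folder`, sorted by name.

    Sub-folders are ignored; this is an assumption of the sketch,
    not a statement about how Import Folder behaves.
    """
    return sorted(p for p in Path(folder).iterdir() if p.is_file())


if __name__ == "__main__":
    # Assumed location of the example documents; adjust to wherever you saved them.
    folder = Path("dive-records")
    if folder.is_dir():
        docs = list_documents(folder)
        for doc in docs:
            print(doc.name)
        print(f"{len(docs)} file(s) found")
    else:
        print(f"Folder not found: {folder}")
```

Comparing the file count printed here with the number of rows in the documents panel is a quick way to spot a document that was not imported.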

Check that the documents have content

Testing extractors in Document Studio relies on your documents already having some content. For scanned documents this means they must already have been read and imported as PDFs that include the content.

To verify that the documents have content, from the Document menu select Text View (or press Ctrl + T).

The document viewer switches from the document's image to its text. You can restore the image view by de-selecting the Text View menu or pressing Ctrl + T again.
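For a larger set of documents, checking each one in Text View can be slow. The following is a rough Python sketch of a batch check, under several assumptions: the documents are PDFs, the third-party pypdf package is installed, and "has content" is taken to mean "at least one page yields some non-whitespace text". The function names are made up for this example, and the result may not match exactly what Document Studio shows in Text View.

```python
from pathlib import Path


def has_content(page_texts, min_chars=1):
    """Return True if any page has at least `min_chars` non-whitespace characters."""
    return any(len("".join(text.split())) >= min_chars for text in page_texts)


def pdf_page_texts(pdf_path):
    """Return the extracted text of each page of a PDF.

    Uses pypdf (third-party, assumed to be installed). The import is done
    here so that has_content() can be used without pypdf.
    """
    from pypdf import PdfReader

    reader = PdfReader(str(pdf_path))
    return [page.extract_text() or "" for page in reader.pages]


def report_folder(folder):
    """Map each PDF file name in `folder` to True/False for 'has text content'."""
    return {
        pdf.name: has_content(pdf_page_texts(pdf))
        for pdf in sorted(Path(folder).glob("*.pdf"))
    }


if __name__ == "__main__":
    # Assumed folder name; adjust to wherever you saved the example documents.
    folder = Path("dive-records")
    if folder.is_dir():
        for name, ok in report_folder(folder).items():
            print(f"{name}: {'has text' if ok else 'NO TEXT FOUND'}")
```

Any document reported without text is worth opening in Text View to confirm before you run the extractor on it.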

Test the extractor on the documents

Now let's load the extractor you created in Extraction Builder in the Extract data with a custom extractor guide:

  1. From the Extraction menu, select Load Extraction Configuration.
  2. Browse to your Visual Studio project for extraction, select the extractor's .fpxlc file, then click Open.

The extractor is now loaded and we can test the extractor on the documents we've imported.

To test the extractor, from the Extraction menu, select Run All.

The extraction results are displayed in the documents panel, with a column for each field in the extractor. You may need to scroll the documents panel to the right to see the results. Alternatively, you can move the splitter between the documents panel and the document viewer to give the panel more of the screen.

Note

Once you have loaded an extractor into Document Studio, it is re-loaded automatically any time it changes. If you also have the extractor open in Extraction Builder, just save it (Ctrl+S), then return to Document Studio and run another test without re-loading it manually.

You can clear the extraction results from the panel by selecting Clear All Results from the Extraction menu.

Customise the columns in the documents panel

The documents panel can display a variety of different information for each document:

  • Document properties (filename, number of pages, number of words)
  • Extraction results
  • Properties helpful when building or optimising classifiers (document type, status, best guess etc.)

When using Document Studio to test extractors, it's often helpful to hide the default columns related to classifiers (Document Studio's other main use).

To customise the columns that are displayed, right-click in the top-left cell of the documents panel and select or de-select from the list of available columns that is displayed.