
Leveraging historical data to train document parsers at scale

By Ståle Zerener
Last updated: 23 Feb 2023

This step-by-step guide is intended for organizations that process large volumes of documents. It walks you through using historical data to train a customized document parser with Cradl AI.

Benefits of using historical data

By historical data, we mean documents that have already been processed as part of an existing workflow and stored in a database.

For example, let's say you're an accounts payable software provider that processes invoices on behalf of your customers. In that case, chances are you can extract those invoices, together with their corresponding values such as the total amount and due date, with a couple of well-crafted queries. Taking advantage of this historical data enables you to:

  • Train a powerful document parser on thousands or even millions of documents
  • Optimize your data extraction models for the layouts you process the most
  • Customize your model to extract all the fields you need

By leveraging historical data, you're able to do all this without annotating a single document by hand.

Preparing your historical data for training

To prepare your data for training, you need to convert it to Cradl AI's training data format: for each document, a corresponding JSON file with values for each field you have defined in your model. The example below illustrates what a correctly formatted training dataset looks like.
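A minimal sketch of such a dataset (the file names are hypothetical): each document is paired with a JSON file of the same name that carries the ground-truth values for the fields in your model:

    dataset/
        invoice-001.pdf
        invoice-001.pdf.json
        invoice-002.pdf
        invoice-002.pdf.json
        ...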

Notice that Cradl does not require you to provide any positional metadata, such as bounding box coordinates, in your training data. This means you can create training data programmatically by dumping records from your database, and effectively train on thousands or even millions of documents without annotating a single one by hand.

Example: Training an invoice parser

In this example, we'll assume you want to train a model for processing invoices, and that you have a PostgreSQL database with a table called invoices whose columns (total_amount, due_date, and so on) hold the extracted values. We'll also assume a column called s3_path which contains the path to the Amazon S3 object where each invoice is stored.
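For concreteness, here's a hypothetical schema for such a table; the column names are assumptions we'll carry through the rest of the example:

    CREATE TABLE invoices (
        id           SERIAL PRIMARY KEY,
        total_amount NUMERIC(12, 2),  -- extracted total amount
        due_date     DATE,            -- extracted due date
        s3_path      TEXT             -- S3 object key, e.g. invoice-001.pdf
    );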

Step 1: Extract invoices from your database

Dump the invoices table, for example with PostgreSQL's psql CLI, and convert the rows to Cradl's training data format using your favorite scripting language. Make sure each JSON file maps your model's field names to their ground-truth values, like this (hypothetical field names and values):
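    {
        "total_amount": "1311.00",
        "due_date": "2023-02-28"
    }

One way to produce these files is to do both the dump and the conversion directly in psql and the shell. A minimal sketch, assuming the schema above and that your connection string is available in DATABASE_URL:

    psql "$DATABASE_URL" -At -F $'\t' -c "
        SELECT s3_path,
               json_build_object('total_amount', total_amount,
                                 'due_date', due_date)
        FROM invoices" |
    while IFS=$'\t' read -r s3_path json; do
        # One ground-truth JSON file per invoice, named after its S3 object key
        echo "$json" > "$(basename "$s3_path").json"
    done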

Step 2: Download your invoices from Amazon S3

The next step is to download the invoices. We'll assume here that they're stored in Amazon S3 in a bucket called my-bucket, and that the object key for each invoice is the same as the name of the corresponding JSON file we just constructed, minus the file extension. We'll download the files to the same folder as the JSON files. A few shell commands do the trick:
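A minimal sketch under those assumptions, deriving each object key from its JSON file's name:

    for f in *.json; do
        key=$(basename "$f" | sed 's/\.json$//')  # object key = file name minus .json
        jq empty "$f" || continue                 # skip any file that isn't valid JSON
        aws s3 cp "s3://my-bucket/$key" "./$key"
    done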

This assumes that you have the AWS CLI, jq and sed installed.

Step 3: Create a new dataset

Now that you have constructed a dataset which can be used to train a model in Cradl, you can upload it either through the UI or by using the Cradl CLI. The Cradl CLI handles multi-threaded uploads and recovers gracefully from interruptions, so it's recommended for larger datasets.

If you haven't installed it yet, you can install it from PyPI using pip:
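    pip install lucidtech-las-cli

With the CLI installed, you're ready to upload your dataset to Cradl AI.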

Step 4: Train your model

You've now created a dataset from historical data and are ready to train. If you haven't done so already, create a new model in Cradl, add the fields you've included in your dataset, and start training.

Start building today

Get started for free, or book a demo to discuss your project.