A Process to Extract the BCA Bank’s E-Statement Data Using Google Document AI

Didik Mulyadi
7 min read · Jun 15, 2024


Photo by Arisa Chattasa on Unsplash

There are many ways to extract data from a PDF file. Google provides Document AI, a service built specifically to extract data from documents. We will try three of its processors: Document OCR, the specialized Bank Statement Parser, and a custom processor.

Google also has a Vertex AI OCR parser that can scan PDF files.

Processor Overview

Overview Document AI

We can create our own custom processor or use an existing one. Clicking the “Explore Processor” button takes you to the Processor Gallery page.

The difference is that you can train the model behind a custom processor, but you can't train the model behind an existing processor.

If your needs are general, an existing processor may be a good fit. If the document has a less common layout, such as the BCA e-statement, go with a custom processor.

In this article, we will try the processor from the gallery and create a custom processor to extract the text from the PDF.

Existing Processor

In this case, we want to extract the text from the bank’s e-statement, so the suitable processors are Document OCR (General) and Bank Statement Parser (Specialized).

Document OCR

We do not expect the result to be accurate, because this is a general-purpose OCR. Let’s create a Document OCR processor and select, for example, the US region.

Document AI — Document OCR — Create Form

After that, you will see the processor's details.

Document AI — Document OCR — Basic Information

Let’s upload the test document and see how the model analyzes it.

BCA E-Statement Analysis Result

As we can see, the model selected every text from the uploaded PDF file. In the sidebar, we can see the list of the selected text. Then click the “Extract JSON” button in the navbar to see the structured data.

JSON Result from the Document OCR Analysis

Result:

  1. The “text” key contains all selected text.
  2. Some of the scanned text is wrong, both in what was selected and in the order it appears.
  3. The “entities” property does not exist, so it’s hard to pull specific values out of the result.
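To make these three observations easier to check, here is a minimal sketch that summarizes a downloaded result. It assumes the JSON file is the document object itself, with “text”, “pages”, and optionally “entities” at the top level. The function name `summarizeDocument` and the sample object are made up for illustration.

```javascript
// Summarize the parts of a Document AI JSON result that matter for extraction.
// Assumption: `doc` is the document object itself (text, pages, entities at the top level).
function summarizeDocument(doc) {
  const entities = Array.isArray(doc.entities) ? doc.entities : [];
  return {
    textLength: (doc.text ?? "").length, // size of the full OCR text
    pageCount: (doc.pages ?? []).length, // number of scanned pages
    hasEntities: entities.length > 0,    // false for the plain Document OCR result
  };
}

// Hypothetical, heavily shortened OCR result: the text is there, entities are not.
const ocrResult = { text: "REKENING TAHAPAN\nSALDO AWAL\n", pages: [{}] };
console.log(summarizeDocument(ocrResult));
```

With a Document OCR result, `hasEntities` comes back as `false`, which is the limitation described in point 3.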

Bank Statement Parser

If we want to try this model, we need to request access from Google, because it's a private program.

Document AI — Bank Statement Parser — Create Form

They redirected me to the Google form.

Request Access Link

Let’s skip this and move to the custom processor.

Custom Processor

Custom processors allow us to train the model until it fits our e-statement format. The extracted values are collected in the “entities” property, so we can easily get the specific data.
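As a rough idea of what "easily get the specific data" can look like, the sketch below groups entity values by field name. It assumes each entity carries a `type` (the field name) and a `mentionText` (the extracted text). The helper name `groupEntitiesByType` and the sample values are hypothetical.

```javascript
// Group the extracted text of each entity under its field name.
// Assumption: every entity has `type` and `mentionText` properties.
function groupEntitiesByType(entities) {
  const grouped = {};
  for (const entity of entities) {
    // Create the array the first time a field name is seen, then append.
    (grouped[entity.type] ??= []).push(entity.mentionText);
  }
  return grouped;
}

// Hypothetical entities, shortened for the example.
const sampleEntities = [
  { type: "account_number", mentionText: "1234567890" },
  { type: "transaction_title", mentionText: "SETORAN TUNAI" },
  { type: "transaction_title", mentionText: "TARIKAN ATM" },
];
console.log(groupEntitiesByType(sampleEntities));
```

Fields that occur once end up as one-element arrays, and repeated fields keep the order in which the entities appear in the response.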

Create Custom Extractor

Document AI — Workbench

Let’s create a “Custom Extractor”

This is an overview of the created custom processor

Document AI — Custom Processor Overview

Define the Fields

Let’s get started and upload the test document

Document AI — Custom Processor Get Started

After the file is uploaded, you will see the same analysis result screen. The difference is that you need to define the fields, so that you get only the data you need.

Document AI — Create Field

When you create the fields, the AI helps by automatically selecting the text. For example, when I add an account_number or transaction_title field, the AI automatically selects the matching area, and if the selection is correct, you need to confirm it.

AI Suggesting the Area for the field

Do not forget to act on every purple selection (AI suggestion): confirm, delete, or modify it.

In my case, “Saldo Awal” is not selected, so I need to “Add instance” and then click “Annotate” to draw the area covering that text.

Draw Missing Area

These are the created fields:

  1. account_information (type: Plain text, occurrence: Required once)
  2. account_number (type: Plain text, occurrence: Required once)
  3. period (type: Datetime, occurrence: Required once)
  4. transaction_date (type: Datetime, occurrence: Optional multiple)
  5. transaction_title (type: Plain text, occurrence: Optional multiple)
  6. transaction_detail (type: Plain text, occurrence: Optional multiple)
  7. transaction_amount (type: Number, occurrence: Optional multiple)
  8. transaction_type (type: Plain text, occurrence: Optional multiple)
  9. current_balance (type: Number, occurrence: Optional multiple)
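The occurrence settings above can also be checked on the client side once the processor returns entities. The following is only a sketch: the schema object mirrors the field list, while the occurrence labels `required_once` and `optional_multiple` and the function name `findOccurrenceProblems` are my own naming, not something Document AI returns.

```javascript
// The nine fields from the list above, with hypothetical occurrence labels.
const FIELD_SCHEMA = {
  account_information: "required_once",
  account_number: "required_once",
  period: "required_once",
  transaction_date: "optional_multiple",
  transaction_title: "optional_multiple",
  transaction_detail: "optional_multiple",
  transaction_amount: "optional_multiple",
  transaction_type: "optional_multiple",
  current_balance: "optional_multiple",
};

// Return a list of human-readable problems; an empty list means the entities fit the schema.
function findOccurrenceProblems(entities, schema = FIELD_SCHEMA) {
  // Count how many times each field name appears.
  const counts = {};
  for (const { type } of entities) {
    counts[type] = (counts[type] ?? 0) + 1;
  }

  const problems = [];
  // "Required once" fields must appear exactly one time.
  for (const [field, occurrence] of Object.entries(schema)) {
    const count = counts[field] ?? 0;
    if (occurrence === "required_once" && count !== 1) {
      problems.push(`${field}: expected exactly 1, found ${count}`);
    }
  }
  // Flag entity types that are not part of the schema at all.
  for (const type of Object.keys(counts)) {
    if (!(type in schema)) {
      problems.push(`${type}: not defined in the schema`);
    }
  }
  return problems;
}
```

A check like this can help spot pages where the labeling missed a required field before you spend time on training.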

Make sure that all the text you want to extract is labeled on every page. If you only label page 1, the result will not be accurate.

After you have added all the fields, save them by clicking the “MARK AS LABELED” button. You will then see a summary of the fields.

Document AI — After Add The Fields

Let’s go to the next step

Build and Evaluate the Model

Document AI — Build Page — Import Document

Let’s import the document for training and testing. I use the same document for both. Then, if the “start labeling” button is still active, continue labeling until the whole dataset is covered.

Document AI — Build Page — Finish Labeling

After you finish labeling, the “Unlabeled” count should be 0. Then click “Create New Version” in the “Call foundation model” section and name the version “version-1-0-0”. A snackbar notification will appear at the bottom of the page.

Bottom Snackbar “Creating Version version-1-0-0”

After that, you will see a notification that the version is finished.

finish notification

Let’s move on to evaluating the version.

Deploy & Use

Make sure the version is already deployed. If it is not, click the deploy button.

Then let’s move to the evaluate page to test it.

Evaluate and Test

Select your processor version, and run a new evaluation.

It will show you the details of the version.

Then you can test it again by uploading the test document to see the result. If the result is not good enough, you can add more documents or labels in the “Improve your processor” section.

Build — Fine Tuning and Train

We should improve the model to increase its accuracy, which depends on how much training data and how many labels you use. Fine-tuning or training comes with some requirements, so let's look at them.

Tuning Requirements

We need at least 10 documents for training and testing, so you have to prepare them. After you upload them, you need to label each document again.

Evaluation

Labeling this way takes a lot of time and the result is still not accurate, so I decided to try another selection method:

  • Select the entire transaction row instead of each piece of text, then process each row to get the transaction date, title, detail, amount, type, and balance.

Try Different Selection Methods

Follow the same flow until you have created and deployed the version.

The difference is in the fields. This time we only use two fields: bank_name (type: Plain text, occurrence: Optional multiple) and transaction_row (type: Plain text, occurrence: Optional multiple).

That’s simple.

Deploy the Version

After the version is created, it will be shown in the deployment list, and you need to deploy it.

Consume the Version

After the version is deployed, you can send a request to that version. There is a sample request button to see how to interact with the version.

Copy the “Prediction Endpoint” from the version's details.

Next, I tried it with Postman.

The response was a JSON file.
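The same request can be sent from code instead of Postman. The sketch below assumes Node.js 18 or later (for the built-in `fetch`), that `predictionEndpoint` is the URL copied from the version's details, and that `accessToken` is an OAuth 2.0 access token, for example from `gcloud auth print-access-token`. The function names are hypothetical. The request body follows the `rawDocument` shape of the Document AI REST API, with the PDF encoded as base64.

```javascript
// Build the fetch options for a Document AI process request.
// `pdfBytes` is the content of the e-statement PDF as a Buffer.
function buildProcessRequest(predictionEndpoint, pdfBytes, accessToken) {
  const body = {
    rawDocument: {
      content: Buffer.from(pdfBytes).toString("base64"), // the API expects base64
      mimeType: "application/pdf",
    },
  };
  return {
    url: predictionEndpoint,
    options: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${accessToken}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify(body),
    },
  };
}

// Send the request and return the parsed JSON response.
async function processStatement(predictionEndpoint, pdfBytes, accessToken) {
  const { url, options } = buildProcessRequest(predictionEndpoint, pdfBytes, accessToken);
  const response = await fetch(url, options);
  if (!response.ok) {
    throw new Error(`Document AI request failed with status ${response.status}`);
  }
  return response.json();
}
```

You could call it with something like `processStatement(endpoint, fs.readFileSync("statement.pdf"), token)`, where the file name is only a placeholder.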

Extract the Transaction From The JSON

The fields that we created earlier are shown in the “entities” property.

Document AI — Custom Processor — JSON Structure

These are the samples of the entities' values.

bank_name field
transactions field

Now I want an array of transaction strings, so I process the JSON with this code:

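// Paste the JSON returned by the prediction endpoint here.
const documentAIResponse = { document: { entities: [] } };
// Keep the row entities (the type must match the field name defined in the processor)
// and flatten each one to a single line.
const transactions = (documentAIResponse.document?.entities ?? [])
  .filter((e) => e.type === "transactions")
  .map((e) => e.mentionText.replaceAll("\n", " "));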
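Each string in `transactions` still has to be split into date, description, amount, type, and balance. The sketch below shows one way to do that with a regular expression. The row layout is an assumption on my side, not taken from a real statement: a `DD/MM` date, a free-text description, an amount such as `50,000.00`, an optional `DB` marker for debits, and an optional balance. The `CR` label for rows without a marker and the function names are also hypothetical, so adjust the pattern to what your own rows look like.

```javascript
// Assumed row layout: "<DD/MM> <description> <amount> [DB] [<balance>]"
const ROW_PATTERN =
  /^(\d{2}\/\d{2})\s+(.+?)\s+([\d,]+\.\d{2})(?:\s+(DB))?(?:\s+([\d,]+\.\d{2}))?$/;

// Convert "1,234,567.00" to the number 1234567.
function parseAmount(text) {
  return Number(text.replaceAll(",", ""));
}

// Split one flattened row into fields; returns null when the row does not match.
function parseTransactionRow(row) {
  const match = ROW_PATTERN.exec(row.trim());
  if (!match) return null;
  const [, date, description, amount, debitMarker, balance] = match;
  return {
    date,
    description,
    amount: parseAmount(amount),
    type: debitMarker ? "DB" : "CR", // "CR" is a hypothetical label for non-debit rows
    balance: balance ? parseAmount(balance) : null,
  };
}

// Hypothetical row, written to match the assumed layout.
console.log(parseTransactionRow("02/05 SETORAN TUNAI 100,000.00 1,100,000.00"));
```

Rows that return `null` are worth logging, because they usually point to an OCR mistake or a layout the pattern does not cover yet.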

Conclusion

We have succeeded in extracting the data from the BCA e-statement. In the next article, we will classify the transactions into defined categories, e.g. Transfer, Food, Clothes, etc.

Thank you for reading my article!
Reach me on LinkedIn: https://www.linkedin.com/in/didikmulyadi
