Automated Document Text Extraction for BAL Global




Every organization needs to process physical documents in a timely fashion: invoices to be paid, legal documents that require follow-up, or medical forms. Typically, such documents are processed manually, with people inspecting each document and entering data into a document management system. This process can be time-consuming, tedious, and error-prone.

SpringML’s automated document processing system allows organizations to process physical documents automatically by extracting and saving text from them, so users can take action on them immediately. The solution can be integrated into existing workflows so that approvals and validations are handled promptly. Because the process is automated, additional features such as rule-based notifications, alerts, reminders, and text translation can be configured to give users real-time insights.

The document text extraction and processing system has applications across many industries, leading to workflow automation and process reengineering. The legal industry, which relies heavily on paper documents, could easily benefit from the ease and convenience of such a solution.

BAL Global, one of the world’s largest global corporate immigration law firms, receives many thousands of documents during the normal course of business.  Processing the data from these documents can be intensive manual work, which should be supported by smart systems.

“SpringML’s document processing solution can enable us to augment our staff and processes with Machine Learning. The system delivers the capability to enhance our client service offerings through supplementing quality review processes with smart systems, while at the same time reducing paper processing time. This allows our legal teams to take action faster for our clients. This solution is versatile in that it handles several types of documents. It also provides details on data extraction accuracy, which allows us to be confident of the results.”

—Vince DiMascio, CIO, BAL Global

Here’s a brief overview of the data pipeline involved.


Once documents are scanned, they are uploaded to Google Cloud Storage as PDF files.

Each file corresponds to a specific document type, but a single file may contain more than one document. Because scanning is done manually, image quality is not always consistent, so some pre-processing is needed to enhance the images, for example by sharpening them and removing stray lines. Once the images are ready, Google’s Vision API is called to extract the text. Several text processing rules are then executed to validate the text and log any errors. Results are stored in BigQuery for analysis and reporting.
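
To make the validation step more concrete, here is a minimal Python sketch of how text processing rules might be applied to text that has already been extracted. It is an illustration only, not the production pipeline: the rule names (not_empty, has_date, mostly_readable), the ValidationResult class, and the row layout are all hypothetical, and the sketch assumes the text has already been returned by the text extraction step.

```python
import logging
import re
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence

logger = logging.getLogger("doc_pipeline")  # hypothetical logger name

# A rule takes extracted text and returns an error message, or None if it passes.
Rule = Callable[[str], Optional[str]]


def not_empty(text: str) -> Optional[str]:
    return None if text.strip() else "no text extracted"


def has_date(text: str) -> Optional[str]:
    # Hypothetical rule: expect at least one MM/DD/YYYY date in the document.
    if re.search(r"\b\d{2}/\d{2}/\d{4}\b", text):
        return None
    return "no MM/DD/YYYY date found"


def mostly_readable(text: str, min_ratio: float = 0.8) -> Optional[str]:
    # Flags text dominated by unexpected symbols, a common sign of a poor scan.
    stripped = text.strip()
    if not stripped:
        return None  # empty text is reported by not_empty
    ok = sum(ch.isalnum() or ch.isspace() or ch in ".,:;/-#$()" for ch in stripped)
    ratio = ok / len(stripped)
    if ratio >= min_ratio:
        return None
    return f"low readable-character ratio ({ratio:.2f})"


@dataclass
class ValidationResult:
    document_id: str
    text: str
    errors: List[str] = field(default_factory=list)

    @property
    def is_valid(self) -> bool:
        return not self.errors

    def to_row(self) -> dict:
        # Assumed row shape for loading into a reporting table.
        return {
            "document_id": self.document_id,
            "text": self.text,
            "is_valid": self.is_valid,
            "errors": list(self.errors),
        }


DEFAULT_RULES: Sequence[Rule] = (not_empty, has_date, mostly_readable)


def validate(document_id: str, text: str,
             rules: Sequence[Rule] = DEFAULT_RULES) -> ValidationResult:
    """Run every rule against the text, logging and collecting any errors."""
    result = ValidationResult(document_id, text)
    for rule in rules:
        message = rule(text)
        if message:
            logger.warning("%s: %s", document_id, message)
            result.errors.append(message)
    return result
```

Calling validate("doc-1", extracted_text).to_row() yields a plain dictionary that could then be written to a reporting table. Keeping each rule as a small function makes it easy to add or remove checks per document type.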

This solution brings efficiencies to organizations that handle a large number of documents. Documents of the same type (e.g., invoices) can differ widely in layout, so it’s important that such solutions are able to handle these differences. This solution is built on a framework that makes it possible to handle various invoice layouts as well as other types of documents.
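
One way to think about handling different layouts is a registry that maps each known layout to its own set of field patterns. The Python sketch below illustrates that idea under stated assumptions; it is not a description of SpringML’s framework. The layout names, field names, and regular expressions are all hypothetical.

```python
import re
from typing import Dict, Optional

# Hypothetical layouts: each maps a field name to a regex with one capture group.
LAYOUTS: Dict[str, Dict[str, re.Pattern]] = {
    "invoice_layout_a": {
        "invoice_number": re.compile(r"Invoice\s*#\s*:?\s*(\w+)", re.I),
        "total": re.compile(r"Total\s*:?\s*\$?([\d,]+\.\d{2})", re.I),
    },
    "invoice_layout_b": {
        "invoice_number": re.compile(r"Inv(?:oice)?\s+No\.?\s*:?\s*(\w+)", re.I),
        "total": re.compile(r"Amount\s+Due\s*:?\s*\$?([\d,]+\.\d{2})", re.I),
    },
}


def extract_fields(text: str, layout: str) -> Dict[str, Optional[str]]:
    """Apply one layout's patterns; fields that do not match are set to None."""
    fields: Dict[str, Optional[str]] = {}
    for name, pattern in LAYOUTS[layout].items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    return fields


def best_layout(text: str) -> Optional[str]:
    """Pick the layout whose patterns match the most fields, or None if none match."""
    best_name: Optional[str] = None
    best_hits = 0
    for name in LAYOUTS:
        hits = sum(value is not None for value in extract_fields(text, name).values())
        if hits > best_hits:
            best_name, best_hits = name, hits
    return best_name
```

With this structure, supporting a new invoice layout or a new document type means adding an entry to the registry rather than changing the extraction code.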

If this is of interest and you would like to know more from SpringML, BAL Global, or Google, we invite you to an afternoon of applied machine learning and networking on Dec 19th, where Vince DiMascio, CIO of BAL Global, will be speaking. Click here to find out about other speakers and register.

Dec 19, 3:00-7:00 PM PT, Google campus, Sunnyvale, CA.

Register here.