Case Study: Legal. Supplementing Human Judgement with Machine Learning

In the first post of this three-part series, I examined how machine learning (ML) can supplement the data analysis capabilities of humans by leveraging a computer’s ability to rapidly analyze, sort and curate massive amounts of data. As a real-world example of this hybrid human/machine model, I used the advertising market to illustrate how machine learning and the domain expertise of humans experienced in managing ad campaigns can work together to significantly improve an ad campaign’s performance in near real time. The ability to analyze large amounts of data very quickly allowed advertisers to adjust their campaigns to better target audiences, even while the campaign was in progress. For this post, I’d like to apply the benefits of machine learning to another vertical market: the legal system.

As anyone living in modern society knows, few things generate more forms and documents than the legal system. To illustrate, consider the legal processes involved in immigration law. Foreigners wishing to become U.S. citizens or apply for a visa have to deal with an extensive legal and government bureaucracy, and the primary means of interacting with this bureaucracy is completing various forms and submitting them in hard copy via snail mail. Here at SpringML, we’ve contacted several large law firms specializing in immigration, and they told us that it’s not uncommon for them to receive up to 400 new forms every day from clients and immigration authorities, all containing data they need to capture in databases.

Once received, the data captured in these forms gets added to databases using one of two methods: having paralegals enter the data manually, or scanning the documents and then using optical character recognition (OCR) software to analyze the data and populate it where appropriate in a database. Obviously, the first method is the more time- and resource-intensive of the two, so the preferred method would seem to be scanning the documents and automating the data curation process.

However, the OCR software used in this process can be difficult to maintain, as it needs to be constantly updated to collect data from new or changing forms. Scripts to handle new forms can take IT a few days to develop, and delays like this make using OCR in a fast-paced office environment a challenge. The takeaway here is that the moment a human gets involved, either to enter data or to write a script to automate that work, the entire data management process slows to unacceptable levels.

But with the proper use of ML in this application, the system used in the scanning process can learn to recognize the data captured in the fields of an immigration form and then add it to the appropriate field in a database. For example, say two British nationals are each asked to complete a form that has a field where they need to list their country of origin; one applicant might list “UK” while the other lists “Great Britain.” While a human would quickly realize that both terms refer to the same country and enter the name into the database using the law firm’s preferred nomenclature, a machine doesn’t have the domain expertise to do this on its own. But once a script has been developed that tells the OCR software how to handle different names for the same country, a human no longer has to review the document in the future: ML lets the system automatically populate the database with the right country name, eliminating the bottleneck caused by the need for human review.
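
To make the country example concrete, here is a minimal Python sketch of the normalization step. It is an illustration only: the names PREFERRED_NAME and normalize_country are hypothetical, and the hand-written lookup table simply stands in for the mappings a system would accumulate over time (from a script, or from past human corrections). It is not a description of any particular OCR product.

```python
import re
from typing import Optional

# Hypothetical alias table: every key is a cleaned-up spelling an applicant
# might write, every value is the firm's preferred name for that country.
PREFERRED_NAME = {
    "uk": "United Kingdom",
    "united kingdom": "United Kingdom",
    "great britain": "United Kingdom",
    "britain": "United Kingdom",
    "us": "United States",
    "usa": "United States",
    "united states": "United States",
}


def _key(raw: str) -> str:
    """Lowercase, drop punctuation and digits, and collapse repeated spaces."""
    cleaned = re.sub(r"[^a-z ]", "", raw.lower())
    return " ".join(cleaned.split())


def normalize_country(raw: str) -> Optional[str]:
    """Return the preferred country name, or None if the value is unknown.

    A None result is the signal to route the form to a human reviewer,
    whose answer could then be added to the table.
    """
    return PREFERRED_NAME.get(_key(raw))


if __name__ == "__main__":
    for value in ["UK", "Great Britain", "U.K.", "Atlantis"]:
        print(repr(value), "->", normalize_country(value))
```

Only values the table has never seen fall through to a person, which is the point of the example above: human review becomes the exception rather than a step in every form.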

Better still, if a new immigration form is received, ML can help apply knowledge gained from past experience to identifying and sorting the data on that form. Specifically, if the system can apply ML to identify data by the format it’s presented in, say a birthdate presented in a middle-endian format (Month/Day/Year), it could then apply that knowledge when scanning a new form to help automate adding that date to its corresponding field in the database. And if the date format used in the birthdate field is different from the format the firm uses in its databases, ML could help the machine recognize this and re-order the data in the field to match the format used in the database. In our immigration law firm example, automating the process of standardizing dates in the firm’s database could save significant time, as countries often use different date formats.
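
The date example can be sketched in a few lines of Python as well. This is a simplified, rule-based stand-in for the learned behavior described above, not an actual ML model: it guesses which format a form field uses by checking a handful of candidate formats against sample values from that field, then rewrites each date in the database’s format. The function names, the candidate list and the ISO-style target format are all assumptions made for the illustration.

```python
from datetime import datetime
from typing import List, Sequence

# Assumed candidate formats: middle-endian, little-endian and ISO-style.
CANDIDATE_FORMATS = ("%m/%d/%Y", "%d/%m/%Y", "%Y-%m-%d")


def _parses(value: str, fmt: str) -> bool:
    """Return True if value is a valid date under fmt."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


def infer_formats(samples: Sequence[str]) -> List[str]:
    """Return every candidate format that parses all sample values.

    More than one result means the samples are ambiguous (for example,
    03/04/1975 is valid as both Month/Day and Day/Month), so the field
    should be confirmed by a person before it is automated.
    """
    return [
        fmt for fmt in CANDIDATE_FORMATS
        if all(_parses(sample, fmt) for sample in samples)
    ]


def standardize_date(raw: str, source_format: str,
                     target_format: str = "%Y-%m-%d") -> str:
    """Re-order a date from the form's format into the database's format."""
    return datetime.strptime(raw.strip(), source_format).strftime(target_format)


if __name__ == "__main__":
    birthdates = ["12/25/1980", "03/04/1975"]
    formats = infer_formats(birthdates)
    if len(formats) == 1:
        print([standardize_date(d, formats[0]) for d in birthdates])
```

The ambiguity check matters in practice: a single value such as 03/04/1975 cannot tell you whether a form is Month/Day or Day/Month, so the sketch only commits to a format once the samples rule out the alternatives.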

Next week I’ll publish our final case study illustrating how ML can speed up the data analytics process.