Processing Fraud Claims

Problem:

Unemployment fraud is on the rise as more people file for unemployment benefits.

General Solution:

With an overwhelming number of claims to review, public agencies are turning to machine learning to identify the fraudulent ones. That way, they can direct payments to the legitimate claimants who actually need them.

1. Look for patterns of fraud in the dataset
2. Turn them into features that we can use for the machine learning model
3. Use the probability score from the model output to assign each claim a fraud score

Patterns of Fraud

Here are some examples of fraud patterns that we encountered:

Invalid phone numbers: 123-123-1234
Invalid Social Security numbers: 999-99-999
Out-of-country IP addresses: Nigeria
Digits or special characters in names: John Smith999
Duplicate email and phone number
Disposable email addresses: yopmail or protonmail
The same static IP address used on multiple claims
Nonlocal bank accounts
Accounts that go inactive, sometimes after only one day
No record of employment or wage history

As you can see, there are many ways that fraudsters are trying to game the system to receive unemployment benefits.
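Some of these patterns, such as duplicate contact details or a shared IP address, only show up when claims are compared against each other. The sketch below shows one way to flag them. The record layout, field names, and sample values are hypothetical, and treating email and phone as a pair is an assumption made for illustration.

```python
from collections import Counter

# Hypothetical claim records; field names and values are made up for illustration.
claims = [
    {"claim_id": 1, "email": "a@example.com", "phone": "555-201-3344", "ip": "203.0.113.7"},
    {"claim_id": 2, "email": "a@example.com", "phone": "555-201-3344", "ip": "203.0.113.7"},
    {"claim_id": 3, "email": "b@example.com", "phone": "555-867-5309", "ip": "198.51.100.4"},
]

def flag_repeats(claims, keys):
    """Return {claim_id: 0 or 1}; 1 means the same values for `keys` appear on more than one claim."""
    counts = Counter(tuple(c[k] for k in keys) for c in claims)
    return {c["claim_id"]: int(counts[tuple(c[k] for k in keys)] > 1) for c in claims}

# Assumption: email and phone are compared together as one combination.
duplicate_contact = flag_repeats(claims, ["email", "phone"])
shared_ip = flag_repeats(claims, ["ip"])

print(duplicate_contact)
print(shared_ip)
```

Claims 1 and 2 share both contact details and an IP address, so both get a 1 for each flag, while claim 3 gets a 0.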

Convert Features for Machine Learning

Many of these features can be converted into 0s and 1s. For example, a valid social security number can be a 0 and a fraudulent email can be a 1. We chose 1 to represent fraud because that is the class we want the model to predict.
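As a rough illustration of this conversion, the sketch below turns a few of the patterns listed earlier into 0/1 flags. The function names, the claim fields, and the validation rules are assumptions for this example, not the rules used in production. The phone check in particular is a crude heuristic, and the SSN check relies only on the format and on number ranges the Social Security Administration does not issue.

```python
import re

# Assumed list of disposable domains, based on the examples named earlier.
DISPOSABLE_DOMAINS = {"yopmail.com", "protonmail.com"}

def invalid_phone(phone):
    """1 if the number is not 10 digits or repeats its first 3-digit block (e.g. 123-123-1234)."""
    digits = re.sub(r"\D", "", phone)
    if len(digits) != 10:
        return 1
    return int(digits[:3] == digits[3:6])

def invalid_ssn(ssn):
    """1 if the SSN is not in NNN-NN-NNNN form or falls in a range that is never issued."""
    match = re.fullmatch(r"(\d{3})-(\d{2})-(\d{4})", ssn)
    if not match:
        return 1
    area, group, serial = match.groups()
    never_issued = area in ("000", "666") or area >= "900" or group == "00" or serial == "0000"
    return int(never_issued)

def name_has_digits(name):
    """1 if the name contains any digit."""
    return int(any(ch.isdigit() for ch in name))

def disposable_email(email):
    """1 if the email domain is on the assumed disposable list."""
    domain = email.rsplit("@", 1)[-1].lower()
    return int(domain in DISPOSABLE_DOMAINS)

def to_features(claim):
    """Convert one claim (a dict with hypothetical field names) into binary flags."""
    return {
        "invalid_phone": invalid_phone(claim["phone"]),
        "invalid_ssn": invalid_ssn(claim["ssn"]),
        "name_has_digits": name_has_digits(claim["name"]),
        "disposable_email": disposable_email(claim["email"]),
    }

suspicious = {"name": "John Smith999", "phone": "123-123-1234",
              "ssn": "999-99-999", "email": "jsmith@yopmail.com"}
print(to_features(suspicious))
```

Every flag comes back as 1 for this made-up claim, since each field matches one of the patterns above.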

Other signals, such as the last time someone logged in, can also be used as predictive features for the model. We noticed that someone who is truly filing for unemployment tends to log in frequently to fill out paperwork and check on the claim status.
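Login activity can be turned into numeric features in a similar way. The sketch below computes two possible features: days since the last login and the number of logins in a recent window. The function names, the 30-day window, and the dates are all hypothetical.

```python
from datetime import date

def days_since_last_login(login_dates, as_of):
    """Whole days between the most recent login and the scoring date."""
    return (as_of - max(login_dates)).days

def logins_in_window(login_dates, as_of, window_days=30):
    """Count logins that fall within the last `window_days` days before `as_of`."""
    return sum(1 for d in login_dates if 0 <= (as_of - d).days < window_days)

# Made-up login history for one claimant.
logins = [date(2021, 2, 1), date(2021, 2, 20), date(2021, 3, 1)]
scoring_date = date(2021, 3, 15)

print(days_since_last_login(logins, scoring_date))  # 14
print(logins_in_window(logins, scoring_date))       # 2
```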

We have a hypothetical example of a claimant:

And here is how that information would look to a machine learning algorithm:

Trained Model

Once the model is trained on this classification problem, we can use its raw confidence score to show how likely a claim is to be fraudulent.
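To make the idea concrete, here is a minimal sketch of a classifier whose output is read as a fraud probability. It uses a small logistic regression written in plain Python and trained with gradient descent. This is a stand-in chosen for illustration, not necessarily the model we used, and the training rows and labels are toy data invented for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(rows, labels, lr=0.5, epochs=2000):
    """Fit logistic regression weights with batch gradient descent."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            error = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias

def fraud_probability(x, weights, bias):
    """Model output between 0 and 1, used here as the fraud score."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

# Toy data (invented): columns are invalid_phone, invalid_ssn, disposable_email.
rows = [[0, 0, 0], [0, 0, 0], [1, 0, 0], [0, 0, 1],
        [0, 1, 1], [1, 1, 0], [1, 0, 1], [1, 1, 1]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]  # 1 = fraud

weights, bias = train_logistic(rows, labels)
print(fraud_probability([0, 0, 0], weights, bias))  # low score: no flags raised
print(fraud_probability([1, 1, 1], weights, bias))  # high score: every flag raised
```

In practice a library model would replace the hand-written training loop, but the scoring step is the same: the probability of the fraud class becomes the claim's fraud score.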

Below is a hypothetical example of a trained model scoring fraud:

As you can see, a misspelled last name in the first row raises a fraud flag, and each additional flag in the second row pushes the fraud score higher. In the last example, only 2 flags are raised, but the fraud score is still high because a fraudulent email and phone number combination is weighted more heavily than the other features in this hypothetical example.
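The effect of uneven weights can be sketched with a hand-set logistic score. The flag names, weights, and bias below are hypothetical values picked to mirror the three rows described above. They are not learned from real data.

```python
import math

# Hypothetical weights: the email and phone flags count for more than the others.
WEIGHTS = {
    "misspelled_last_name": 0.8,
    "invalid_ssn": 1.0,
    "out_of_country_ip": 1.0,
    "fraudulent_email": 2.5,
    "fraudulent_phone": 2.5,
}
BIAS = -3.0  # assumed baseline so that a claim with no flags scores low

def fraud_score(raised_flags):
    """Map a set of raised flags to a score between 0 and 1."""
    z = BIAS + sum(WEIGHTS[flag] for flag in raised_flags)
    return 1.0 / (1.0 + math.exp(-z))

row_1 = fraud_score({"misspelled_last_name"})
row_2 = fraud_score({"misspelled_last_name", "invalid_ssn", "out_of_country_ip"})
row_3 = fraud_score({"fraudulent_email", "fraudulent_phone"})

for label, score in [("one flag", row_1), ("three flags", row_2), ("email + phone", row_3)]:
    print(f"{label}: {score:.2f}")
```

With these assumed weights, the two-flag claim outscores the three-flag claim, which is the behavior described in the last row.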

One of the great tools in GCP is BigQuery, which can pull the stored data and connect it to a Data Studio or Looker dashboard.

If you are experiencing similar issues with fraud claims, we would love to help analyze them and build an automated system that saves you time and money in identifying the fraudulent ones.