Einstein Discovery: Diagnosing and Treating Domain Bias in Training Data

Einstein Discovery combines statistical analysis and supervised learning with your business’s data, providing automated exploratory data analysis and model tuning. This can streamline large parts of a business’s predictive modeling process.

Einstein Discovery is one of a handful of powerful tools that empower business teams to implement predictive modeling use cases without a dedicated data science team. In this post we outline how to recognize biased training data, since Einstein Discovery models need to strike an appropriate balance between bias and variance. Clean data also yields a stronger model, with better accuracy, insights, and predictions. Before implementing any predictive model, it is imperative to ask yourself the following question:

“Is my training data and/or are my variables biased? If so, what can I do to remove this bias?” We will attempt to identify and diagnose bias within Einstein Discovery in the following use case: Opportunity Scoring.

Opportunity Scoring, a binary logistic regression use case, is one of the models for which Einstein Discovery can be leveraged effectively. It plays a pivotal role in helping sales teams monitor and sustain sales growth. Accurately estimating the probability that a sales opportunity will succeed helps sales resources focus their energy on high-leverage opportunities, while the remaining opportunities are addressed with domain knowledge (decisions not influenced by the model). Below, we attempt to solve the bias issues found in our training dataset, summarized in the following problem statements.

Problem Statements:

  1. Biased Sample: the response factor in the training dataset is imbalanced, with an abundance of “Closed/Won” Opportunities compared to “Closed/Lost” Opportunities.
  2. Biased Variable (the strongest predictor): Opportunity Age, the number of days it takes a sales opportunity to close. This variable interacts in unexpected ways with the imbalanced response variable.

The imbalance of “Closed/Won” to “Closed/Lost” opportunities skews the coefficient estimates of a classification model toward the majority class. Identifying this imbalance allows us to treat (or deliberately not treat) domain bias after clarifying with the applicable business units.
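
Before any treatment, a quick count of the response classes makes the imbalance explicit. The sketch below is a minimal, hypothetical Python/pandas example; the CSV file and column names (“StageName”, “OpportunityAge”) are illustrative assumptions, not actual fields from the org.

import pandas as pd

# Hypothetical export of the Opportunity training data; file name and
# column names are assumptions for illustration.
opps = pd.read_csv("opportunity_training_data.csv")

# Keep only closed opportunities -- open ones have no outcome yet.
closed = opps[opps["StageName"].isin(["Closed Won", "Closed Lost"])]

# Absolute counts and class proportions of the response variable.
counts = closed["StageName"].value_counts()
print(counts)
print((counts / counts.sum()).round(3))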

Biased training dataset with an abundance of “Closed/Won” Opportunities

  • Closed/Won: An opportunity is considered “Closed/Won” when a deal is finalized and the customer agrees to purchase the product. This typically means a potential lead has been converted into a revenue-generating customer, in most cases with a formalized contract.
  • Closed/Lost: An opportunity is considered “Closed/Lost” when the deal falls through and the parties involved do not agree to purchase the respective product or service.
  • Open Opportunities: Opportunities that sales teams are currently pursuing. These are in the open stages of the sales cycle, yet to be marked “Closed/Won” or “Closed/Lost”.

There is a clear factor imbalance. This could be for a number of reasons: Salesforce partners that have recently migrated to Salesforce have little historical data, the client’s historical opportunity data may have been migrated incorrectly, or sales teams still acclimating to Salesforce may be failing to capture all of their data. In our case we observed the latter phenomenon in the data capture process: sales representatives were only logging surefire wins that they already knew would close within the next day or two. As a result, the training data contains far more “Closed/Won” opportunities than “Closed/Lost” opportunities.

Excluding Opportunities that are Adding Bias to our Model (Undersampling):

One way to understand the factor imbalance is to perform exploratory data analysis (EDA). Opportunity Age is a good example of a variable in our training dataset that interacts strongly with the response variable. We noticed that the distribution of Opportunity Age among “Closed/Won” Opportunities was heavily skewed; many of these opportunities closed in under a week, pointing to the interaction described in the second problem statement.
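
Continuing the illustrative sketch above (same assumed column names), summarizing Opportunity Age by outcome makes the skew visible:

# Distribution of Opportunity Age for each outcome.
print(closed.groupby("StageName")["OpportunityAge"].describe())

# Share of "Closed Won" opportunities that closed within a week.
won = closed[closed["StageName"] == "Closed Won"]
print(round((won["OpportunityAge"] <= 7).mean(), 3))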

We begin treating this imbalance by bounding Opportunity Age to the range that is appropriate for model training. Identifying and omitting skewed data that does not fit the business case keeps the model from learning patterns that will not generalize to the business’s actual sales cycle. Upon further analysis, we found that many opportunities are only created in Salesforce when a sales representative is already certain they will close shortly afterwards. These particular opportunities have an age of two days or less (indicated below). This is a prime example of selection bias in the data capture process, and including these observations would not help the model generalize. They carry little signal, as they are effectively always destined to be won, and scoring them would put an undue and impractical burden on the open sales pipeline. Lastly, even if sales representatives continue to log these short-term wins, they would not be included in future training datasets, nor would they remain open long enough for the model to recommend decisions.

“Closed/Won” Opportunities with Opportunity Age Less than 2 Days

After consulting with the business teams involved in this process and obtaining their approval, we decided to exclude the opportunities that closed within two days. In making the decision to omit data (which could introduce a bias of its own), we diligently ensured that acceptable domain criteria were met. Any omitted observations could have contributed variance the model would otherwise capture, so it is important to re-evaluate these modeling assumptions regularly.

By dropping the observations labeled as “Closed” with an Opportunity Age of one or two days, we minimize the bias these opportunities add: 98% of them are “Closed/Won”. In short, including this data would not increase the variance in the response variable that the covariates can explain.
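
Expressed as code (again continuing the illustrative pandas sketch, outside of Dataflow), the exclusion is a single filter:

# Drop closed opportunities logged with a sales cycle of two days or fewer;
# roughly 98% of them are "Closed Won" and they reflect selection bias in
# the data capture process rather than a real sales cycle.
short_closes = closed["OpportunityAge"] <= 2
training = closed[~short_closes].copy()

print(f"Removed {short_closes.sum()} short-close rows; "
      f"{len(training)} rows remain for training.")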

Applicable treatments for this case:

  • Normalize the Opportunity Age variable
  • Omit opportunities that were recorded as “Closed/Won” with a sales cycle of two days or fewer

While this solves the first half of the problem statement, we still need to correct the skewed distribution of Opportunity Age, particularly its interaction with the response variable. To fully understand the business process, we met with various stakeholders to map out the sales process in detail. The business teams indicated that they typically take around 40–55 days to close an opportunity; in contrast, the historical dataset shows the bulk of opportunities closing in 3–20 days.

Einstein Discovery Insights displaying bucketed distribution of the variable Opportunity Age: buckets > 25 days fall below the average reference line (non-constant correlation)

Observing the chart above, any observation with an Opportunity Age greater than 25 days would be penalized for its age. Additionally, Opportunity Age follows a non-Gaussian distribution, so a data transformation (normalization) is needed to address its heavy tails (values range from 3 to 808 days). We rescale Opportunity Age to the range 40–450 days, which allows us to train the model on a larger breadth of the dataset without omitting more data. With the normalized Opportunity Age bounded to 40–450, we obtain a predictor that is both actionable and statistically significant.

Opp Age (Normalized) = Norm_min + (Actual_Val - OppAge_min) * (Norm_max - Norm_min) / (OppAge_max - OppAge_min)

Norm_min → Minimum Normalized Value (40 in this case)

Norm_max → Maximum Normalized Value (450 in this case)

OppAge_min → Minimum Actual Opportunity Age in the Training sample

OppAge_max → Maximum Actual Opportunity Age in the Training sample
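
As a quick sanity check of the formula outside of Dataflow, here is a minimal Python sketch; the training minimum and maximum in the example are assumed from the 3–808 day range mentioned above.

def normalize_opp_age(actual_val, opp_age_min, opp_age_max,
                      norm_min=40.0, norm_max=450.0):
    """Min-max rescale an Opportunity Age into the [norm_min, norm_max] range."""
    scale = (norm_max - norm_min) / (opp_age_max - opp_age_min)
    return norm_min + (actual_val - opp_age_min) * scale

# Example: with a training range of 3-808 days, a 45-day-old opportunity
# maps to roughly 61 on the normalized scale.
print(round(normalize_opp_age(45, opp_age_min=3, opp_age_max=808), 1))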

SAQL expression in Einstein Analytics Dataflow for the above formula:

{
  "precision": 18,
  "saqlExpression": "40 + ('OpportunityAge' - 'Opp.Opp_Min_Age') * 410 / ('OppMax.OpportunityAge' - 'Opp.Opp_Min_Age')",
  "name": "N_OppAge1",
  "scale": 0,
  "label": "Opportunity Age N (Pre-Adjusted)",
  "type": "Numeric"
}

After normalization, Opportunity Age reflects a more noticeable pattern (constant correlation).

After normalizing the Opportunity Age variable, we begin to see constant variance for each numerical bucket.

Writeback:

To begin the setup of the writeback / model deployment process, we map the normalized Opportunity Age variable to the actual Opportunity Age variable in the output dataset:

Normalized Age mapped to Actual Opportunity Age in output dataset (open opportunities*)

Maximum Age Outlier in Output Data:

There could be open opportunities with an age over 440 days (the maximum Opportunity Age in the training data is lower than the maximum Opportunity Age in the output data). For these observations, we extract the maximum age from the output data and insert it into the training data grain, enabling the following calculation. This ensures the model does not impose overly tight bounds on Opportunity Age for extreme outliers found in the output dataset:

{
  "precision": 18,
  "saqlExpression": "case when 'N_OppAge1' > 440 then 'Opp_v.OpportunityAge' else 'N_OppAge1' end",
  "name": "N_OppAge",
  "scale": 0,
  "label": "Opportunity Age N",
  "type": "Numeric"
}

N_OppAge1: the normalized Opportunity Age from the previous calculation

‘Opp_v.OpportunityAge’: the maximum Opportunity Age in the output dataset
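
The same adjustment, expressed as a minimal Python sketch using a small hypothetical frame of scored open opportunities (column names follow the fields above, values are made up for illustration):

import numpy as np
import pandas as pd

# Hypothetical scored output rows: normalized age plus the actual age.
scored = pd.DataFrame({
    "N_OppAge1": [120.0, 445.0, 300.0],
    "OpportunityAge": [160, 620, 510],
})

# Fall back to the actual age whenever the normalized value exceeds 440,
# so extreme outliers in the output data are not artificially bounded.
scored["N_OppAge"] = np.where(scored["N_OppAge1"] > 440,
                              scored["OpportunityAge"],
                              scored["N_OppAge1"])
print(scored)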

The current performance metrics of our new model after removing bias

By tracking down these inherent biases and executing steps like dropping certain opportunities (those closing within 0–2 days) from the training dataset and normalizing Opportunity Age to a 40–450 day range, we resolve the issues found in the problem statement. A great takeaway from this example: it is still on the Data Scientists and Data Engineers to identify and remove bias. AutoML-type algorithms are robust, efficient, and cost-friendly; however, it falls on the user to ensure bias is minimized and checked!
