When working with techniques in CRM Analytics (CRMA, formerly Tableau CRM), have you ever felt confused regarding which join to use? Did you ever wonder if you should use a Lookup instead of a Left Join? If so, you are absolutely in the right place.
If you want to get meaningful and accurate data, it is necessary to use the correct type of join. This blog will help remove any confusion regarding joins and help you understand the different situations in which these joins can be used.
CRMA techniques are used for various kinds of data transformations and operations such as bucketing data, adding formula fields, filtering data, and cleaning data prior to outputting to a target dataset. We can also combine related columns of the data from multiple datasets or connected objects using joins. There are various join options in CRM Analytics, such as:
- Left Join
- Right Join
- Inner Join
- Full Outer Join
- Cross Join
Let’s deep dive
First of all, create a base dataset and then use input nodes to import multiple datasets or objects.
Once this is done, select the Join and choose the type of join as needed.
Note- If there are A and B input nodes and you want to make ‘B’ the left and driving node, then drag the + icon from the A node to B.
We use a Lookup, when we want to return all data from the left stream and only matching rows from the right stream. In order to ensure that the grain of the result doesn’t change, the Lookup outputs one row for each row in the left data stream by default. However, there is an option to set the Lookup to return all the matching rows.
For example, below in this Employee data we want to build a dashboard that segments the records by Employee information.
If the Lookup is configured to return a single match when multiple matches are found, the Lookup produces the following results based on the matching keys.
All rows from the left, including rows without a match, are included in the result. Although Employee Id 1042 has two matches on the right, the Lookup returns only the first matching record. Also, because Employee Id 4567 doesn’t have a match, the Employee Id (right key) and Department columns are null, and Annual Package is 0 for that record (unmatched dimensions are set to null and unmatched measures are set to 0)
We can also configure the Lookup to return all matching values when multiple matches are found.
Multiple-match Lookup will give the same results as the single-match Lookup, except for the left rows that have multiple matches. Unlike a single-match Lookup, a multiple-match Lookup returns all matches and combines them.
For dimension columns, it generates a multivalue column with all dimension values, such as it did for the Department Column for Employee Id 1042.
For measure columns, it adds the measure values from all matched records. For Employee Id 1042, the Annual Package column is 2200000 (1000000 + 1200000).
Please note that Preview tab doesn’t show all values in a multivalue column. Instead, it shows the first value only, as shown in the below image.
Like a Lookup, Left Join outputs all rows from the left dataset and only matching rows from the right dataset. Unlike a Lookup, it includes all matched rows in the target when multiple rows match.
Consider the following two data streams that feed the target dataset.
After performing the Left Join based on the matching keys, the join produces the following output.
All rows from the left, including those without a match, and those with multiple matches, are included in the target. Since Employee Id 1042 has two rows that match on the right, the target contains a separate record for each Department. Also, because Employee Id 4567 doesn’t have a match, the Employee Id (right key) and Department columns are null for that record.
Right Joins are used when we want to include all rows from the right dataset and matching rows from the left dataset. It includes all the matched rows in the target when multiple rows match.
After performing the Right Join on our example data,the join will produce the following output.
All rows from the right, including those without a match, and those with multiple matches, are included in the target. Because Employee Id (Right Key) 5062 has two matches on the left (Harry and Brian), the target contains a separate record for each of them. Also, because Employee Id (right key) 7568 doesn’t have a match, the Employee Id (left key) and Employee Name columns are null for that record.
If duplicate records contain measures (as shown for Employee Id 1042 in the following screenshot), take care to not double count the measures when aggregating. To prevent duplicate records, use a Lookup instead of a join or address the duplication prior to joining the data.
An Inner Join includes only matching rows from the left and right based on common fields. It includes all the matched rows in the target when multiple rows match.
After performing the Inner Join on our example data , the join will produce the following output.
All rows that have a match are included in the target. Because of multiple matches, you see two records with Employee Id 5062 (Harry and Brian) and two records with Employee Id 1042 (Analytics and Development). You can also notice that Employee Id 7568 and Employee Id 4567 are excluded from the target because neither appeared in both the left and right tables and therefore didn’t have a match.
Full Outer Join
Full Outer Join returns all records whether there is a match or not between the datasets.
After performing the Full Outer Join on our input example data , the join will produce the following output.
Regardless of whether they have matches, rows are included in the target. Because of multiple matches, you see two records with Employee Id 5062 (Harry and Brian) and two records with Employee Id 1042 (Analytics and Development). Employee Id 7568 and Employee Id 4567 are also included despite not having a match.
By including unmatched left records, we now know that we have to find out Sarah’s department to complete this dataset and that no employee is in the Data Science team.
This is the latest addition to Data Prep’s join types. In CRMA, a Cross Join combines unrelated records and includes all rows from the left and right streams. Unlike other joins that use keys to find matches, the Cross Join pairs every row from one dataset with every row of another dataset (known as the Cartesian Product).
When designing a Cross Join, use the larger dataset first (left) and then join the smaller dataset (right). There are some limitations of Cross Join, because in CRMA they can result in very large datasets. Keep these points in mind when designing and testing your Cross Join:
- Only one Cross Join per dataset
- Cross Join results maximum of 10 million
- Left dataset column maximum of 8 and row maximum of 100,000
- Right dataset column maximum of 8 and row maximum of 1,000
- Doesn’t support multi-value columns
Applying a Cross Join in our example data will produce the following output:
It has paired every row from one dataset with every row of another dataset, creating a cartesian product.
You should have a particular use case in mind prior to using a Cross Join and consider the implications. For example, Company XYZ sells products in India and the USA. To assist with stock tracking, we can use a Cross Join to combine the product and store datasets to find all the possible combinations of product and store. Cross Join will produce a dataset with a row for each product in both India and USA stores.
Considerations When Using Joins
- Don’t double count measures when aggregating records from a join that duplicates row values; try to use Lookup
- Refrain from using joins when the join keys have a many-to-many relationship. This can make the target dataset become significantly larger than the input data streams