Five Best Practices for the Wave Dataflow

Once you have moved past creating the first handful of Datasets you need for your Dashboards in Wave, you might begin to notice that your Dataflow takes longer to run.  Working with our customers, we have found long, complex Dataflows that take over an hour to run, which makes troubleshooting frustrating and frequent updates from your Salesforce org hard to come by.

We wanted to share five tips for building smarter, faster Dataflows.

#1 – Limit Your Digests

When you build a new Dataset using the “Create Dataset” option in Wave, new JSON nodes are appended to your existing Dataflow without any consideration of what is already there.  If you are not careful, over time you can end up with a slow Dataflow full of redundant digests of the same Objects.  Limit your sfdcDigest actions to only the fields you truly need in your Wave Dataset (or elsewhere in your Dataflow), and digest each Object a single time.  Once you have a single digest with all of the fields you need, repoint the source of any JSON nodes that relied on the superfluous digests so those extra digest nodes can be removed.
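As a rough sketch (the Object and field names here are only examples), a trimmed-down sfdcDigest node looks like the one below; every downstream node that needs Opportunity data should then point its “source” (or “left”/“right” in an augment) at this single node:

```json
{
  "Extract_Opportunity": {
    "action": "sfdcDigest",
    "parameters": {
      "object": "Opportunity",
      "fields": [
        { "name": "Id" },
        { "name": "Name" },
        { "name": "Amount" },
        { "name": "StageName" },
        { "name": "CloseDate" },
        { "name": "AccountId" }
      ]
    }
  }
}
```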

#2 – Rename Those Nodes

It is not a lot of fun to unwind a spaghetti bowl of JSON nodes named “101”, “102”, … , “153”, and so on.  Even if you are the only one in your organization who manages your Dataflow JSON, it is extremely helpful to give each node a name that makes sense:  “Extract_Opportunity”, “Augment_Opportunity_withAccount”, and so on.  This practice is even more important if more than one person manages the Dataflow.  Once you have figured out what each node is doing, the renaming itself is not hard with the “find and replace” feature of any good JSON editor; just remember that every reference to the old name has to change along with the node itself.
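As a hypothetical illustration (the names are invented for this example), a registration node originally keyed “104” with “source”: “103” reads much better once both it and the node it points to have been renamed, and the reference has been updated to match:

```json
"Register_Opportunities": {
  "action": "sfdcRegister",
  "parameters": {
    "alias": "Opportunities",
    "name": "Opportunities",
    "source": "Augment_Opportunity_withAccount"
  }
}
```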

#3 – Keep Your Grain Straight

This one goes beyond a best practice, since everything that happens in Wave depends on an understanding of Dataset grain.  Keeping your grain straight in the Dataflow requires a solid understanding of the relationships between your data sources as well as a careful order of operations in the Dataflow itself.  As a general rule, I suggest digesting your base grain-level source data first and always “keeping it on the left” as you go through your augment actions.  Keep an eye out for those occasions when you need a multi-value lookup, such as adding Opportunity Team members to your Opportunity grain.
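Here is a sketch of what that looks like (node and field names are again illustrative): the augment keeps the Opportunity grain on the left and pulls Account fields in from the right.  For cases like Opportunity Team members, where the right side can match more than one row, the augment transformation also accepts an “operation” parameter (“LookupMultiValue”) so the matches land in a multi-value field rather than a single arbitrary row.

```json
"Augment_Opportunity_withAccount": {
  "action": "augment",
  "parameters": {
    "left": "Extract_Opportunity",
    "left_key": [ "AccountId" ],
    "right": "Extract_Account",
    "right_key": [ "Id" ],
    "relationship": "Account",
    "right_select": [ "Name", "Industry" ]
  }
}
```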

#4 – Back Up Your Code

This is a basic requirement of any developer’s toolkit, and unfortunately Wave does not yet make it an easy task.  Any time you download the JSON from the Data Monitor, make a backup and keep it someplace safe.  Keeping the JSON on hand can also help you debug new changes you are planning to add to the Dataflow:  upload a partial version of the Dataflow that contains only the parts you are testing, so run times stay short while you work out the kinks in the JSON, then bring it all back together at the end.  And don’t forget to save that final version again!
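A partial test version might contain nothing more than the nodes you are changing plus a register so you can inspect the result (the names here are only examples; register to a throwaway Dataset so you don’t overwrite the real one):

```json
{
  "Extract_Case": {
    "action": "sfdcDigest",
    "parameters": {
      "object": "Case",
      "fields": [ { "name": "Id" }, { "name": "Status" }, { "name": "CreatedDate" } ]
    }
  },
  "Register_Case_Test": {
    "action": "sfdcRegister",
    "parameters": {
      "alias": "Case_Test",
      "name": "Case Test",
      "source": "Extract_Case"
    }
  }
}
```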

#5 – Use Replication

Replication is a newer Wave feature that can add a lot of efficiency to your Dataflow process.  It separates the digesting of Salesforce Objects from the rest of the actions in the Dataflow:  all of the necessary Objects are digested on their own schedule via “replication”, which creates temporary copies of the Objects for the Dataflow to read.  To make this even more powerful, the replicated copies are refreshed incrementally, so your Dataflows run even faster because they are moving less data.

With these five best practices, you can work to make your Dataflows faster, smarter, and easier to manage.