We are quickly becoming big fans of Google Cloud Dataflow. See our previous posts on this topic here and here. We are excited to continue using this product and creating analytics solutions for our customers. In this post we describe how the RFM (Recency, Frequency, Monetary) technique can be applied to large data sets – think hundreds of millions of rows. Processing at that scale may not be feasible with traditional desktop or single (or limited) server based programming frameworks. Google Cloud Dataflow offers a scalable environment to process and transform large datasets. Here’s a quick summary of how one can leverage Dataflow to implement RFM and segment customers.
- Connect and extract data from source systems. This can be done via the existing “connectors” that Dataflow provides (e.g. BigQuery, text files) or via custom sources you can write.
- Write your Dataflow pipeline job. Here are the high-level steps.
- Pipeline – a Pipeline represents a Dataflow job. The job for RFM analysis consumes order transaction data with attributes such as customer ID, order date, and order amount. A pipeline operates on data through PCollections; for example, the data read from a source system becomes a PCollection.
- PTransform – the first transform step takes one or more PCollections as input, applies the transformation or enrichment logic, and produces an output PCollection. For RFM analysis, the GroupByKey core transform provided in the SDK can be applied to group transactions by customer ID and a time slice such as month.
- Another PTransform is needed to find the minimum and maximum of each of the R, F and M parameters. Here again, Dataflow provides built-in statistical transforms, Min and Max, to do the job.
- Once we have the max and min, we can divide each parameter into quintiles, with the top 20% coded as 5, the next 20% as 4, and so on.
- Once done, the RFM cell (e.g. 5-4-3) along with the customer ID can be written to a data visualization system such as Salesforce Wave.
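To make the grouping and aggregation steps concrete, here is a minimal plain-Python sketch of the logic (outside Dataflow). The sample orders and the `as_of` reference date are made up for illustration; in the real pipeline, the grouping would be a GroupByKey over a PCollection and the per-customer aggregation would be a PTransform.

```python
from datetime import date
from collections import defaultdict

# Hypothetical sample order transactions: (customer_id, order_date, order_amount)
orders = [
    ("c1", date(2015, 1, 5), 120.0),
    ("c1", date(2015, 3, 2), 80.0),
    ("c2", date(2015, 2, 14), 40.0),
    ("c2", date(2015, 3, 20), 60.0),
    ("c2", date(2015, 3, 28), 25.0),
]

as_of = date(2015, 4, 1)  # reference date for computing recency

# Equivalent of a GroupByKey on customer ID
by_customer = defaultdict(list)
for customer_id, order_date, amount in orders:
    by_customer[customer_id].append((order_date, amount))

# Compute the R, F and M parameters per customer
rfm = {}
for customer_id, txns in by_customer.items():
    recency = (as_of - max(d for d, _ in txns)).days  # days since last order
    frequency = len(txns)                             # number of orders
    monetary = sum(a for _, a in txns)                # total order amount
    rfm[customer_id] = (recency, frequency, monetary)
```

Once each customer’s `(R, F, M)` tuple exists, the min/max step from the post is a simple pass over the values of `rfm`.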
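The quintile coding can also be sketched in a few lines. The helper below is rank-based (a true quintile split: top 20% of customers get a 5); the function name and the tie-breaking behavior are our own assumptions, not part of any SDK. Note that for recency a lower value is better, so the ordering is flipped.

```python
def quintile_scores(values, higher_is_better=True):
    """Assign each value a score from 1 to 5, top 20% scored 5.

    Rank-based sketch; ties are broken by input order. For recency,
    lower is better, so pass higher_is_better=False.
    """
    n = len(values)
    # Indices sorted best-first
    order = sorted(range(n), key=lambda i: values[i], reverse=higher_is_better)
    scores = [0] * n
    for rank, i in enumerate(order):
        # Map rank 0..n-1 onto scores 5..1, 20% of customers per score
        scores[i] = 5 - (rank * 5) // n
    return scores

# Example: monetary values, higher is better
monetary_scores = quintile_scores([200.0, 125.0, 90.0, 50.0, 10.0])
```

Scoring each of R, F and M this way yields the three digits of the RFM cell for each customer.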
This is a high-level post. We’re working on a prototype of this solution and will follow up with more details on the pipeline job soon.