When it comes to replicating your Harvest data to Google BigQuery, conventional ETL is no longer the only game in town. Writing ETL code requires significant investments of time and expertise; this is precisely why some data people semi-affectionately expand the abbreviation as "Extremely Tough to Load". Newer approaches to data ingestion, like using MuleSoft, deliver faster implementation than traditional ETL and help businesses stand up business intelligence quickly. In this blog, we will show you how we extracted data from Harvest and loaded it into Google BigQuery using MuleSoft, in comparison to the conventional way of doing it. Let's get started.
ETL – MuleSoft vs. other methods
To begin with, we decided to use a Python script to migrate the data into BigQuery. Harvest provides a REST API for accessing the data in a Harvest account. So, we fetched the data from Harvest through the REST API as a JSON object and then:
- Cropped the required fields and converted the JSON object into a pandas DataFrame (making it easier to push into BigQuery using the pandas_gbq library)
- Pushed the DataFrame into BigQuery
- Built dashboards in Data Studio using BigQuery as the data source
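The Python pipeline above can be sketched roughly as follows. This is a minimal illustration, not our production script: the field names, table name, and project ID are placeholders, and the sample payload only mimics the shape of a Harvest time-entries response. The actual fetch would call the Harvest REST API with your account's credentials, and the pandas_gbq push assumes BigQuery credentials are already configured.

```python
import pandas as pd


def crop_time_entries(payload):
    """Keep only the fields we need from a Harvest-style time-entries payload."""
    rows = []
    for entry in payload.get("time_entries", []):
        rows.append({
            "id": entry["id"],
            "spent_date": entry["spent_date"],
            "hours": entry["hours"],
            "project": entry["project"]["name"],  # flatten the nested project object
        })
    return pd.DataFrame(rows)


def push_to_bigquery(df, table="harvest.time_entries", project="my-gcp-project"):
    """Push the DataFrame to BigQuery (requires `pip install pandas-gbq` and GCP auth)."""
    import pandas_gbq
    pandas_gbq.to_gbq(df, table, project_id=project, if_exists="append")


# A tiny sample payload in the general shape the Harvest API returns:
sample = {
    "time_entries": [
        {"id": 1, "spent_date": "2020-01-06", "hours": 2.5,
         "project": {"id": 10, "name": "Internal"}},
        {"id": 2, "spent_date": "2020-01-07", "hours": 4.0,
         "project": {"id": 11, "name": "Client A"}},
    ]
}

df = crop_time_entries(sample)
print(df.shape)  # (2, 4)
```

From here, `push_to_bigquery(df)` would load the frame, and Data Studio can then read the table directly as a data source.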
The process took about 3-4 weeks, and we also encountered some issues during the migration.
We implemented the same thing using MuleSoft to see how quick and easy the process could get, and it didn't let us down; the implementation itself was simple and took only 3-4 days. MuleSoft is modular, and all the components can be reused by anyone in your organization (via a private MuleSoft Exchange), so if you decide to build anything new on top of Harvest or BigQuery, you can reuse these components and save plenty of time and effort instead of reinventing the wheel.
So, how did we do it? Here is our implementation strategy in MuleSoft:
We pulled the Harvest data using the Request operation of the HTTP connector, available in MuleSoft Exchange. As there is no official BigQuery connector in MuleSoft Exchange, the options are either to create one or to write custom Java code. We chose to write custom Java code, using the Java module in MuleSoft to call its functions as needed.
The Java module has three components: New, Invoke, and Invoke static.
- New component: Instantiates a new object of a class and can be configured as needed.
We provided the class and constructor name so that all of the initial configuration setup code runs in the constructor, and we used this class object in subsequent function calls throughout the flow, as the Java module follows a singleton design pattern.
- Invoke component: Calls any function on the class object.
You can chain as many Invoke components as needed and then finally transform the output with a Transform Message component. To configure the Invoke operation, pass the function arguments as an object structure, but remember to use arg0, arg1, and so on as the argument keys. And yes, you can pass a variable or any custom value as a parameter to the function you are calling.
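As a hypothetical illustration of that arg0/arg1 convention, the Args expression of an Invoke operation calling a two-parameter Java method might look like the DataWeave fragment below (the method, variable, and payload names are made up for the example):

```dataweave
%dw 2.0
output application/java
---
{
  arg0: vars.tableName, // first parameter of the Java method
  arg1: payload         // second parameter of the Java method
}
```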
In our case, we used the constructor to set up the BigQuery authorization configuration. In the next step, we used several Invoke components to call the BigQuery API and push the data into the BigQuery dataset.
Once everything was working on the local Mule runtime, we pushed the code seamlessly from Anypoint Studio to CloudHub.
VOILA! There you have it.
SpringML provides deep expertise in data integration and big data management using MuleSoft. With dozens of customer engagements leveraging MuleSoft as the key to unlocking data, SpringML has seen great success in delivering meaningful insights to the business stakeholders. AI is a journey, and our experts will meet you wherever you are. Contact us to learn more.