At the core of the data integration process is a data pipeline: a data processing layer that connects all the required sources and transforms the data into a single unified data platform. The main reason for integrating data into a unified source is to remove the overhead of reaching out to multiple siloed data sources and to provide a single source of clean data for reporting. It also reduces the overall time required to get the intended data into the reports and dashboards used for data-driven decisions.
Any complex process benefits from a well-thought-out strategy that captures lessons learned in the form of ‘design patterns’. These patterns form a guiding light for any new implementation. Without the guardrails of design patterns, the data integration process becomes chaotic and time-consuming.
As the data integration market matures, we see more options for working with data. In the current ecosystem, a wide range of upcoming tools can be grouped under one of two categories:
- ‘Low-code’ option, in which data engineers use a framework with connectors and build the integration pipeline using popular languages like Python, Java, etc.
- ‘No-code’ option, a GUI-based tool for users who are not familiar with programming.
Google defines it as:
“Data integration is the process of bringing together data from different sources to gain a unified and more valuable view of it, so that your business can make faster and better decisions.
Data integration can consolidate all kinds of data—structured, unstructured, batch, and streaming—to do everything from basic querying of inventory databases to complex predictive analytics.”
Data Processing Frequency
Before jumping into the data integration strategy, there is an important business requirement that has to be made clear, as it drives the whole discussion: how often the data needs to be integrated.
Based on data processing frequency, the data to be integrated can take one of the following three forms:
- Batch processing for delayed consolidation of data
Example: For a retail sales report, data from sources like e-commerce, POS, mobile, social, etc. is integrated into the data warehouse using batch processing, and the sales reports are generated from it.
- Micro-batch for near real-time data
Example: Building a near real-time demand forecasting model from user click events on web and mobile. The pipeline can be built with tools like Spark Streaming to process data in small, frequent batches, and a Vertex AI forecasting model can be used to derive predictions.
- Streaming for real-time data
Example: Online fraud detection is a major use case for streaming data analysis, as it is time-sensitive. The window of time to integrate, analyze, and flag fraud is in milliseconds, so streaming is the only option here: it uses messaging queues like Pub/Sub for ingestion, cloud data warehouses like BigQuery or Redshift for efficient processing, and ML models to detect fraud.
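To make the batch vs. micro-batch distinction concrete, here is a minimal pure-Python sketch of windowing: events are grouped into fixed-size time windows, and each window becomes one small batch that is processed together. The function name and event shapes are illustrative, not from any particular framework.

```python
from collections import defaultdict

def micro_batch(events, window_seconds=5):
    """Group (timestamp, payload) events into fixed-size time windows.

    A toy stand-in for what tools like Spark Streaming do: each
    window of events is processed together as one small batch.
    """
    windows = defaultdict(list)
    for ts, payload in events:
        window_start = ts - (ts % window_seconds)  # floor to window boundary
        windows[window_start].append(payload)
    return dict(windows)

# Click events as (epoch_seconds, event) pairs
clicks = [(0, "view"), (2, "add_to_cart"), (6, "view"), (7, "purchase")]
batches = micro_batch(clicks, window_seconds=5)
# Events at t=0 and t=2 land in the [0, 5) window; t=6 and t=7 in [5, 10)
```

Shrinking `window_seconds` moves this toward streaming; growing it toward classic batch, which is really the whole spectrum the three forms above describe.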
Common Integration Design Patterns
Currently, batch processing constitutes the majority of use cases. That being said, the micro-batch and streaming application ecosystem has improved a lot and is gaining popularity; fraud detection, real-time inventory management, stock trading, and gaming are a few examples. New technologies like mobile phones, IoT devices, and metaverse platforms, backed by super-fast hardware, produce data more frequently, which in turn needs to be captured and analyzed quickly.
1. API – Traditional data integration was code-heavy and involved custom code for all three phases – Extract, Transform & Load. Here, Application Programming Interfaces (APIs) became the standard contracts between source systems and the custom programs used to build the pipeline. This approach was tedious and time-consuming: adding a new data source or updating an existing one became a nightmare, as it required a lot of coordination between teams and systems.
2. ETL – As the ecosystem evolved, packaged solutions like Informatica, Ab Initio, Oracle Warehouse Builder, SAP BusinessObjects, etc. replaced code-heavy ETL. Extract, Transform, and Load creates three stages of data integration in which the source data is cleaned up and transformed before being loaded into the data warehouse for advanced analytics.
Problem: A fintech company providing home mortgages would like to know the estimated value of a home.
Solution: Home prices depend on a few basic parameters like location, neighborhood, house size, age, taxes, etc., and also on external factors like similar home sales, the assigned school’s rating, demographic changes, etc.
The data pipeline uses batch data from sources like the county for property price and tax, the MLS for similar home sales, school rating systems, etc. In this case, data can be consolidated periodically and the estimates refreshed as new updates arrive.
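The three ETL stages for this example can be sketched in a few lines of Python. Everything here is hypothetical: the source records, the blending formula, and the in-memory "warehouse" stand in for real county, MLS, and school-rating feeds and a real DW load step.

```python
import statistics

def extract():
    """Stand-ins for periodic batch pulls from county, MLS, and school-rating sources."""
    county = {"address": "12 Oak St", "sqft": 1800, "tax_assessment": 310_000}
    mls_comps = [{"sale_price": 340_000}, {"sale_price": 355_000}, {"sale_price": 348_000}]
    schools = {"rating": 8}
    return county, mls_comps, schools

def transform(county, mls_comps, schools):
    """Blend comparable sales with the tax assessment, nudged by school rating."""
    comp_median = statistics.median(c["sale_price"] for c in mls_comps)
    base = (comp_median + county["tax_assessment"]) / 2
    # Illustrative adjustment: +/-1% per rating point above or below an average of 5
    school_factor = 1 + (schools["rating"] - 5) * 0.01
    return {"address": county["address"], "estimate": round(base * school_factor)}

def load(record, warehouse):
    """In ETL the transformed record, not the raw data, reaches the warehouse."""
    warehouse.append(record)

warehouse = []
load(transform(*extract()), warehouse)
```

The point of the pattern is visible in the call order: `transform` runs before `load`, so only cleaned, derived records ever land in the warehouse.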
3. ELT – As technology evolved, cloud computing became popular, paving the way for data warehouses on the cloud. The cloud version of the DW had more processing speed and storage capacity, which enabled a change in the sequence of operations: the extracted data is loaded directly into the warehouse, and the transformations run on top of it.
Problem: A retailer trying to understand customer behavior to build personalization.
Solution: Building a customer data platform involves data from multiple sources as listed below:
- Transactional data from ecommerce website
- Transactional data from POS
- Transactional data from Social
- Transactional data from Mobile
- Analytics data
- Marketing vendor data
The ELT process uses cloud data warehouses like BigQuery or Redshift, which have huge processing power and can handle petabyte-scale data.
Data will be processed in logically separated zones that include:
- The landing zone, which holds the raw data from all sources
- The staging zone, which usually contains deduplicated and cleaned data produced through transformations
- The production zone, which contains only the subset of staging data used to analyze a particular use case. In this example, the customer journey data reveals how a customer moved from mobile to desktop before making a purchase, or found a product on social media and walked into a store to complete the purchase
Identifying customer behavior across multiple data sources provides an end-to-end perspective for understanding customers and creating cohorts for personalization. In this example, we have the right data points to create a campaign offering discounts for buying on mobile or social platforms, cutting the number of hops a customer goes through to complete a transaction. This personalization would increase conversion and the success rate of the campaign.
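The zone flow above can be sketched in plain Python. The row shapes and dedup key are hypothetical, and real ELT transformations would run as SQL inside the warehouse rather than in application code, but the landing → staging → production progression is the same.

```python
def to_staging(landing_rows):
    """Staging zone: deduplicate raw landing-zone rows on a composite key."""
    seen, staged = set(), []
    for row in landing_rows:
        key = (row["customer_id"], row["channel"], row["order_id"])
        if key not in seen:
            seen.add(key)
            staged.append(row)
    return staged

def to_production(staged_rows, customer_id):
    """Production zone: one customer's cross-channel journey, ordered by time."""
    journey = [r for r in staged_rows if r["customer_id"] == customer_id]
    return sorted(journey, key=lambda r: r["ts"])

# Raw rows loaded as-is (ELT: load first, transform in place later)
landing = [
    {"customer_id": 1, "channel": "mobile", "order_id": None, "ts": 1, "event": "browse"},
    {"customer_id": 1, "channel": "mobile", "order_id": None, "ts": 1, "event": "browse"},  # duplicate load
    {"customer_id": 1, "channel": "pos", "order_id": 42, "ts": 5, "event": "purchase"},
]
journey = to_production(to_staging(landing), customer_id=1)
# journey shows a mobile browse followed by an in-store purchase
```

Note that the duplicate landing row survives the load on purpose; in ELT, cleanup happens inside the warehouse after the raw data has arrived.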
4. Messaging Queue (Pub/Sub & Kafka) – With the growing popularity of the mobile, IoT, and gaming sectors, applications started creating more frequent bursts of data, along with a need for quicker and more frequent insights. This disrupted the data stack. The need for real-time and near real-time insights was answered by messaging-queue-based tools on the cloud like Pub/Sub and Kafka, together with streaming data services like Spark and Beam that read real-time data from the queues, transform it, and load it into cloud data warehouses like BigQuery and Redshift.
Problem: A pharmaceutical company launched a new vaccine and would like to understand its adoption and impact.
Solution: Streaming ETL fits the real-time analytics requirement well. The time taken to ingest the different data points is critical, so a streaming pipeline built on a messaging queue, which is always open for incoming data, can be used. As input data arrives, Spark’s streaming engine picks it up in micro-batches to process, transform, and load into BigQuery.
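A minimal sketch of the consumer side, using Python's standard-library `queue` as a stand-in for Pub/Sub or Kafka: messages are drained in small batches, mimicking how a micro-batch engine reads from the queue before loading results downstream. Batch size and message contents are made up for illustration.

```python
import queue

def consume_micro_batches(q, batch_size=3):
    """Drain a message queue in small batches, the way a micro-batch
    streaming job would before transforming and loading each batch."""
    batches, batch = [], []
    while True:
        try:
            msg = q.get_nowait()  # non-blocking read; a real consumer would poll
        except queue.Empty:
            break
        batch.append(msg)
        if len(batch) == batch_size:
            batches.append(batch)
            batch = []
    if batch:  # flush the final partial batch
        batches.append(batch)
    return batches

q = queue.Queue()
for dose_report in ["site-a", "site-b", "site-c", "site-d"]:
    q.put(dose_report)
batches = consume_micro_batches(q)
# Two micro-batches: three full reports, then the one left over
```

A production consumer would also handle acknowledgements and retries, which is exactly the plumbing Pub/Sub and Kafka clients provide.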
5. iPaaS – Integration Platform as a Service is a next-generation integration platform built for application and data integration, using AI/ML and business process automation concepts to provide seamless integration across hybrid and multi-cloud environments. It provides a packaged series of services supporting application and data integration through both API and no-code solutions.
Problem: In the travel and hospitality domain, an airline giant is working on a large-scale digital transformation to rebrand itself from a manufacturing company to a technology company.
Solution: Use the MuleSoft Anypoint Platform for the digital transformation effort, as it brings in the required architectural building blocks for application and data integration.
This API-first approach builds a loosely coupled application layer, and GUI-based API connectors form a no-code solution for connecting APIs between subsystems across multiple cloud platforms and on-prem implementations. It supports constructing a data pipeline to build a consolidated view for visualization.
6. Reverse ETL – An honorable mention goes to Reverse ETL, as we hear about it more frequently now. It is an emerging data transformation approach in which consolidated data in the data warehouse is transformed and injected back into third-party systems like ERP, CDP, and CRM, so those operational systems benefit from the context the warehouse has accumulated.
Problem: Enriching a Salesforce CRM system with all marketing campaign data to improve the customer experience.
Solution: More often than not, the data warehouse is the single source of truth (SSOT) holding data from multiple marketing vendor systems like FB Ads Manager, Google Ads, TikTok Marketing, Tealium AudienceStream, etc. Each data source brings its own dimension. The primary use case for consolidating data in the warehouse is building the visualizations required for actionable reports, but the data can also be made actionable by enriching operational systems like Salesforce CRM. Feeding the customer behavior tracked by FB, TikTok, Google, etc. back into the CRM provides valuable information for building better customer relations.
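Stripped of vendor specifics, Reverse ETL is a small loop: read enriched rows from the warehouse, map them to the operational system's schema, and push them through its API. In this sketch, `crm_update` is a hypothetical stand-in for a real CRM client call (for example, a Salesforce API wrapper), kept as a plain callable so the example stays self-contained; the row and payload fields are made up.

```python
def reverse_etl(warehouse_rows, crm_update):
    """Push warehouse-side campaign attributes back onto CRM contact records.

    `crm_update` is any callable that accepts one payload dict; in a real
    pipeline it would wrap a CRM API client and handle auth and retries.
    """
    synced = 0
    for row in warehouse_rows:
        # Map warehouse columns to the fields the CRM expects
        payload = {
            "email": row["email"],
            "last_campaign": row["campaign"],
            "engagement_score": row["score"],
        }
        crm_update(payload)
        synced += 1
    return synced

# Consolidated campaign data as it might look in the warehouse
rows = [
    {"email": "a@example.com", "campaign": "spring-sale", "score": 0.8},
    {"email": "b@example.com", "campaign": "spring-sale", "score": 0.3},
]
sent = []
count = reverse_etl(rows, crm_update=sent.append)
```

The direction of data flow is the only thing that is reversed; the extract-map-load shape is the same as ordinary ETL.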
Low-Code or No-Code (Drag-n-Drop) Tools
“Buy vs. build” is one of the most essential questions a business user asks when evaluating the ROI of any project. For this problem, the uniqueness of the requirements created by the business, along with the team’s availability, provides a clear direction: either buy off-the-shelf products and make some configuration changes, or build from scratch.
In the cloud era, as services gain maturity, this question has transformed into “low-code vs. no-code”. Most cloud services are built to minimize the overhead of setting up a tool, or to remove boilerplate code by offering pre-built, reusable components. Here are a couple of examples of data integration tools.
Google Dataflow is built on top of the Apache Beam project and can process both batch and stream data. It introduces the concept of ‘Dataflow templates’, which decouples the development and execution parts of the pipeline. Pre-built or custom templates provide the required processing logic, bringing down the total effort required to build a pipeline. This low-code option is faster and more reliable than performing the same task with Apache Beam’s non-template approach.
Going a step beyond the low-code option, a no-code option provides all the components required to build a pipeline directly in the browser.
Google Data Fusion is a no-code solution for data integration built on top of the Cask Data Application Platform (CDAP). Its pre-built connectors can be plugged into the tool to tie things together. It provides a simple drag-and-drop interface to create and execute a data pipeline, and comes with hooks to on-prem systems and other clouds for seamless integration in a hybrid or multi-cloud solution.
Data integration tools have evolved through multiple iterations and are seeing some fundamental changes due to cloud adoption. Public cloud platforms like GCP, AWS, and Azure have made investing in the data cloud their number one priority, and as a result we are seeing noticeable changes in user adoption.
Gartner has acknowledged in its latest Magic Quadrant for Data Integration Tools that the shift toward hybrid and multi-cloud environments has created renewed momentum. As a result of these changes, hyperscalers are on the cusp of replacing traditional tools.
- Informatica Cloud Integration is used to perform cloud-native ETL and ELT. It uses a Hadoop cluster to run the Blaze, Spark, or Hive engines
- IBM Cloud Pak for Data is a complete AI-based data integration solution built on the OpenShift container platform
- SAP Integration Suite is an iPaaS solution for integrating on-prem and cloud-based applications managed by SAP. It also integrates with the SAP HANA platform
- Oracle GoldenGate is a log-based change data capture and replication platform built around the Oracle database. It has a strong product portfolio that supports data fabric designs
- Snaplogic is a fast-growing iPaaS solution used to integrate On-prem applications, SaaS applications, and Cloud data sources.
A few newer tools like Striim, Fivetran, etc. are catching up to the competition and defining their own space in the data world.
Google Cloud Ecosystem
- Data Fusion is a fully managed no-code integration tool, built on the Cask Data Application Platform (CDAP), that provides GUI tools to create data pipelines
- Dataflow is an integration framework built on top of Apache Beam that supports both batch and stream data processing
- Dataproc is a fully managed service for running Apache Hadoop, Spark, Flink, Presto, etc.
- BigQuery is a fully managed, petabyte-scale analytics data warehouse that supports a SQL query language and efficient algorithms for accessing huge amounts of data very quickly. It supports machine learning through SQL-based BigQuery ML
- Datastream is a serverless change data capture tool that can stream low-latency data from Oracle or MySQL databases to services like BigQuery, Dataflow, Data Fusion, etc.
The data integration process plays a big role in managing data. It is a key step that has to be well defined in the strategy phase, before starting any data initiative like a data warehouse, customer data platform, Customer 360, etc. Knowing the right design pattern for the right scenario makes things much simpler.