In this three-part series we will walk you through what the journey to migrate from an on-premises or cloud data warehouse to BigQuery looks like. This first part deals with planning. The second part will discuss execution, and the third part will outline post-migration steps such as validation and monitoring.
Undertaking a data warehouse migration is a complex process. Proper planning will help you minimize risk and ensure minimal disruption to your end users. Here are a few things to think about; talking through the following with stakeholders (end users, infrastructure and network operations teams, database administrators) will help identify the requirements.
- Expected volume of data: break it down by number of rows and columns per day, week, or month.
- Details of the source schema. The target schema represents how your end users will want to see the data in BigQuery; designing it usually involves a trade-off between ease of use (denormalized) and efficient data access. Also capture the types of users who will be accessing the data.
- Is there a logical data model that represents the data presentation requirements? This could come from your BI tools or from an analysis of the reports and dashboards in use today. What are the grain level, dimensions, measures, aggregations, and so on?
- Is there a need for federated data sources? Is there a need to support changing source data formats or schemas? Some source systems change their data format frequently. As an example, consider a web application log file where new data elements are added (or removed) as application functionality changes.
Initial Base Load
- How large is the initial load? How many tables, views, and users? Which source systems will data need to be pulled from, and what are the dependencies?
- Backfill data: how far back do you want to go? Think about any regulatory or legal requirements. Is the data load bulk or incremental? If incremental, must the same incremental frequency (e.g. daily changes) be maintained throughout the history, or can it start at a later cut-off date?
- Is there a requirement to clean, de-duplicate, or standardize data ("CA", "CALIF", "California" all become "CA")?
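To make the cleaning requirement concrete, here is a minimal sketch of standardizing free-form state values and de-duplicating rows before load. The alias table, sample rows, and field names are illustrative assumptions; in practice this logic often lives in a transformation step (e.g. a Dataflow pipeline) rather than application code.

```python
# Hypothetical alias table mapping free-form values to a canonical code.
STATE_ALIASES = {"CA": "CA", "CALIF": "CA", "CALIFORNIA": "CA"}

def standardize_state(value: str) -> str:
    """Normalize a free-form state string to its canonical code."""
    key = value.strip().upper()
    return STATE_ALIASES.get(key, key)

def deduplicate(rows, key_fields):
    """Keep the first occurrence of each unique key."""
    seen, cleaned = set(), []
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            cleaned.append(row)
    return cleaned

source = [
    {"customer_id": 1, "state": "Calif"},
    {"customer_id": 1, "state": "California"},  # duplicate once standardized
    {"customer_id": 2, "state": "CA"},
]
cleaned = deduplicate(
    [{**r, "state": standardize_state(r["state"])} for r in source],
    key_fields=("customer_id", "state"),
)
```

Note that standardizing first matters: the two rows for customer 1 only collapse into one after their state values agree.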
Ongoing Data Ingestion
- Ingestion techniques per data source: multiple options are available, including Dataflow pipelines, direct BigQuery load jobs, and the streaming API.
- Is the source data mutable? Streaming inserts into BigQuery are append-only: each write adds a new row rather than updating an existing one. (DML UPDATE, DELETE, and MERGE statements are supported, but they are intended for infrequent bulk changes, not high-frequency row-level updates.)
- What is the tolerance for discrepancies between the source and target systems?
- Will raw data be stored, or will data need to be transformed or aggregated?
- Who are the consumers of this data: BI, ETL, ML, etc.?
- What types of ETL jobs will run on these tables? How many, and what are their dependencies?
- Will disparate teams access the data in BigQuery? How granular do the security roles need to be?
- Should a catalog of datasets and tables be provided to users?
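For mutable sources, a common pattern with append-only ingestion is to append every change as a new row with a timestamp, then reconstruct the current state by keeping the latest version per key (in BigQuery this is typically done with a window function over the change log). The sketch below shows the idea in plain Python; the `order_id`, `status`, and `updated_at` fields are hypothetical.

```python
# Hypothetical change log: every source update is appended with an
# "updated_at" timestamp instead of overwriting the existing row.
change_log = [
    {"order_id": 100, "status": "NEW",     "updated_at": "2021-01-01T10:00:00"},
    {"order_id": 100, "status": "SHIPPED", "updated_at": "2021-01-02T09:30:00"},
    {"order_id": 200, "status": "NEW",     "updated_at": "2021-01-01T12:00:00"},
]

def latest_per_key(rows, key, ts):
    """Keep only the most recent version of each row, compared by timestamp."""
    latest = {}
    for row in rows:
        current = latest.get(row[key])
        # ISO-8601 timestamps compare correctly as strings.
        if current is None or row[ts] > current[ts]:
            latest[row[key]] = row
    return sorted(latest.values(), key=lambda r: r[key])

current_state = latest_per_key(change_log, key="order_id", ts="updated_at")
```

This keeps the full change history in the table (useful for audits and backfills) while views over the "latest" rows give consumers an up-to-date picture.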
In the next post, we will discuss the execution phase, covering the various tools and techniques that can be used in the migration process.