In Part 1 of this series, we reviewed some of the planning elements while migrating from an on-premise data warehouse like Teradata to BigQuery. In this post, we will look at the various stages of execution which include schema migration from Teradata to BigQuery, data extraction from Teradata, and then finally migrate data to BigQuery.
Schema migration to BigQuery
Teradata and BigQuery have some differences in data types. This has to be kept in mind when converting schemas. Here are a few conversions that may be needed.
- BLOB, CLOB and BYTE types should be converted to BYTES in BigQuery
- DECIMAL, FLOAT, NUMBER, NUMERIC should be converted to FLOAT in BigQuery
- Teradata supports multiple “INTERVAL” types that can be converted to INTEGER in BigQuery – impact to your SQL and any user-defined functions should be assessed
- CHAR, VARCHAR and other character type fields can be converted to STRING in BigQuery
In addition to changing data types, it is often a good practice to review the schema itself and see if there are areas that can be improved. For example, a common practice in BigQuery is to leave the data in a flat, de-normalized structure. If the source schema follows the star schema design pattern then some amount of flattening it can improve BigQuery performance. Here’s a schema conversion tool we’ve written that automatically creates a BigQuery schema after converting data types and also de-normalizing a typical star schema.
Once the above tool is executed the de-normalized schema is available as shown below. The goal is to produce a wide, column-oriented table. Several transformations are applied to flatten the source schema. These transformations include left outer joins to join dimensions with facts, flattening nested and repeating fields, etc. BigQuery is optimized so that it works better with such flat data. Although it does support joins and a relational/star-schema it is usually a best practice to consider flattening the data given the columnar nature of a database like BigQuery.
Extracting data from Teradata
There are a few different options for importing data from Teradata into Hadoop:
Sqoop with Teradata JDBC driver
Sqoop is a tool designed to transfer data between Hadoop and relational databases or mainframes. You can use Sqoop to import data from a relational database management system (RDBMS) such as Teradata into the Hadoop Distributed File System (HDFS). You can then transform the data in Hadoop MapReduce or run on Google Dataproc, and then export the data back into BigQuery.
Teradata Connector for Hadoop (TDCH)
The Teradata Connector for Hadoop (TDCH) is a map-reduce application that supports high-performance parallel bi-directional data movement between Teradata systems and various Hadoop ecosystem components, including BigQuery.
Moving data to BQ
The graphic below shows a high-level view of the various components involved. For the initial migration of data once data is extracted from Teradata, move the files over to GCS using GSUtil. Then use API to extract data from GCS and write to BigQuery again using API. Most of the heavy lifting in terms of processing is handed over to BigQuery.
Here’s an example of this transformation and ingestion into BigQuery.