Data Vault 2.0 Modeling – A popular choice for the Snowflake Data Platform

The Data Vault System of Business Intelligence modeling approach organizes data in a way that separates structural information, such as a table’s unique identifier or foreign key relationship, from its attributes.  The Data Vault method is a hybrid approach that incorporates the best of the third normal form and the star schema modeling methods. Snowflake supports Data Vault 2.0 and it is possible that you’ll achieve better results on Snowflake using the Data Vault modeling approach rather than other methods.  This is because Data Vault 2.0 can take advantage of Snowflake’s massively parallel processing compute clusters and optimized columnar storage format.  Moreover, with Snowflake, you don’t need to pre-plan partitioning or distribution keys.  You also don’t need to build indexes to obtain good performance results.

The Data Vault model is comprised of three basic table types; hub tables, link tables, and satellite tables.

  • Hub tables hold all unique business keys of a subject.  A hub also includes metadata describing the origin of the business key, known as the record source, which is used to track where and when the data originated.  As an example, HUB_EMPLOYEE may use an employee number to identify a unique employee
  • Link tables track all relationships between hubs essentially, describing a many-to-many relationship.  Links can link to other links.  Links are frequently used to deal with changes in data granularity which reduces the impact of adding a new business key to a linked Hub.
    As an example, LINK_EMPLOYEE_STORE would track the relationship between an employee and the store locations where they work
  • Satellite tables hold any attributes related to a link or hub and update them as they change.  This is similar to a Kimball Type II slowly changing dimension.  Whereas Hubs and Links form the structure of the data model, Satellites contain temporal and descriptive attributes.  These metadata attributes contain the date, the record became valid and the date it expired.  As such, Satellites provide powerful historical capabilities.  As an example, SAT_EMPLOYEE may include attributes like the employee’s first and last name, title, or hire date

The Data Vault model is particularly useful in certain situations.  Because Data Vault makes adding new attributes easy, it provides the most benefit when there are many data source systems or when there are constantly changing relationships.  The Data Vault approach also enables faster data loading because several tables can be loaded in parallel at the same time.  The ingestion process is also simplified due to having only inserts, which load more quickly than merges or upserts.  Data Vault is also a good choice when you need to easily track and audit your data.  Record sources are entered for every row, and load times are included as a part of the primary key in satellites.

To summarize, Data Vault is most useful if you need to load data quickly from multiple source systems, your data relationships change frequently, or you need to easily track and audit your data.  If your data is more straightforward, though, the Kimball dimensional model may be a better choice because implementing Data Vault would require an increased amount of business logic, where it isn’t needed.

It is important to understand that Data Vaults and data lakes are complementary, addressing different parts of the analytics requirements for organizations.  Data lakes provide a persistent staging area, meant to capture a superset of the data that arrives in too much volume, arrives too quickly, or arrives with too much structural variation for traditional relational systems to handle.  Data lakes are a great choice for data science activities.  For integrating the most useful data for business users, a subset of the data from the data lake can be fed to the Data Vault.

The Snowflake architecture, built from the ground up for the cloud, supports many workloads including data warehousing, data lake, data analytics, and data science workloads.

Contact us today at info@springml.com if you’d like to learn more about how to implement Data Vault 2.0 in your Snowflake instance.

Thought Leadership