Snowflake is one of the newer competitors in elastic data warehousing. This blog gives a simple, high-level overview of Snowflake and its architecture. Key features are highlighted but not discussed in detail.
Snowflake’s documentation is the best source for additional information. The best way to learn is to get your hands dirty: Snowflake provides a free 30-day trial with $400 worth of usage for hands-on learning.
What is Snowflake
Snowflake is a comprehensive data platform provided as a Software-as-a-Service (SaaS). Snowflake Cloud Data Platform can support your data warehouses, data lakes, data exchanges, data applications, and data science workloads.
It runs entirely on cloud infrastructure. Unlike traditional single-cluster architectures, Snowflake has a dynamic, highly scalable multi-cluster architecture, provided as a service with per-second pricing. Its ability to scale almost instantly differentiates it from a traditional data warehouse.
What Snowflake isn’t
Snowflake is not a traditional relational database product, although it speaks SQL. It is not built on any existing database technology or on “big data” software platforms such as Hadoop. Nor is it a packaged software offering that you can install and run on private cloud infrastructure (on-premises or hosted).
Snowflake’s uniqueness lies in its independent and scalable storage and compute layers. It is a hybrid of traditional shared-disk and shared-nothing architecture.
Snowflake uses a central storage layer for persisting data that is accessible to all compute nodes. Snowflake also processes queries using Massively Parallel Processing (MPP) clusters where each node stores a portion of the entire data locally.
This combination of shared-disk and shared-nothing architecture offers simplicity with performance and scalability.
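To make the hybrid idea concrete, here is a minimal toy sketch in Python. It is not Snowflake code: `SHARED_STORAGE`, `assign_partitions` and `parallel_sum` are hypothetical names invented for illustration. The dictionary plays the role of the central storage layer that every node can read, while each worker processes only its own slice of the data, which mimics the shared-nothing work split.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the central storage layer: any node can read any partition.
SHARED_STORAGE = {
    "part-0": [10, 20, 30],
    "part-1": [5, 15],
    "part-2": [40],
    "part-3": [1, 2, 3, 4],
}


def assign_partitions(partition_ids, node_count):
    """Give each compute node a disjoint slice of the partitions (round-robin)."""
    return [partition_ids[i::node_count] for i in range(node_count)]


def node_partial_sum(assigned_ids):
    """One node reads only its assigned partitions and aggregates them locally."""
    return sum(sum(SHARED_STORAGE[pid]) for pid in assigned_ids)


def parallel_sum(node_count):
    """Split the work across nodes, run them in parallel, then combine the partial results."""
    partition_ids = sorted(SHARED_STORAGE)
    assignments = assign_partitions(partition_ids, node_count)
    with ThreadPoolExecutor(max_workers=node_count) as pool:
        partials = list(pool.map(node_partial_sum, assignments))
    return sum(partials)


if __name__ == "__main__":
    print(parallel_sum(node_count=2))
```

Changing `node_count` changes how the work is divided but not the answer, because every node reads from the same storage.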
Diagram is courtesy of benstopford.com
Snowflake’s architecture consists of three layers:
- Cloud Services Layer
- Query Processing Layer
- Database Storage Layer
Cloud Services Layer
This layer is referred to as the “Brain” of the system. It is a collection of independent services designed specifically for scalability and high availability. Each of these services is managed completely and transparently by Snowflake while maintaining availability during upgrades and patches.
With the exception of persisted metadata, all services are stateless. All persistence is handled by a robust, scalable, transactional key-value data store that is accessed through an abstract mapping layer.
Services managed in this layer include:
- Infrastructure management
- Metadata management
- Query parsing and optimization
- Access control
Query Processing Layer
This layer is referred to as the “Muscle” of the system. It provides the horsepower that drives the actual query execution across elastic clusters of virtual machines.
Snowflake uses the term “Virtual Warehouse” for what is essentially an MPP compute cluster that executes queries. These clusters scale with demand while accessing the same underlying data. They run independently and without contention, so heavy queries and operations can run simultaneously.
Warehouses can be set to suspend automatically so that they incur no costs while idle. They can also be set to resume automatically when a statement that requires the warehouse is submitted.
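As a rough illustration of the suspend and resume settings, the Python helper below builds a `CREATE WAREHOUSE` statement. The helper `create_warehouse_sql` and the warehouse name `reporting_wh` are hypothetical; `WAREHOUSE_SIZE`, `AUTO_SUSPEND` (in seconds) and `AUTO_RESUME` are warehouse parameters in Snowflake's SQL. Treat it as a sketch to adapt, not a recommended configuration.

```python
def create_warehouse_sql(name, size="XSMALL", auto_suspend_seconds=300, auto_resume=True):
    """Build a CREATE WAREHOUSE statement with auto-suspend and auto-resume settings."""
    # Only allow simple identifiers so the name can be placed in the statement safely.
    if not name.isidentifier():
        raise ValueError(f"unsafe warehouse name: {name!r}")
    if auto_suspend_seconds < 0:
        raise ValueError("auto_suspend_seconds must not be negative")
    return (
        f"CREATE WAREHOUSE IF NOT EXISTS {name} "
        f"WITH WAREHOUSE_SIZE = '{size}' "
        f"AUTO_SUSPEND = {auto_suspend_seconds} "
        f"AUTO_RESUME = {'TRUE' if auto_resume else 'FALSE'}"
    )


if __name__ == "__main__":
    # Hypothetical warehouse that suspends after 5 idle minutes and resumes on demand.
    print(create_warehouse_sql("reporting_wh"))
```

The generated statement can be pasted into the web interface or SnowSQL.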
Scaling up vs. Scaling out
Scaling up increases the compute power of an existing warehouse. Snowflake defines virtual warehouses in T-shirt sizes: X-Small, Small, Medium and so on. Each size provides double the compute of the previous one. Scaling up can be done without stopping the warehouse; queries submitted after the resize take advantage of the additional capacity.
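The doubling rule is easy to express in a few lines of Python. In this sketch `relative_compute` and `scale_up_factor` are hypothetical helper names, X-Small is taken as the baseline of 1, and the size names past Medium come from Snowflake's published size list rather than from the text above.

```python
# Ordered from smallest to largest; each step doubles the compute of the previous one.
WAREHOUSE_SIZES = ["X-Small", "Small", "Medium", "Large", "X-Large"]


def relative_compute(size):
    """Compute capacity relative to X-Small (X-Small = 1)."""
    return 2 ** WAREHOUSE_SIZES.index(size)


def scale_up_factor(current_size, new_size):
    """How many times more compute the new size provides than the current one."""
    return relative_compute(new_size) / relative_compute(current_size)


if __name__ == "__main__":
    for size in WAREHOUSE_SIZES:
        print(size, relative_compute(size))
```

For example, moving from Small to Large is two steps up, so it provides four times the compute.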
Scaling out adds clusters to a warehouse so that it can execute more concurrent queries. Snowflake offers Standard and Economy scaling policies. The Standard policy focuses on minimizing queuing, whereas the Economy policy tries to fully utilize the current clusters before adding another one.
Workloads can be isolated by assigning them to separate warehouses. These warehouses access the same underlying data without competing for compute resources.
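The difference between the two policies can be sketched as a simple decision function. This is a deliberately simplified model, not Snowflake's actual scheduler: `should_add_cluster` and all of its parameters are hypothetical. The policy values mirror the `STANDARD` and `ECONOMY` settings, and the default six-minute threshold is an assumption based on how the Economy policy is commonly described.

```python
def should_add_cluster(policy, queued_queries, estimated_busy_minutes,
                       running_clusters, max_clusters,
                       economy_threshold_minutes=6):
    """Decide whether a multi-cluster warehouse should start one more cluster.

    STANDARD: add a cluster as soon as queries are queuing.
    ECONOMY: add a cluster only if the queued load is expected to keep it busy
    for at least economy_threshold_minutes (an assumed threshold).
    """
    if running_clusters >= max_clusters:
        return False  # already at the configured upper limit
    if policy == "STANDARD":
        return queued_queries > 0
    if policy == "ECONOMY":
        return queued_queries > 0 and estimated_busy_minutes >= economy_threshold_minutes
    raise ValueError(f"unknown scaling policy: {policy!r}")


if __name__ == "__main__":
    # Same load, different policies: two queued queries, about two minutes of work.
    print(should_add_cluster("STANDARD", 2, 2, running_clusters=1, max_clusters=3))
    print(should_add_cluster("ECONOMY", 2, 2, running_clusters=1, max_clusters=3))
```

With a short burst of queued work, the Standard policy favors starting a cluster to reduce queuing, while the Economy policy prefers to let the queries wait.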
Database Storage Layer
Snowflake organizes data into multiple micro-partitions that are internally optimized and compressed. It stores data in a columnar format and manages all aspects of that storage, such as file size, compression, metadata and statistics. Data is encrypted by default in Snowflake.
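One reason per-partition metadata and statistics matter is that a query can skip partitions that cannot contain matching rows. The toy Python model below illustrates that idea; it is an assumption-laden sketch, not Snowflake's storage format. `MicroPartition`, `total_amount` and the sample data are all hypothetical, and the min/max statistics are computed on the fly here rather than stored as metadata.

```python
from dataclasses import dataclass


@dataclass
class MicroPartition:
    name: str
    order_dates: list  # values of one column stored together (columnar layout)
    amounts: list      # values of another column, aligned by position

    def date_range(self):
        """Min/max statistics for the date column (a real system would store these)."""
        return min(self.order_dates), max(self.order_dates)


def total_amount(partitions, start, end):
    """Sum amounts for dates in [start, end], skipping partitions that cannot match."""
    scanned, total = [], 0
    for part in partitions:
        low, high = part.date_range()
        if high < start or low > end:
            continue  # pruned: the statistics prove no row can match
        scanned.append(part.name)
        for day, amount in zip(part.order_dates, part.amounts):
            if start <= day <= end:
                total += amount
    return total, scanned


# ISO date strings compare correctly as plain strings.
PARTITIONS = [
    MicroPartition("p1", ["2024-01-03", "2024-01-20"], [100, 50]),
    MicroPartition("p2", ["2024-02-01", "2024-02-15"], [70, 30]),
    MicroPartition("p3", ["2024-03-05", "2024-03-28"], [20, 10]),
]

if __name__ == "__main__":
    print(total_amount(PARTITIONS, "2024-02-01", "2024-02-28"))
```

A query restricted to February only needs to read `p2`; the other two partitions are skipped based on their date ranges alone.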
The data objects are accessible only through SQL query operations run via Snowflake. Snowflake is built to be a complete SQL database: it has its own query tool and supports role-based security, multi-statement transactions, full DML, windowing functions and everything else expected of a SQL database.
Data is stored in cloud storage and works as a shared-disk model, which keeps data management simple.
Snowflake’s architecture allows quick consolidation of diverse data onto one platform. Semi-structured data can be loaded into a column of the VARIANT data type, which enables querying JSON in a fully relational manner.
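In Snowflake itself, a VARIANT column is queried with path notation such as `src:customer.name`. The Python sketch below only emulates that idea on parsed JSON so the "JSON as rows and columns" behavior is visible; `get_path`, `select_paths` and the sample documents are hypothetical, and a missing key yields `None`, much as a missing path yields NULL in SQL.

```python
import json

# Hypothetical raw JSON documents, one per row, as they might be loaded into a VARIANT column.
RAW_ROWS = [
    '{"customer": {"name": "Ada", "city": "London"}, "total": 42.5}',
    '{"customer": {"name": "Lin"}, "total": 10}',
]


def get_path(document, path):
    """Follow a dotted path through parsed JSON; return None if any key is missing."""
    current = document
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def select_paths(raw_rows, paths):
    """Project each JSON document onto the requested paths, one tuple per row."""
    parsed = [json.loads(row) for row in raw_rows]
    return [tuple(get_path(doc, path) for path in paths) for doc in parsed]


if __name__ == "__main__":
    for row in select_paths(RAW_ROWS, ["customer.name", "customer.city", "total"]):
        print(row)
```

Each document becomes one row and each path becomes one column, even though the second document has no `city` field.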
Snowflake provides different ways to connect to the service:
- A web-based user interface
- A command line client called SnowSQL
- ODBC and JDBC drivers for use from other applications
- Native Python and Spark connectors for developing applications that connect to Snowflake
- Third party connectors
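As an example of the Python route, the sketch below gathers connection settings and runs a trivial query. The environment variable names and both helper functions are hypothetical. `snowflake.connector.connect`, `cursor()`, `execute()` and `fetchone()` come from the Snowflake Connector for Python, which is a third-party package (`snowflake-connector-python`) and is imported only inside the function that needs it.

```python
import os


def connection_params(env=os.environ):
    """Collect connection settings from environment variables (names are hypothetical)."""
    required = {
        "account": "SNOWFLAKE_ACCOUNT",
        "user": "SNOWFLAKE_USER",
        "password": "SNOWFLAKE_PASSWORD",
    }
    missing = [var for var in required.values() if not env.get(var)]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    params = {key: env[var] for key, var in required.items()}
    # The warehouse is optional; without it the user's default warehouse applies.
    if env.get("SNOWFLAKE_WAREHOUSE"):
        params["warehouse"] = env["SNOWFLAKE_WAREHOUSE"]
    return params


def current_version(params):
    """Connect, run SELECT CURRENT_VERSION(), and return the version string."""
    import snowflake.connector  # third-party: pip install snowflake-connector-python

    conn = snowflake.connector.connect(**params)
    try:
        cur = conn.cursor()
        cur.execute("SELECT CURRENT_VERSION()")
        return cur.fetchone()[0]
    finally:
        conn.close()
```

With the variables set and the connector installed, `current_version(connection_params())` should return the version of the Snowflake service the account is running on.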
Since Snowflake runs entirely on cloud infrastructure, a Snowflake account can be created on any of the following cloud platforms:
- Amazon Web Services (AWS)
- Google Cloud Platform (GCP)
- Microsoft Azure (Azure)
A single account can be hosted on only one cloud platform in a single region. Regions available are based on the cloud platform selected.
Snowflake currently offers four editions. An account can belong to only one edition. Each edition provides its own set of features and level of service.
- Standard Edition – introductory level offering, providing full, unlimited access to all of Snowflake’s standard features.
- Enterprise Edition – Standard Edition + additional features designed specifically for the needs of large-scale enterprises and organizations.
- Business Critical Edition, formerly known as Enterprise for Sensitive Data (ESD) – Enterprise Edition + enhanced security and data protection, particularly for PHI data that must comply with HIPAA and HITRUST CSF requirements.
- Virtual Private Snowflake (VPS) – Business Critical Edition + completely separate Snowflake environment, isolated from all other Snowflake accounts.
Snowflake offers several key features, such as security, data protection, caching, data cloning, data sharing and ACID compliance, that make it a unique cloud platform for storing and retrieving data. Snowflake is also committed to continual innovation, delivering improvements in the form of new features, enhancements and fixes.
Snowflake’s unique architecture and features like zero-copy data cloning, dynamic caching, and data sharing provide tremendous flexibility and scalability, making it a formidable competitor in the elastic data warehousing space.