Legions of data fanatics streamed into San Francisco, in person and online, to enjoy mounds of fresh powder in the form of essential enhancements and critical features courtesy of Snowflake (thank you for indulging my snow-themed introduction). Snowday took place on Monday, November 7th this year, wrapping up the San Francisco portion of Snowflake’s 2022 Data Cloud World Tour.
Cross-cloud Functionality – Snowgrid
Organizations across the globe require the power and flexibility to access and integrate their data, across borders of every type, while maintaining data governance policies across a patchwork of regulatory requirements. Snowgrid was built so that organizations can operate within an interconnected cloud layer.
Functionally, this means that each Snowflake region is not a separate instantiation of the data system. Snowflake runs across the three major cloud providers (AWS, Azure, and Google Cloud) in the same way it runs across the 35+ cloud regions it supports. Snowgrid enables organizations to spread all aspects of data management (e.g., cost, compliance, and risk) across both clouds and regions, which practically eliminates any chance of service interruption. Let’s take a closer look at three exciting Snowgrid features.
- Cross-Cloud Business Continuity can be thought of as a virtual global data center. With Snowgrid, organizations can replicate and synchronize accounts, databases, and even data pipelines, since users are able to replicate both streams and tasks.
- Cross-Cloud Data Governance empowers organizations to enforce governance policy consistently. One method demonstrated was tag-based data masking, which lets Data Administrators harness the power and efficiency of automation across large data sets. Data Administrators can now create a data masking policy and assign the policy to a tag. Going forward, any column marked with the tag is also assigned the data masking policy!
- Cross-Cloud Collaboration means organizations using Snowgrid can share live data without any ETL, across all cloud platforms. Snowflake introduced an additional enhancement, Listing discovery controls. Listings provide organizations with automatic fulfillment and usage metrics, regardless of which cloud or region the client’s account exists in. Organizations are able to discover and access far more, such as rich metadata that includes detailed descriptions, examples, and use cases, while preserving data privacy through robust data governance!
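The tag-based masking flow described in the Data Governance bullet (create a policy, assign it to a tag, tag the columns) can be pictured with a small toy model. This is only a sketch in plain Python and not a Snowflake API: the `Catalog` class, the `pii` tag, the role names, and the `mask_unless_admin` policy are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

# A masking policy is modelled as a function from (value, role)
# to the value that role is allowed to see.
MaskingPolicy = Callable[[str, str], str]


def mask_unless_admin(value: str, role: str) -> str:
    """Hypothetical policy: only the DATA_ADMIN role sees the raw value."""
    return value if role == "DATA_ADMIN" else "***MASKED***"


@dataclass
class Catalog:
    """Toy stand-in for governance metadata (assumption, not Snowflake code)."""
    tag_policies: Dict[str, MaskingPolicy] = field(default_factory=dict)
    column_tags: Dict[str, str] = field(default_factory=dict)

    def assign_policy_to_tag(self, tag: str, policy: MaskingPolicy) -> None:
        self.tag_policies[tag] = policy

    def tag_column(self, column: str, tag: str) -> None:
        self.column_tags[column] = tag

    def read(self, row: Dict[str, str], role: str) -> Dict[str, str]:
        """Return the row as the given role would see it."""
        visible = {}
        for column, value in row.items():
            tag = self.column_tags.get(column)
            policy = self.tag_policies.get(tag) if tag is not None else None
            visible[column] = policy(value, role) if policy else value
        return visible


catalog = Catalog()
catalog.assign_policy_to_tag("pii", mask_unless_admin)  # policy -> tag, once
catalog.tag_column("email", "pii")                      # tag -> column

row = {"name": "Ada", "email": "ada@example.com", "phone": "555-0100"}
analyst_view = catalog.read(row, role="ANALYST")

# Tagging another column later picks up the policy with no new policy work.
catalog.tag_column("phone", "pii")
analyst_view_after = catalog.read(row, role="ANALYST")
admin_view = catalog.read(row, role="DATA_ADMIN")
```

The point of the design is that the policy is attached once, to the tag, so protecting a new column is a tagging step rather than a new policy assignment.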
To date, tens of thousands of applications have been created to handle organizations’ sometimes unique, but often redundant or overlapping, needs that arise from siloed data or other functional roadblocks. Both this session and the Data Science and Machine Learning session featured the numerous capabilities available through Snowpark for Python, along with additional enhancements and capabilities such as Streamlit (acquired by Snowflake), Unistore, and Snowpark Optimized Warehouses.
Snowpark for Python, currently in public preview across all three supported clouds, includes a highly secure Python sandbox that gives developers Snowflake’s well-known benefits of scalability, elasticity, security, and compliance while eliminating data security and compliance roadblocks. Snowpark for Python’s partnership with Anaconda provides a curated set of libraries that makes it easier to maintain governance consistency and programming confidence.
Streamlit, which Snowflake acquired in March 2022, was quickly recognized as a powerful tool for building applications in Python with its open-source framework. Developers can easily create interactive applications to illustrate the meaning and impact of their data and Machine Learning models, all within the Snowflake environment.
Unistore was introduced as a new workload that addresses the problem of running analysis on transactional data. Unistore combines row-based and column-based storage, which allows developers to write applications that access the data in whichever format is most valuable to them at that time.
Unistore is powered by hybrid tables (currently in preview), a new type of table with fine-grained reads and writes and great analytical query performance.
According to Snowflake, a hybrid table allows fast single-row transactions to be performed on data. Notably, a hybrid table is simply another Snowflake table type. As such, it integrates seamlessly with an organization’s existing data and tables.
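To make the idea of combining row-based and column-based storage concrete, here is a toy sketch in plain Python. It is not how Snowflake implements hybrid tables; the `ToyHybridTable` class, the `orders` data, and the column names are assumptions made purely for illustration. The same rows are kept in a keyed row store (cheap single-row reads and writes) and in per-column lists (cheap scans over one column).

```python
from typing import Any, Dict, List


class ToyHybridTable:
    """Illustrative only: one logical table held in two layouts."""

    def __init__(self, columns: List[str], key: str) -> None:
        self.columns = columns
        self.key = key
        # Row-oriented layout: primary key -> whole row.
        self.rows: Dict[Any, Dict[str, Any]] = {}
        # Column-oriented layout: column name -> list of values.
        self.column_data: Dict[str, List[Any]] = {c: [] for c in columns}
        # Primary key -> position of that row inside the column lists.
        self.positions: Dict[Any, int] = {}

    def upsert(self, row: Dict[str, Any]) -> None:
        """Single-row write that keeps both layouts in step."""
        k = row[self.key]
        if k in self.positions:
            pos = self.positions[k]
            for c in self.columns:
                self.column_data[c][pos] = row[c]
        else:
            self.positions[k] = len(self.column_data[self.key])
            for c in self.columns:
                self.column_data[c].append(row[c])
        self.rows[k] = dict(row)

    def get(self, k: Any) -> Dict[str, Any]:
        """Point lookup served from the row layout."""
        return self.rows[k]

    def column_sum(self, column: str) -> float:
        """Analytical scan served from the column layout."""
        return sum(self.column_data[column])


orders = ToyHybridTable(["order_id", "amount"], key="order_id")
orders.upsert({"order_id": 1, "amount": 20.0})
orders.upsert({"order_id": 2, "amount": 35.0})
orders.upsert({"order_id": 1, "amount": 25.0})  # update an existing row

point_lookup = orders.get(1)
total_amount = orders.column_sum("amount")
```

The transactional path (`upsert`, `get`) and the analytical path (`column_sum`) work against the same data, which is the property the Unistore description emphasizes.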
Snowflake is also releasing Snowpark Optimized Warehouses (public preview in AWS). Snowpark Optimized Warehouses empower Python developers to run large-scale Machine Learning training and other memory-intensive operations directly in Snowflake.
Data Science & Machine Learning – Snowpark
Recent innovations related to data science and Machine Learning are found in Snowpark for Python and Snowpark Optimized Warehouses.
Many development teams are all too familiar with the ‘Development to Production Chasm’ produced by siloed teams and tools, separate infrastructures, and numerous copies of data. All of this results in an enormous volume of undifferentiated and duplicate work (e.g., developing training models) within an environment that uses large amounts of data across numerous sources.
These issues motivated Snowflake to build Snowpark for Python, which gives data engineering operations, data developers, and data science teams the ability to generate insights without complex infrastructure management for separate languages.
Snowpark for Python provides access to popular libraries for use inside Snowflake, including Prophet, PyNomaly, PyPDF2, h3, Gensim, datasketch, email-validator, and tzdata, in addition to support for Python runtimes 3.9+. Data Scientists can perform end-to-end Machine Learning inside Snowflake and utilize CI/CD integrations to bring Snowpark into Git Flow.
Snowpark Optimized Warehouses boast significant memory and performance improvements. Each node of a Snowpark Optimized Warehouse provides 16x the memory and 10x the cache when compared to a standard warehouse. These significant enhancements enable Data Scientists and other data teams to utilize the Machine Learning training capabilities of Snowflake.
Snowpark Optimized Warehouses include all of the features and capabilities of a standard virtual warehouse, but they can also streamline Machine Learning pipelines: their compute infrastructure (enhanced memory and cache) handles memory-intensive operations (e.g., statistical analysis) as well as model training and inference, all within Snowflake and at scale!
Data Engineering – Snowpipe Streaming
Before presenting the exciting details of Snowpipe Streaming, let’s take a step back and recall a useful definition of streaming:
“The practice of capturing data in real-time from event sources like databases, sensors, mobile devices, cloud services, and software applications in the form of streams of events. Event streaming ensures a continuous flow and interpretation of data.”
Snowpipe Streaming is an enhanced version of Snowpipe that allows for close to real-time data ingestion (with very little configuration required) at a row-by-row level. To put numbers on this: Snowpipe has an average latency of 25 seconds, while Snowpipe Streaming has an average latency of 1-2 seconds!
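One way to see why row-by-row ingestion lowers latency is a toy model, sketched here in plain Python. It assumes a producer that buffers rows and hands them over in a batch every `flush_interval` seconds, compared with a path where each row becomes visible after a fixed `per_row_delay`. Both parameters, the function names, and the one-row-per-second arrival pattern are hypothetical; the model is not meant to reproduce the 25-second or 1-2 second figures quoted above.

```python
import math
from statistics import mean
from typing import List


def batch_latencies(arrivals: List[int], flush_interval: int) -> List[int]:
    """Each row waits until the first flush at or after its arrival time."""
    return [
        math.ceil(t / flush_interval) * flush_interval - t
        for t in arrivals
    ]


def row_latencies(arrivals: List[int], per_row_delay: float) -> List[float]:
    """Each row becomes visible a fixed delay after it arrives."""
    return [per_row_delay for _ in arrivals]


# One row per second for a minute (assumed arrival pattern).
arrivals = list(range(1, 61))

batch_avg = mean(batch_latencies(arrivals, flush_interval=30))
stream_avg = mean(row_latencies(arrivals, per_row_delay=1.5))

print(f"batched:    {batch_avg:.1f}s average wait")
print(f"row-by-row: {stream_avg:.1f}s average wait")
```

In this model a batched row waits, on average, about half the flush interval before it can be queried, while a row-level path pays only its own fixed delay regardless of when the row arrives.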
Snowpipe Streaming is available across different languages and formats. All performance is supported by elastic compute and there is no need to think about scaling infrastructure while considering data influx and outflux.
Snowflake also highlighted Dynamic Tables (private preview), formerly referred to as Materialized Tables. They can be understood as declarative table pipelines that bridge the disconnect between streaming and batch pipelines. They automate incremental processing and are native to Snowflake, and thus can be shared across all Snowflake accounts securely and in accordance with governance policies.
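The phrase “automate incremental processing” can be illustrated with a small sketch in plain Python. This is a toy model rather than Snowflake behavior: it assumes an append-only source, and the `ToyIncrementalTotals` class, the `full_recompute` helper, and the region and amount data are hypothetical. The derived result is declared once (totals per key), and each refresh folds in only the rows that arrived since the previous refresh.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, List, Tuple

Row = Tuple[str, float]  # (region, amount); assumed shape for illustration


def full_recompute(source: List[Row]) -> Dict[str, float]:
    """Batch-style baseline: rescan every source row on each run."""
    totals: DefaultDict[str, float] = defaultdict(float)
    for region, amount in source:
        totals[region] += amount
    return dict(totals)


class ToyIncrementalTotals:
    """Keeps per-region totals up to date by processing only new rows."""

    def __init__(self) -> None:
        self.totals: DefaultDict[str, float] = defaultdict(float)
        self.processed = 0  # number of source rows already folded in

    def refresh(self, source: List[Row]) -> int:
        """Fold in rows appended since the last refresh; return how many."""
        new_rows = source[self.processed:]
        for region, amount in new_rows:
            self.totals[region] += amount
        self.processed = len(source)
        return len(new_rows)


source: List[Row] = [("east", 10.0), ("west", 5.0)]
view = ToyIncrementalTotals()

first_refresh = view.refresh(source)   # processes both initial rows
source.append(("east", 2.5))
second_refresh = view.refresh(source)  # processes only the new row

incremental_result = dict(view.totals)
batch_result = full_recompute(source)
```

Both paths end with the same totals, but the incremental refresh touches only the new row on the second pass, which is the kind of work a declarative pipeline can take off the developer’s hands.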
This wraps up the highlights from Snowday 2022. Check back soon to see my posts after attending Snowflake’s Build: The Data Cloud Dev Summit from November 15 – 16, 2022.