Continuing our series “Leveraging the Power of Snowflake”, in this blog we walk through how easy it is to load data from Google Cloud Storage into Snowflake. If you missed our first blog, Leveraging the Power of Snowflake: Key Benefits and Architecture, check it out; in it we discuss Snowflake’s unique architecture and features like zero-copy data cloning, dynamic caching, and data sharing. If you are all caught up, let’s get started!
Setting Up Snowflake
To get the most out of this walkthrough, take advantage of Snowflake’s free trial and set up an account. When creating your account, select the Enterprise edition of Snowflake on Google Cloud Platform.
When you create a Snowflake account, you are probably asking: if I have a GCP project, can I associate it with my Snowflake account? Is data in my GCP account accessible via Snowflake, or vice versa? And how do I load data from a GCP project into Snowflake? Let’s dive right in and answer all of those questions.
While creating an account, Snowflake provisions a separate GCP (or whichever cloud platform you pick) account and manages the cloud provider resources under the hood. You do not have direct access to that associated cloud provider account. Therefore, to move data from your GCP account, Snowflake provides the following options.
Options for Data Loading
Data can be loaded into Snowflake in various ways. The web interface provides a way to load data manually, and SQL statements can be used to load data programmatically. The COPY INTO <table> command supports three ways of copying data:
- Loading data from an internal stage
- Loading data from a named external stage
- Loading data from an external location
Data from GCS can be loaded using either of the last two options. While both options make use of a storage integration, Snowflake recommends creating an external stage that references the bucket. Keep reading to follow SpringML’s recommended method.
Storage Integrations allow Snowflake to read/write data from/to Google Cloud Storage.
Integrations are named, first-class Snowflake objects that avoid the need for passing explicit cloud provider credentials such as secret keys or access tokens. Instead, they reference a Cloud Storage service account.
- An external stage references a storage integration object in its definition.
- Snowflake automatically associates the storage integration with a Cloud Storage service account created for your account. Snowflake creates a single service account that is referenced by all GCS storage integrations in your Snowflake account.
- Many external stage objects can reference different buckets and paths and use the same integration for authentication.
Creating External Storage
In order to load data from a GCS bucket into Snowflake, the following steps are required:
1. Configure Storage Integration in Snowflake
The CREATE STORAGE INTEGRATION command is used to create the integration. Only users with the ACCOUNTADMIN role, or a role with the global CREATE INTEGRATION privilege, can execute this SQL command.
create or replace storage integration gcs_int
type = external_stage
storage_provider = gcs
enabled = true
storage_allowed_locations = ('gcs://<bucket/path>');
Use the STORAGE_BLOCKED_LOCATIONS parameter to restrict access to specific buckets or paths. STORAGE_ALLOWED_LOCATIONS can point to multiple buckets/paths as a comma-separated list.
*Note that the location is prefixed with 'gcs://' instead of the typical 'gs://' used on Google Cloud Platform. Follow the same pattern below when creating the external stage via SQL or the web interface.
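As a sketch (the bucket and path names here are hypothetical), an integration that allows two locations while blocking a sensitive sub-path might look like:

```sql
-- Hypothetical bucket names; note the 'gcs://' prefix on every location
create or replace storage integration gcs_int_multi
  type = external_stage
  storage_provider = gcs
  enabled = true
  storage_allowed_locations = ('gcs://raw_bucket/data/', 'gcs://staging_bucket/')
  storage_blocked_locations = ('gcs://raw_bucket/data/restricted/');
```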
2. Retrieve Cloud Storage service account
Snowflake creates a Cloud Storage service account as part of the previous step. A project editor in your GCP project must grant this service account access to the bucket. To retrieve the service account name, run the following command:
desc storage integration gcs_int;
3. Grant service account permissions in Google Cloud
The service account needs the following permissions, depending on the tasks it performs:
- Data loading only: storage.objects.get, storage.objects.list
- Data loading with purge option: storage.objects.get, storage.objects.list, storage.objects.delete
- Data loading and unloading: storage.objects.get, storage.objects.list, storage.objects.create, storage.objects.delete
4. Create an External Stage
Since the integration was created with the ACCOUNTADMIN role, usage permission on it should be granted to other roles before they can use the integration to create an external stage.
grant usage on integration gcs_int to role sysadmin;
External stages can be created via a SQL command or from the UI.
use schema mydb.public;
create stage my_gcs_stage
url = 'gcs://<bucket/path>'
storage_integration = gcs_int
file_format = my_csv_format;
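Note that the stage definition references a named file format, which must exist before the stage is created. A minimal CSV format (shown here with assumed delimiter and header settings; adjust to match your files) might look like:

```sql
-- Assumes comma-delimited files with a single header row
create or replace file format my_csv_format
  type = csv
  field_delimiter = ','
  skip_header = 1;
```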
Loading Data using External Stage
Loading data requires a virtual warehouse. The context can be set manually in the Worksheet or by running USE <object> commands.
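For example, the session context could be set with a series of USE commands (the warehouse and database names here are illustrative):

```sql
-- Pick the compute and namespace used for the load
use warehouse my_wh;
use database mydb;
use schema public;
```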
Running the LIST command returns the list of objects in the external stage.
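For the stage created above, the command would be:

```sql
-- Lists the files visible through the stage, with their size and checksum
list @my_gcs_stage;
```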
Multiple files can be loaded by using pattern matching:
copy into mytable from @my_gcs_stage pattern = '.*[.]csv';
Create the table with the required columns, then run the COPY INTO command to load data from the external stage:
create or replace table mytable (col1 string, col2 string);
copy into mytable
from @my_gcs_stage
file_format = (format_name = my_csv_format);
Monitoring Data Loads
Snowflake’s History page in the web interface shows the status of each COPY INTO command. The INFORMATION_SCHEMA in each database also contains a LOAD_HISTORY view that lists the data loaded into tables.
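As a sketch (the database and table names are illustrative), recent loads can be inspected with a query such as:

```sql
-- Recent load history for the table loaded above
select file_name, last_load_time, row_count, status
from mydb.information_schema.load_history
where table_name = 'MYTABLE'
order by last_load_time desc;
```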
Data from multiple GCS buckets can be loaded into one Snowflake account. Snowflake creates a single service account that is referenced by all GCS storage integrations in your Snowflake account. This makes it easy to provide permissions to the service account in your GCP projects.
Bulk loading data from GCS into Snowflake can be accomplished easily by configuring a storage integration and creating an external stage. Since all of the steps can be implemented via SQL, the process can also be automated with a script. Snowflake additionally offers a service called Snowpipe that loads data in near-real time. In our next blog in this series, “Leveraging the Power of Snowflake”, we explore data loading using Snowpipe.