Leveraging the Power of Snowflake: Bulk Loading from Google Cloud Storage

Continuing our series “Leveraging the Power of Snowflake”, in this blog we walk through how easy it is to load data from Google Cloud Storage into Snowflake. If you missed our first blog, Leveraging the Power of Snowflake: Key Benefits and Architecture, check it out: we discuss Snowflake’s unique architecture and features like zero-copy data cloning, dynamic caching, and data sharing. If you are all caught up, let’s get started!

Setting Up Snowflake

To get the most out of this walkthrough, take advantage of Snowflake’s free trial and set up an account. When creating your account, select the Enterprise edition of Snowflake on Google Cloud Platform.

When you create a Snowflake account, you are probably asking: if I have a GCP project, can I associate it with my Snowflake account? Is data in my GCP account accessible via Snowflake, or vice versa? And how do you load data from a GCP project into Snowflake? Let’s dive right in and answer all of those questions.

While creating an account, Snowflake creates a separate GCP (or whichever platform you pick) account and manages the cloud provider resources under the hood. You do not have direct access to the associated cloud provider account. Therefore, to move data from your GCP account into Snowflake, Snowflake provides the following options.

Options for Data Loading

Data can be loaded into Snowflake in various ways. The web interface provides a way to load data manually, and SQL queries can be used as well. The COPY INTO <table> command supports three ways of copying data:

  1. Loading data from an internal stage
  2. Loading data from a named external stage
  3. Loading data from an external location

Data from GCS can be loaded using either of the last two options. While both options make use of a storage integration, Snowflake recommends creating an external stage that references the bucket. Keep reading to follow SpringML’s recommended method.

Storage Integrations

Storage Integrations allow Snowflake to read/write data from/to Google Cloud Storage.

Integrations are named, first-class Snowflake objects that avoid the need for passing explicit cloud provider credentials such as secret keys or access tokens. Instead, they reference a Cloud Storage service account.

Cloud Storage service account
  1. An external stage references a storage integration object in its definition.
  2. Snowflake automatically associates the storage integration with a Cloud Storage service account created for your account. Snowflake creates a single service account that is referenced by all GCS storage integrations in your Snowflake account.
  3. Many external stage objects can reference different buckets and paths and use the same integration for authentication.

Creating an External Stage

To load data from a GCS bucket into Snowflake, the following steps are required:

1. Configure a Storage Integration in Snowflake

The CREATE STORAGE INTEGRATION command creates the integration. Only users with the ACCOUNTADMIN role, or a role with the global CREATE INTEGRATION privilege, can execute this SQL command.

create or replace storage integration gcs_int
type = external_stage
storage_provider = gcs
enabled = true
storage_allowed_locations = ('gcs://<bucket/path>');

Use the STORAGE_BLOCKED_LOCATIONS parameter to restrict access to specific buckets or paths. STORAGE_ALLOWED_LOCATIONS can point to multiple buckets/paths as a comma-separated list.

*Note that the location is prefixed with ‘gcs://’ instead of the typical ‘gs://’ used on Google Cloud Platform. Follow the same pattern below when creating the external stage via SQL or the web UI.
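For example, here is a sketch of an integration that allows two locations while blocking a sensitive path; the bucket names are hypothetical, so substitute your own:

create or replace storage integration gcs_int
type = external_stage
storage_provider = gcs
enabled = true
-- allow two buckets/paths as a comma-separated list
storage_allowed_locations = ('gcs://sales-bucket/data/', 'gcs://finance-bucket/')
-- block a sensitive path inside an otherwise allowed bucket
storage_blocked_locations = ('gcs://finance-bucket/confidential/');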

2. Retrieve the Cloud Storage service account

Snowflake creates a service account as part of the previous step. A project editor for Cloud Storage should grant this service account access to the bucket. To retrieve the service account, run the following command:

desc storage integration gcs_int;

The STORAGE_GCP_SERVICE_ACCOUNT property in the query output contains the service account name to grant access to.

3. Grant the service account permissions in Google Cloud

The service account needs the following permissions, depending on the tasks it will perform (one way to grant them is sketched after these lists):

Data loading only

  • storage.buckets.get
  • storage.objects.get
  • storage.objects.list

Data loading with purge option

  • storage.buckets.get
  • storage.objects.delete
  • storage.objects.get
  • storage.objects.list

Data loading and unloading

  • storage.buckets.get
  • storage.objects.create
  • storage.objects.delete
  • storage.objects.get
  • storage.objects.list
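As one way to grant these, here is a sketch using the gcloud and gsutil CLIs for the data-loading-only case. The project ID (my-project), bucket (my-bucket), and service account address are hypothetical, so substitute the values from your own environment and the service account retrieved in step 2:

# Create a custom IAM role containing the data loading permissions
gcloud iam roles create snowflake_loader --project=my-project \
  --permissions=storage.buckets.get,storage.objects.get,storage.objects.list

# Bind the custom role to the Snowflake service account on the bucket
# (note that gsutil uses the regular gs:// prefix, not gcs://)
gsutil iam ch \
  serviceAccount:[email protected]:projects/my-project/roles/snowflake_loader \
  gs://my-bucket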

4. Create an External Stage

Since the integration was created with the ACCOUNTADMIN role, usage permission must be granted to other roles before they can use the integration to create an external stage.

grant usage on integration gcs_int to role sysadmin;

External stages can be created via a SQL command or from the web UI.
SQL command:

use schema mydb.public;
create stage my_gcs_stage
url = 'gcs://<bucket/path>'
storage_integration = gcs_int
file_format = my_csv_format;
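The stage above references a named file format. As a minimal sketch, assuming comma-delimited files with a header row, my_csv_format could be defined as:

create or replace file format my_csv_format
type = csv
field_delimiter = ','
-- skip the header row in each file
skip_header = 1;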

UI:

[Screenshots: the Create Stage option and dialog in the Snowflake web UI]

Loading Data Using the External Stage

Loading data requires a virtual warehouse. The context can be set manually in the Worksheet or by running USE <object> commands.
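For example, with a hypothetical warehouse named my_wh and the database and schema created earlier:

use warehouse my_wh;
use schema mydb.public;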

Running the LIST command returns the list of objects in the external stage:

list @my_gcs_stage;

Multiple files can be loaded by using pattern matching:

copy into mytable
from @my_gcs_stage
pattern='.*sales.*.csv';

Create the table with the required columns, then run the COPY INTO command to load from my_gcs_stage:

create or replace table mytable (col1 string, col2 string);
copy into mytable
from @my_gcs_stage
file_format = (format_name = my_csv_format);
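COPY INTO also supports a dry run through the VALIDATION_MODE parameter, which checks the staged files for errors without loading any data:

copy into mytable
from @my_gcs_stage
file_format = (format_name = my_csv_format)
-- report all errors found instead of loading rows
validation_mode = 'RETURN_ERRORS';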

Monitoring Data Loads

The History page in the Snowflake web interface provides the status of each COPY INTO command. The Information Schema in each database also contains a LOAD_HISTORY view to retrieve the list of data loaded into tables.
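For example, a quick sketch querying the view for recent loads into mytable (note that Snowflake stores unquoted identifiers in uppercase):

select file_name, status, row_count, last_load_time
from mydb.information_schema.load_history
where table_name = 'MYTABLE'
order by last_load_time desc;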

Summary

Data from multiple GCS buckets can be loaded into one Snowflake account. Snowflake creates a single service account that is referenced by all GCS storage integrations in your Snowflake account. This makes it easy to provide permissions to the service account in your GCP projects.

Bulk loading data from GCS into Snowflake can be accomplished easily by configuring a storage integration and creating an external stage. Since all the steps can be implemented via SQL, creating an automated script is also a possibility. Snowflake also offers a service called Snowpipe for loading data in near real time. In our next blog in the series “Leveraging the Power of Snowflake”, we explore data loading using Snowpipe.

For more information on how SpringML can help you implement Snowflake, email us at [email protected] or fill out the contact form.