We are all aware of the advancements Snowflake has made in big data and data warehousing. In addition, Snowflake has launched multiple new initiatives to promote its concept of "application disruption" and help monetize applications. You can go through one of my blogs (SpringML's takeaways from the Snowflake Summit 2022), which covers the exciting new features Snowflake unveiled at this year's summit.
In this blog, I wanted to talk about Snowpark, one of the new features that Snowflake is extensively developing. Snowpark has intrigued me a lot since I came into the world of Snowflake. Snowflake continuously enhances its data programmability, providing more flexibility and greater extensibility for users across programming languages and models. In this process, they have developed a unique feature that doesn’t exist for other cloud competitors. Using Snowpark, you can leverage Snowflake’s data cloud capabilities on the fly inside your notebook environment.
On top of that, it ensures data security by making sure the data never leaves the security boundaries of Snowflake. We have always used connectors such as ODBC, JDBC, and the Python connector to pull data out of databases and perform computation on it. But now, by using Snowpark, we don't have to worry about compute, and it's faster! It's cool, isn't it?
Snowpark for Data Science
We have talked a lot about what Snowpark is and how it can help us as data engineers or scientists in our daily lives. However, unless we know when to use Snowpark and how it can benefit us, we won’t be able to utilize it to our full extent. So, to provide a little help, I would like to focus on how Snowpark can be used in data science and go deeper into each use case.
Snowpark can be used in three different steps of a typical data science workflow –
- Data Collection
- Feature Engineering and Model Creation
- ML Ops
Data Collection
We know Snowflake's power to store petabyte-scale data in its data warehouse. The problem arises when we have several complex data systems at our disposal, but the compute is still not powerful enough to handle that much data. With Snowpark's ability to tap Snowflake's data cloud, we can now read large amounts of data in our preferred language without worrying about memory consumption. It also gives us the flexibility to focus on the logic rather than the operational overhead of setting up big data frameworks.
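To make this concrete, here is a minimal sketch of data collection with Snowpark for Python. The connection parameters, the ORDERS table, and the ORDER_DATE column are all placeholders I have invented for illustration; the key point is that the DataFrame is lazy, so the filter is pushed down and executed on Snowflake's compute rather than in the notebook.

```python
# Sketch: lazy data collection with Snowpark for Python.
# All connection values and the ORDERS table are illustrative placeholders.
from snowflake.snowpark import Session

connection_parameters = {
    "account": "<account_identifier>",
    "user": "<user>",
    "password": "<password>",
    "warehouse": "<warehouse>",
    "database": "<database>",
    "schema": "<schema>",
}

session = Session.builder.configs(connection_parameters).create()

orders = session.table("ORDERS")  # no data is pulled yet
recent = orders.filter(orders["ORDER_DATE"] >= "2022-01-01")
recent.show(10)  # the filter runs inside Snowflake, not on our machine
```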
Feature Engineering and Model Creation
Now that reading the data is taken care of, feature engineering is the next step in a data science process. We can all agree that most data science work happens in Python; I have yet to meet a data scientist who doesn't use it. However, there is a massive gap between two essential technical competencies: the art of analyzing data, engineering features, and creating models, and the art of efficiently operating a large-scale data framework. Because of this gap, I have seen many machine learning and data science projects take months, even years, to complete. Snowpark bridges it by integrating the expressiveness of Python with the distributed processing power of Snowflake, letting data scientists focus on developing new features, doing EDA, and so on using pandas, NumPy, and other familiar Python packages.
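For example, once a Snowpark DataFrame has been pulled into pandas (e.g. via its to_pandas() method), feature engineering looks like any ordinary pandas workflow. The ORDERS columns and derived features below are invented for illustration:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the result of snowpark_df.to_pandas(); columns are invented.
orders = pd.DataFrame({
    "ORDER_ID": [1, 2, 3, 4],
    "AMOUNT": [120.0, 80.0, 200.0, 40.0],
    "QUANTITY": [2, 1, 4, 1],
})

# Derive features: per-unit price, and a log-scaled amount to tame skew.
orders["UNIT_PRICE"] = orders["AMOUNT"] / orders["QUANTITY"]
orders["LOG_AMOUNT"] = np.log1p(orders["AMOUNT"])

print(orders[["ORDER_ID", "UNIT_PRICE", "LOG_AMOUNT"]])
```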
As Snowpark's Python support is in public preview, we can also use any machine learning library (scikit-learn, PyTorch, TensorFlow, etc.) to solve various use cases without worrying about the underlying infrastructure demands.
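Under that model, training itself is plain scikit-learn. The tiny synthetic dataset below only shows the shape of the code; it stands in for features that would come from a Snowflake table:

```python
from sklearn.linear_model import LogisticRegression

# Tiny, linearly separable toy data standing in for features read via Snowpark.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]

model = LogisticRegression()
model.fit(X, y)

# Score two unseen points on either side of the boundary.
preds = model.predict([[0.5], [2.5]]).tolist()
print(preds)
```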
ML Ops
This step is an essential part of the machine learning or data science workflow, and it is where most organizations struggle to bring their models into production securely. Running inference at scale has traditionally meant moving data out of the governed warehouse environment and into a less controlled one. That large-scale data movement carries cost and security implications that keep models from ever getting deployed to production.
Hence, Snowpark solves this complexity by reversing the direction of flow: instead of the data coming to the models, Snowpark brings the models to the data. This is made possible by UDFs (Java today, Python in the future) that can wrap models written in Java or Python and run inference directly inside Snowflake.
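As a sketch of that pattern, the scoring logic lives in an ordinary Python function, which Snowpark can register as a UDF so it executes next to the data. The function name, threshold, and column names here are assumptions for illustration; the registration call is shown commented out because it needs a live session:

```python
# Hypothetical scoring function; the 0.5 threshold is an assumption.
def score(feature: float) -> int:
    """Classify a single feature value; runs inside Snowflake once registered."""
    return 1 if feature > 0.5 else 0

# With a live Snowpark session, the same function becomes a server-side UDF,
# so the data never leaves Snowflake:
#
#   from snowflake.snowpark.types import FloatType, IntegerType
#   score_udf = session.udf.register(
#       score, return_type=IntegerType(), input_types=[FloatType()], name="SCORE"
#   )
#   df.select(score_udf(df["FEATURE"])).show()

print(score(0.7), score(0.2))
```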
Together, reading data, building models, and running inference make up a data science workflow. However, as data scientists, we know the trouble we face beyond these three steps. Building the best model takes a lot of experimentation (the art of data science): finding the best algorithm, tuning hyperparameters, balancing the bias-variance tradeoff, and so on. These techniques are unfamiliar to many data analysts and business experts who nonetheless have the deep domain knowledge that is so important to a data science problem.
Therefore, Snowflake partnered with AWS to democratize machine learning by bringing in the power of Amazon SageMaker. Joint Snowflake and AWS customers can now build, train, and test ML models natively from Snowflake using standard SQL. SageMaker uses the tabular data already present in Snowflake to automatically build, train, tune, and test ML models.
In conclusion, Snowpark gives data engineers, data scientists, and data analysts the true power of big data with minimal effort. In addition, with Snowflake's vast array of ML partners such as H2O.ai, Dataiku, and DataRobot, teams can minimize operational overhead and accelerate time to production for critical business machine learning problems.