In mid-November 2020, Snowflake held its Data Cloud Summit conference. Naturally, it was virtual this year. There were 40+ sessions divided into several tracks covering Migration into Snowflake, Modernization of Data Lake, Analytics and ML, Data Apps, an Industry solution spotlight, a bunch of sessions with Snowflake Data Heroes about mobilizing your data and, last but not least, the keynote of the day and a couple more "headline" sessions. All in all, it was a pretty packed day with a lot of interesting sessions.
In this post I would like to provide my summary of and view on the recent Snowflake announcements, my general feeling about the platform, and tips for the really cool sessions I liked the most.
I haven’t seen all the sessions yet and I probably won’t. I’ve just covered what was interesting for me: migrations, data lakes, ML, success stories. All in all, I have probably seen around 70% of the available content. Let’s start with the major announcements and new features.
New features were announced in the keynote called What’s Next in the Data Cloud. There were several "big" features with live demos, but also small reveals mentioned just in passing. For instance, Snowflake is preparing a serverless execution model for tasks and streams which can automatically size and manage resources. Did you catch that? And now, here are the biggest new features.
Snowpark is a new feature which lets developers write code in their language of choice (Java, Scala, Python) and run it in Snowflake. This should simplify data pipeline architecture by doing more pipeline-related work directly in Snowflake. There was a nice demo in the keynote about leveraging external functions so that an ML model could score customer care calls. The next part showed how to create a complete data pipeline in Snowflake thanks to Scala support. And the best thing is that the Scala code is pushed down to Snowflake virtual warehouses, which do the processing. Nice! You do not need any Spark environment to run the code; instead, Snowpark creates a temporary Java function which runs in Snowflake.
I can imagine this will help many projects simplify their architecture or use techniques they are already familiar with. All of that can be done on a single platform. Apart from ML use cases, I think this will be useful for many others, like:
- Data quality: imagine running your Deequ code directly from Snowflake
- Data Lakes and Data Apps: build more complex data pipelines directly in the platform
- AI and ML: train and run your models directly from Snowflake
Data governance features — Tagging and Row Level Security
Data governance has received several new features which definitely make some things easier. We have had custom solutions for things which it will now be possible to solve with tags or row level security. It is great to have native support for them.
It will be possible to link a tag to an individual object (table, view, column, …). For instance, you can mark all PII data with a sensitive tag and directly assign a policy to that tag which will automatically mask/hide those columns from all users who do not have the proper role. It shows how easy it will be to use several features together (tags, data masking, policies). The possible use cases are wide: you can tag columns per business unit, use case, refresh period or whatever you need.
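To make the idea more concrete, here is a small Python sketch of how I picture tag-based masking working. It is only a conceptual model and not Snowflake code or its API: the `Column` class, the `TAG_POLICIES` table, the role names and the `read_row` helper are all made up for illustration.

```python
from dataclasses import dataclass, field

MASK = "***MASKED***"


@dataclass
class Column:
    """A column with an optional set of tags (hypothetical model)."""
    name: str
    tags: set = field(default_factory=set)


# Hypothetical policy table: tag -> roles allowed to see the raw value.
TAG_POLICIES = {"pii": {"PRIVACY_OFFICER"}}


def read_row(row, columns, role):
    """Return a copy of the row with tagged columns masked for roles
    that are not allowed by the policy attached to the tag."""
    visible = {}
    for col in columns:
        # A column is readable only if every policy-bearing tag on it allows the role.
        allowed = all(
            role in TAG_POLICIES[tag] for tag in col.tags if tag in TAG_POLICIES
        )
        visible[col.name] = row[col.name] if allowed else MASK
    return visible


columns = [Column("customer_id"), Column("email", {"pii"}), Column("region")]
row = {"customer_id": 42, "email": "jane@example.com", "region": "EU"}

print(read_row(row, columns, "ANALYST"))
print(read_row(row, columns, "PRIVACY_OFFICER"))
```

The point of the design is that the policy hangs on the tag, not on the column, so tagging a new column as `pii` is enough to get it masked.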
Row level security
It brings the possibility to control access to data at row level, based on values in individual rows. So you can now show rows from a particular region (EU, US, APAC) or cost center/business unit just to those users who have the correct role.
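Conceptually, row level security behaves like a filter that is applied to every query based on the current role. The Python sketch below only models that idea; it is not Snowflake syntax, and the `ROLE_REGIONS` mapping, the role names and the `select_all` helper are assumptions of mine for the example.

```python
# Hypothetical mapping table: role -> regions that role may see.
ROLE_REGIONS = {
    "SALES_EU": {"EU"},
    "SALES_US": {"US"},
    "GLOBAL_ADMIN": {"EU", "US", "APAC"},
}


def row_is_visible(role, row):
    """Policy check: a row is visible if its region is allowed for the role."""
    return row["region"] in ROLE_REGIONS.get(role, set())


def select_all(rows, role):
    """Model of 'SELECT *' with the row-level policy applied transparently."""
    return [r for r in rows if row_is_visible(role, r)]


orders = [
    {"order_id": 1, "region": "EU", "amount": 120},
    {"order_id": 2, "region": "US", "amount": 80},
    {"order_id": 3, "region": "APAC", "amount": 45},
]

print(select_all(orders, "SALES_EU"))
print(select_all(orders, "GLOBAL_ADMIN"))
```

Every user runs the same query against the same table, and a role without any mapping simply gets no rows back.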
Other data governance features
The data masking feature, which has already been in public preview for a couple of months, also belongs here. Another small mention was about improvements in the privacy area to simplify compliance with GDPR or any other regulatory requirement.
Unstructured data support
It has been announced that Snowflake will also support unstructured data like audio or video files. The motivation here is obvious: better support for data lake use cases, as Snowflake mentioned many times during the conference that they want to be a complete data platform and serve more use cases than only DWH. I don’t think any release date was mentioned, so let’s see.
In terms of performance improvements, one general statement was made. Snowflake is constantly working on improving performance, which they demonstrated by comparing the run time of queries that ran in August 2019 and August 2020 → 72% of them improved by at least 50%. Better performance naturally leads to lower cost, as you need less time to do your tasks. Two new performance features were also announced, and there was a separate session which provided more details about the latest performance features. The session is called: Ease of Performance: Best Practices Using the Latest Performance Features with Snowflake
Search optimization service
This one is already in public preview, and you need an enterprise license for it. This feature tries to improve the performance of lookup queries on large tables. It is a table-level property. More details are available in the documentation here: Using the Search Optimization Service.
Query acceleration service
It should bring more parallelization into queries by scaling out parts of the query. It will be available in public preview in the next release.
You can find all the announcements with more details in the following Snowflake post 👉🏻 Data Cloud Summit 2020 Announcements
Sessions worth checking out
I would like to point out my top sessions from the whole conference. These were the most valuable for me.
Migrating Zabka to Snowflake
A great story of how to become a data-driven company in just a few years. Zabka did not utilize their data much a couple of years ago. It all started with the first use case about pricing optimization and store segmentation, which ended up with 10% EBITDA growth. Now they use ML in all areas of the company, for instance for planning new store locations or optimizing the goods available in their stores based on the usual customers in that area. And they have big plans for how to utilize the data more in the coming years. Great work! I can definitely recommend that session.
Continuous Data Pipelines: Foundations and Effective Implementation at Convoy
A good reference case about real-time data processing leveraging technologies like Kafka, Debezium, S3, Snowpipe, Tasks and Streams, dbt and Fivetran. A pretty impressive technology stack! Check it out if you would like to know their challenges and complete setup.
Building Extensible Data Pipelines with Snowflake
A slight extension of the demos shown in the keynote, containing further use cases for external functions and Snowpark together with a live demo.
Building a scalable data lake using Amazon S3 and Snowflake
Reference case for data lakes on Snowflake. It has an interesting architecture for the landing zone, where all data is stored in S3, Snowflake external tables point to S3, and views on top of the external tables match the source tables. Transactional data is ingested via Snowpipe. What next? Matillion as the ETL tool and Amazon DMS for data ingestion. The planned cost savings for 2021 are impressive: $1 million thanks to decommissioning the legacy warehouse (covers licensing, storage and servers).
Moving to and Living with Snowflake
A good DWH migration reference case. It describes the migration journey of Sony Interactive Entertainment from Netezza into the cloud. So if you plan such a migration, you might be interested. The session describes their migration strategy, challenges and lessons learned. They had to migrate 1.2 PB of data.
Streamlining Data Science with Snowflake
I am not a skilled data scientist, but this session presents Snowflake’s possibilities in terms of data science workloads and pipelines: how Snowflake can fit into your current ML pipeline, or where to place it, because now with the recent announcements ML models can be deployed directly into Snowflake. It also contains a demo about predicting the number of bike trips in NY. A Jupyter notebook is created and the Snowflake connector for Python is used. Amazon SageMaker is called via an external function to train the model. This is a really nice showcase of the new possibilities in relation to ML.
I think the conference has shown where Snowflake is heading. They have changed their communication from "we are the first, one and only data warehouse for the cloud" to "we want to be your only data platform in the cloud and support more workloads than only DWH". The recent announcements show they mean it, and the coming features will definitely help to utilize Snowflake more in data lake or data science areas. Personally, I am looking forward to trying the new data governance features as well as the new integration possibilities, because they might open new opportunities. We will see, hopefully soon! 🤞🏻❄️