Challenges when using Snowflake for building ELT pipelines

The Snowflake platform keeps maturing in many ways. New features are constantly being added across different areas and workloads, so you can start considering the platform for use cases and tasks that would have been impossible a couple of years ago. A good example of this is combining various Snowflake features and using the Data Cloud platform to build modern ELT pipelines. In other words, using Snowflake also for data extraction, loading & transformation.

Is it even possible? Sure it is! But you might face several challenges that are not obvious at the beginning. I have done it for several of my use cases, and in today's blog post I would like to highlight what will be waiting for you if you decide to follow the same path.

Why use the Snowflake platform for ELT

Why would anyone even think about using the Data Cloud platform as an ELT tool instead of dedicated tools like DBT, Matillion, Fivetran and many others? Having everything in one platform can bring benefits:

  • simpler technology stack with fewer tools
  • cost savings: no additional licenses, no extra virtual machines to run, etc.
  • everything in one place can simplify maintenance
  • security: everything runs inside Snowflake, with a strong RBAC model, encryption, the SaaS model, etc.
  • the platform is constantly being improved

Snowflake already offers features that help you build automated data pipelines, starting with data ingestion (the COPY command, Snowpipes) and transformations (Tasks, Stored Procedures, UDFs, Snowpark, etc.). You can even do change data capture and process only newly loaded data (Streams). As you can see, the available feature set is already quite broad, offering everything needed for both simple and complicated pipelines.
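To give a flavour of these building blocks, here is a minimal, hypothetical ingestion sketch: a COPY command loading files from a stage into a raw table, plus a stream on top of it for change data capture. The stage, table and column names are made up for illustration.

```sql
-- Hypothetical objects: an external stage and a raw landing table
COPY INTO raw.orders
  FROM @raw.s3_landing/orders/
  FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
  ON_ERROR = 'CONTINUE';

-- A stream on the raw table records the rows added by the load,
-- so downstream steps can process only the new data
CREATE OR REPLACE STREAM raw.orders_stream ON TABLE raw.orders;
```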

Snowflake has recently added two more features related to tasks that push the platform's use as a transformation tool to the next level. The first is task tree visualization as a DAG. Before, there was no easy way to see the dependencies between tasks. You could query the system metadata to get the dependencies, but the visual form is much better, easier to understand, and follows the industry standard for how data pipelines are visualized.
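For reference, this is roughly the kind of workaround the DAG view replaces: a sketch using the INFORMATION_SCHEMA.TASK_DEPENDENTS table function to walk the dependencies from a root task (the task name below is hypothetical).

```sql
-- List all tasks that depend on a (hypothetical) root task, recursively
SELECT name, predecessors, state
FROM TABLE(INFORMATION_SCHEMA.TASK_DEPENDENTS(
       TASK_NAME => 'ELT_DB.TRANSFORM.ROOT_TASK',
       RECURSIVE => TRUE));
```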

Task dependencies visualized as a DAG

This should speed up adoption of tasks, as more and more people will see them as usable and as providing value out of the box. No more workarounds just to see the complete pipeline. Another thing: you can now also see the task history in the same view, just on a different tab, again without querying any system views. This will be welcomed mainly by operations teams that take care of keeping the pipelines running. All the information they might need is now available within a couple of clicks.
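For comparison, the history tab replaces ad-hoc queries like this sketch against the INFORMATION_SCHEMA.TASK_HISTORY table function (the parameter values are just an example).

```sql
-- Last day of task runs, newest first
SELECT name, state, scheduled_time, completed_time, error_message
FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY(
       SCHEDULED_TIME_RANGE_START => DATEADD('hour', -24, CURRENT_TIMESTAMP()),
       RESULT_LIMIT => 100))
ORDER BY scheduled_time DESC;
```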

Task history in the new UI

The second addition is a new Snowflake object that integrates Snowflake with a cloud messaging service, so you can receive a message whenever there is a failure on the Snowflake side. The object is called a notification integration.

It can be used either for error notifications for Snowpipes or for error notifications for Tasks. This feature is currently (November 2022) in public preview and supports only AWS SNS as the cloud messaging provider.
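As a rough sketch of how the wiring looks (the ARNs and the task name below are placeholders): you create a notification integration pointing to an SNS topic and then attach it to a pipe or task.

```sql
-- Placeholder ARNs; AWS SNS is the only supported provider at the time of writing
CREATE NOTIFICATION INTEGRATION pipeline_errors_int
  ENABLED = TRUE
  TYPE = QUEUE
  NOTIFICATION_PROVIDER = AWS_SNS
  DIRECTION = OUTBOUND
  AWS_SNS_TOPIC_ARN = 'arn:aws:sns:us-east-1:123456789012:snowflake-errors'
  AWS_SNS_ROLE_ARN  = 'arn:aws:iam::123456789012:role/snowflake-error-notifications';

-- Attach the integration to an existing (hypothetical) task so failures publish to the topic
-- (the task must be suspended before ALTER and resumed afterwards)
ALTER TASK my_db.my_schema.my_task SET ERROR_INTEGRATION = pipeline_errors_int;
```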

This is going to help significantly with data pipeline automation, as you and your team can now receive a notification whenever a Snowpipe or Task fails. I have a separate blog post about this topic that describes how to integrate the error notifications for Tasks with Slack and get a Slack message in case of failure. You can find it here. 👇

Error Notifications for Snowflake Tasks | by Tomáš Sobotík | Snowflake | Sep, 2022 | Medium

My solution

I tried to use purely Snowflake features for one of my data pipelines because I saw a good fit for them. The use case is near-real-time data integration into Snowflake, after which the data needs to be transformed and sent to another application via an API. Near real time is perhaps not the best description, as a new file lands on cloud storage (AWS S3) every 15 minutes, but I believe that thanks to the chosen Snowflake features and the overall architecture, the solution can easily scale up to process messages at a higher frequency. It is based on a combination of features like Snowpipes, streams and tasks, together with various integrations like the error notifications mentioned earlier. A basic, high-level architecture schema can be found below.

High level solution architecture

I use Snowpipes with auto-ingest. There are several Snowpipes because there are several data files. The Snowpipes just import the data from the source files into Snowflake RAW tables as they are; no transformations are applied at this stage. Then I use a combination of Snowflake streams and tasks for the data processing. The whole transformation consists of several steps and intermediate tables. On top of each table there is a Snowflake stream to capture changed records. All tables in the pipeline persist their state (they are not truncated), and thanks to the streams I always process only the newly added records. The pipeline is simple and straightforward and can be represented by the following DAG:

Data pipeline as dependent tasks

As you can see from the DAG representation, it is quite a straightforward and simple pipeline: no complex dependencies, just a few steps. The newly available UI really helps during development. When I was developing the pipeline it was not available yet, and many times I felt lost, not sure how the tasks were connected to each other, and had to query the task definitions over and over.
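Here is a condensed sketch of the pattern described above, with made-up object names (and assuming the raw table and stream from the earlier ingestion sketch): a Snowpipe with auto-ingest loading the RAW table, and a task that runs only when its stream has new data.

```sql
-- Snowpipe with auto-ingest: new files landing on S3 are loaded automatically
CREATE OR REPLACE PIPE raw.orders_pipe
  AUTO_INGEST = TRUE
AS
  COPY INTO raw.orders
  FROM @raw.s3_landing/orders/
  FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);

-- Root task: fires every 15 minutes, but only does work when the stream has new rows
CREATE OR REPLACE TASK transform.orders_enrich
  WAREHOUSE = elt_wh
  SCHEDULE = '15 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('RAW.ORDERS_STREAM')
AS
  INSERT INTO transform.orders_enriched (order_id, amount, processed_at)
  SELECT order_id, amount, CURRENT_TIMESTAMP()
  FROM raw.orders_stream
  WHERE METADATA$ACTION = 'INSERT';

-- Downstream steps would be defined with AFTER transform.orders_enrich
-- instead of their own SCHEDULE, forming the DAG shown above
ALTER TASK transform.orders_enrich RESUME;
```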

What is Snowflake still missing?

Let's now focus on the things I still miss in the Snowflake platform. The following points are my own observations, and I think that adding some of them would improve the overall efficiency of using Snowflake for data engineering tasks. On the other hand, I am aware that Snowflake is primarily a data platform, not a transformation tool. Still, it is good to highlight these points, because they help you realize what you will need to deal with if you decide to use Snowflake alone for your transformations.

Missing native Git integration

I really hope this is one of the most requested features. 🙂 I have been asking for it since my early days with Snowflake. There is no native, direct integration with Git. If you want to keep your project under version control, you need to do it on your own, manually. This matters especially here, when you are building a complete data pipeline containing many objects (tables, streams, tasks, Snowpipes): it is really necessary to keep the DDL under version control, because you will be changing it tons of times, and doing that without having everything in Git would be too cumbersome. I prefer using Snowsight for my development, which means copying the DDLs manually and syncing them with Git. For the Git integration itself I use VS Code.

It is also possible to use VS Code for developing the Snowflake code itself; you can find more in the following blog post: Using Visual Studio Code with Snowflake. The Git integration would then be more straightforward, but I have never gotten used to it for development. A sketch of my manual DDL export is below.
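The manual workflow is essentially pulling the current definitions out of Snowflake with GET_DDL and committing the output by hand; here is what that looks like with the hypothetical object names from the earlier sketches.

```sql
-- Export the current DDL so it can be committed to Git manually
SELECT GET_DDL('TABLE',  'RAW.ORDERS');
SELECT GET_DDL('STREAM', 'RAW.ORDERS_STREAM');
SELECT GET_DDL('PIPE',   'RAW.ORDERS_PIPE');
SELECT GET_DDL('TASK',   'TRANSFORM.ORDERS_ENRICH');
```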

Missing the support provided by "real" ELT tools

This point covers multiple things. I am sure you have used some ELT tool at some point, but perhaps you never realized how much such tools help you. They make your life easier and your work more efficient. A transformation tool can do things like logic validation: before you run the code, you get a warning that there is a typo in a column name or in the SQL statement. That has happened to me many times. A typo in the code caused me a lot of trouble, because the pipeline uses Snowflake streams as the data source for each step, and running a task or SQL statement that was wrong still consumed the stream, so I had to prepare the development data again. Having some validation in place before the code is actually executed would have saved me a lot of time.
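For context on why a bad run hurts so much: a plain SELECT does not advance a stream's offset, but any committed DML that reads from it does, regardless of whether the logic was right. A sketch with the same hypothetical names as above, including one way to rewind during development using Time Travel:

```sql
-- Safe: SELECT alone does not consume the stream
SELECT COUNT(*) FROM raw.orders_stream;

-- Not safe: once this INSERT commits, the stream offset moves on,
-- even if the transformation logic was wrong
INSERT INTO transform.orders_enriched (order_id, amount, processed_at)
SELECT order_id, amount, CURRENT_TIMESTAMP()
FROM raw.orders_stream;

-- One way to get development data back: recreate the stream at an earlier
-- point in time (here one hour ago) using Time Travel
CREATE OR REPLACE STREAM raw.orders_stream
  ON TABLE raw.orders
  AT (OFFSET => -3600);
```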

Managing datasets

Another job a transformation tool can do for you is managing the datasets. You do not need to worry about writing CREATE TABLE scripts or ALTER scripts when adding or removing columns; you just update your transformation logic and the tool updates the underlying objects for you. Not here. You need to take care of this yourself, and it cost me a lot of time: time spent on a repetitive, boring task that is error prone. It is easy to make a mistake when you have to handle it manually. Imagine a pipeline consisting of tens of intermediate tables where you need to add a column.
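To make it concrete, adding a single attribute in a setup like mine means roughly the following (hypothetical names), repeated for every intermediate table, plus editing every task whose INSERT lists the columns.

```sql
-- The same column has to be added to the RAW table and every intermediate table...
ALTER TABLE raw.orders                ADD COLUMN order_channel VARCHAR;
ALTER TABLE transform.orders_enriched ADD COLUMN order_channel VARCHAR;
-- ...and every task definition that enumerates columns has to be recreated by hand
```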

Missing documentation

A lot of tools can automatically generate documentation about your datasets and the developed logic, and the generated documentation is often good enough and easy to use. Yes, Snowflake has the internal SNOWFLAKE database and many views and table functions in ACCOUNT_USAGE or INFORMATION_SCHEMA, but you have to dig in and build the documentation yourself.
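For example, a basic data dictionary can be assembled from INFORMATION_SCHEMA.COLUMNS, but building and maintaining it is on you (the schema name below is a placeholder).

```sql
-- A do-it-yourself data dictionary for one schema
SELECT table_name, column_name, data_type, is_nullable, comment
FROM information_schema.columns
WHERE table_schema = 'TRANSFORM'
ORDER BY table_name, ordinal_position;
```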

Better visibility of task/pipeline history

This is mainly for operations teams. I am used to having at least a basic overview of data pipeline status: not only the current status but also some history, to see the trend and the overall performance. It can help identify discrepancies in data processing before they are spotted by business users. I miss something that would be available by default. But no worries, the data is available in the SNOWFLAKE internal database, so you can build your own dashboard summarizing the whole pipeline.

I've built something like that myself in Snowsight. In a single dashboard I have one week of task run history plus error counts, and the Snowpipe history, so I can see how data flows into my system. This gives me the basic overview which I think is good to have. Together with the error notifications, it should offer quite a strong foundation for smooth and easy pipeline maintenance.

Data pipeline status dashboard
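Dashboard tiles like these can be backed by queries along the following lines against the ACCOUNT_USAGE views (which have some ingestion latency); the exact grouping here is just one possible sketch.

```sql
-- One week of task runs per task per day, with failure counts
SELECT name,
       DATE_TRUNC('day', scheduled_time)  AS run_day,
       COUNT(*)                           AS runs,
       COUNT_IF(state = 'FAILED')         AS failures
FROM snowflake.account_usage.task_history
WHERE scheduled_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
GROUP BY 1, 2
ORDER BY 1, 2;

-- Files and rows loaded by the Snowpipes over the same week
SELECT pipe_name,
       DATE_TRUNC('day', last_load_time)  AS load_day,
       COUNT(*)                           AS files_loaded,
       SUM(row_count)                     AS rows_loaded
FROM snowflake.account_usage.copy_history
WHERE last_load_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
GROUP BY 1, 2
ORDER BY 1, 2;
```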

I wish there were a built-in "library" of predefined visualizations that I could combine as blocks in my dashboard, just configuring a few parameters. Maybe in the future, with the coming Streamlit integration? 🙂

Overall efficiency

I have been thinking about how efficient I was while developing this pipeline. I was significantly slower compared to doing the same thing in a dedicated transformation tool. That's fine for a first attempt, but for the reasons described above, I assume I would still be slower even if I were doing it again. My estimate is around 50% slower.

Wrap up

Would I build a complete ELT pipeline again using only Snowflake features? Yes, I would! Even though there are many obstacles. It might take more time, but I still think there are use cases where it makes sense (like the one used for this blog post). I wouldn't use it for each and every data pipeline, though. I would be more sceptical about:

  • a pipeline containing many intermediate datasets,
  • a pipeline combining data from multiple sources,
  • a pipeline interacting with other systems outside of Snowflake,
  • a pipeline sharing data with other systems via some API.

With this blog post I wanted to highlight the kind of obstacles you might face if you decide to build a data transformation pipeline purely in Snowflake. I hope it helps.


Tomáš Sobotík
Snowflake Builders Blog: Data Engineers, App Developers, AI/ML, & Data Science

Lead data engineer & architect, Snowflake Data Superhero and SME, O'Reilly instructor. Obsessed by cloud & data solutions. ☁️ ❄️