Dataiku as an ETL tool for your next cloud DWH?

Dataiku is one of the leaders in the AI/ML segment, and its tool, Data Science Studio (DSS), is a real gem. But can you use it as your one and only ELT/ETL tool for your cloud data warehouse, no matter whether you build it on top of Snowflake, Cloudera, or any other cloud-based technology? Sure thing! We did it, and not only once. Read about my experience building extensive data pipelines with Dataiku, based on the various use cases we have been working on for almost three years.

Dataiku was founded in 2013, and today it is a startup with "unicorn" status, valued at $1.4 billion. Dataiku has also been named a Leader in the Gartner Magic Quadrant for data science platforms and, last but not least, is one of the best employers in the US for 2021. Their tool, Data Science Studio, packs many great features into a single product. It provides a single platform and a common language for collaboration across all the roles involved in data product development, from analysts through data engineers and data scientists to decision-makers. In this post, I would like to go through its features, challenges, and pitfalls related to data pipeline development. That is not the primary focus of the tool, but believe me, it is absolutely possible to replace tools like Informatica with Dataiku and use it as your primary, or even only, ELT/ETL tool. Let’s jump into it!

Our story

For one of our customers, we have been working with Dataiku since June 2018, and since day one it has been the primary ETL tool. We have built tens of different use cases over those three years and fine-tuned our setup to close the feature gaps we have discovered along the way. It started like many other data projects in these cloud-positive times: a DWH migration from on-prem into the cloud (Cloudera). Later we did a Cloudera → Snowflake migration, and last but not least, we have been working on another modern data app built on top of Snowflake with Dataiku. Nowadays, Snowflake is our primary storage/compute layer. Dataiku has played a key role in all of these projects and allows us to deliver use cases within a short time. Why? Because of its architecture, its way of working, and the features it offers. Let’s look at how you work with Dataiku when building data pipelines.

Way of Working

Dataiku, like Informatica or Microsoft SSIS, is mainly a visual tool where you work with various components that you connect into a complete data flow. Dataiku calls these transformations recipes. You can use the typical transformations found in data pipelines, like Join, Window, or Group. There are many more; you can find them in the image below or here in the documentation. For instance, the Prepare recipe can handle quite extensive data transformation tasks, from different formatting options, through filtering rows and columns, up to complex transformations based on the DSS formula language.

DSS visual recipes

Visual components are not the only option: if you need to write a piece of code and include it in your flow, you can use a code recipe. The types of code recipes available depend on which languages and engines are installed on your DSS server, from the obvious SQL, through Python, up to Spark or R. This combination of visual and code components gives you a strong foundation to handle almost any data-related task.

DSS code recipes
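To give an idea of what a code recipe looks like, here is a minimal sketch of a Python recipe that reads one dataset, adds a derived column, and writes the result to the recipe's output dataset. The dataset and column names are invented for illustration; the `dataiku` package itself is available inside any DSS Python recipe.

```python
# Minimal sketch of a DSS Python code recipe (dataset and column names are illustrative).
import dataiku

# Read the recipe's input dataset into a pandas DataFrame
orders = dataiku.Dataset("orders_raw")
df = orders.get_dataframe()

# Any Python/pandas transformation can live here; this one is just an example
df["order_value"] = df["quantity"] * df["unit_price"]

# Write the result (and its schema) into the recipe's output dataset
orders_enriched = dataiku.Dataset("orders_enriched")
orders_enriched.write_with_schema(df)
```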

Those are the basic building blocks of your flows, but that is not all. The range of available components is much wider. It includes recipes for building data science models, notebook support for testing your code and ideas, and the possibility to write your own web apps and extend DSS features. There are also APIs for Python and R, plus a public REST API, which you can use to interact with DSS through code and build whatever automation you need.
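As a small illustration of that API, the sketch below uses the `dataikuapi` Python client to connect to a DSS instance from the outside and list what it contains. The host URL, API key, and project key are placeholders, and the exact method names and return shapes can differ slightly between DSS versions, so check them against the API reference of your release.

```python
# Sketch: talking to DSS from outside via the dataikuapi client.
# Host URL, API key, and project key are placeholders -- adjust to your setup.
import dataikuapi

client = dataikuapi.DSSClient("https://dss.example.com:11200", "YOUR_API_KEY")

# List all projects on the instance
for project_key in client.list_project_keys():
    print(project_key)

# Inspect one project and the datasets it contains
project = client.get_project("SALES_DWH")
for ds in project.list_datasets():
    print(ds["name"], ds["type"])
```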

How to get familiar with DSS?

If you want to learn how to work with DSS, I can strongly recommend the Dataiku Academy, where you will find many learning paths and even certifications. Feel free to go and try it; it is free.

How Dataiku performs compared to Informatica

Let’s have a closer look at how Dataiku performs compared to one of the leaders in the data transformation segment: Informatica. I have built many projects in Informatica, on top of different databases including Oracle, Netezza, and Teradata. I was an Informatica developer for more than eight years, so I would say I have extensive knowledge of the tool, its way of working, and its pros and cons. When I say Informatica, I mean the PowerCenter tool. I know they also have a cloud-based ETL product called Cloud Data Integration, but I have never used it.

Modern UI

The DSS client is accessible via a web browser; you do not need to install anything on your computer. The app has a modern, responsive, and pixel-perfect design, and thanks to the browser-based IDE, it works on any platform. Informatica has a client only for Windows, meaning you have to install it on your local machine. From a design point of view, Informatica is stuck in the nineties, and it has looked the same for as long as I have known it.

DSS UI

GIT integration

Both tools have version control integration; however, I feel that only one of them makes your work easier and turns version control into something useful rather than a pain. The seamless Git integration in DSS is a real pleasure to use. It does not block your work; you can easily go back to any commit in the past, compare commits, and so on. My experience with version control in Informatica is anything but seamless. There have always been issues, the most typical one being “you can’t modify this object because it is opened by someone else…”. Using version control in Informatica has always been a pain for me; it complicates the work and makes it less productive.

Different levels of abstraction

Especially for beginners, it can often be complicated to understand the way of working with Informatica. You have mappings, sessions, and workflows, and you need to understand what kind of task you are solving and how to combine them. DSS has a different (and, for me, easier) model for preparing a complete data flow. If I had to map Informatica components to DSS ones, I think this comparison works:

  • mapping in Informatica = data flow in DSS
  • workflow in Informatica = DSS scenario

An Informatica session, which is basically an “instance” of a mapping where you set the connections, logging, and other configuration, is something that I think has no equivalent in DSS. That is a good thing, because it makes development faster and easier when you do not have this extra abstraction layer.

An easier way of working

What is really great for me in DSS is that you connect whole tables/datasets between steps, which is much faster and easier than connecting each individual column between transformations, as you have to do in an Informatica mapping. Tables with 100+ columns always drove me mad, because you have to drag every single column from one transformation to another. It is so inefficient.

Another example is variable handling. In DSS, you can have global variables or local project variables. Informatica has parameters and variables at the workflow or mapping level. If you need the value of a workflow variable in a mapping, you have to pass it down, and vice versa: from a mapping, you need to pass it up to the workflow. It works, but it feels clumsy once you have seen a much more effective approach. As I already mentioned, DSS also exposes the same variables through its quite extensive API.
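To illustrate, here is a hedged sketch of how the same variables can be read inside a recipe and managed through the API. The variable names, host, key, and project key are invented for the example; verify the exact calls against the API docs of your DSS version.

```python
# Inside a DSS Python recipe: all resolved variables are available as strings.
import dataiku

variables = dataiku.get_custom_variables()
load_date = variables["load_date"]        # illustrative project variable
env_name = variables["environment_name"]  # illustrative global variable

# Outside DSS (or in automation code): manage project variables via the API.
import dataikuapi

client = dataikuapi.DSSClient("https://dss.example.com:11200", "YOUR_API_KEY")
project = client.get_project("SALES_DWH")

project_vars = project.get_variables()    # {"standard": {...}, "local": {...}}
project_vars["standard"]["load_date"] = "2021-06-30"
project.set_variables(project_vars)
```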

Support

The last point in my comparison is support. I have to admit that I have never seen better support than Dataiku provides. They are so responsive and helpful! You send an email, and usually within 30 minutes you have a relevant answer to your issue. It is an answer from a real Dataiku support engineer, not an automatically generated message or a generic “Have you tried turning it off and on?” This is really great, and a huge thumbs up for Dataiku support. 👍

On the other hand, have you ever tried to contact Informatica support? Well, I have, and I could stop right there. Let me just say that it can take weeks, during which you have to go through different levels of support that keep trying to send you links from the knowledge base (and that is the better case).

Dataiku benefits

Let’s focus on the benefits Dataiku offers to users and developers. First, I would definitely mention the fast learning curve. Getting a basic understanding of the available toolset, the main way of working, and the available components is effortless and straightforward. Our team has onboarded several new developers during the last three years, ranging from total juniors to lead developers. All of them have been able to internalize the basic principles within a couple of days and start working on real tasks. Why is that?

  • great UI/UX, which is easy to learn and understand
  • a strong yet compact set of basic data transformation components, good enough to solve the majority of typical data preparation tasks
  • a unified way of working across all recipes

Faster development

Again, I have to compare with Informatica. From my own experience, things that would take me a week in Informatica I can manage in just two days in Dataiku. Why? It is a combination of what I’ve already mentioned: fewer levels of abstraction, an easier way of working with complete datasets between transformations instead of single columns, a great UI, a strong API, simpler variable handling, and so on. All of that gives me a strong framework to build on. It has a big impact on the whole delivery, because suddenly you can deliver more in the same amount of time, which saves money and positively influences team velocity and trust.

Collaborative platform

Dataiku’s collaborative features are another piece of the puzzle. Features like discussions attached to each recipe or dataset, with multiple threads just like on a discussion forum, make information exchange between the various roles working on a project (data engineers, analysts, business users, or data scientists) smoother, easier, and faster. And because everything is part of the platform, you do not need to dig through emails to find what colleague X said about this or that piece of logic or implementation detail. Everything is in one place, visible to everyone.

Not a single tool but a robust platform

Dataiku DSS is not just a single-purpose tool (an ETL or ML/AI tool). It provides features from data profiling and data analysis, through data pipeline development, up to building ML models on top of your datasets. Having a single tool that can handle all of that, and handle it well, might be a game-changer for many companies.

Smooth and easy integration/support of modern cloud technologies

DSS is a modern tool that works smoothly with the latest cloud technologies. Over the years, we have used it with many execution engines (Spark, Impala, MapReduce), databases (Snowflake, Athena), and big data technologies like Hive and HDFS. Add to that coding support for Python, R, SQL, and shell scripts. What else might you need for your data projects?

Continuous improvement

Dataiku continuously improves the tool. We started with version 5.0 almost three years ago, and version 9.0 has recently been introduced. Dataiku actively listens to feedback from users and the community and improves the tool based on it. More than a year ago, I published a list of my pitfalls when using Snowflake in DSS, and some of them have already been addressed. I can clearly see that with every release, many features are improved or newly introduced. For instance, the latest release contains experimental streaming support and an improved deployment process between environments.

Drawbacks

Of course, DSS also has limitations and drawbacks; nothing is 100% perfect. I would mention that DSS is not built to support the complicated delta loads that you can implement smoothly in Informatica. I am not saying it can’t be done, but it just won’t be as elegant and easy. Visual recipes allow you to append data to the target instead of overwriting it, but you can’t define any complex logic for handling inserts/deletes/updates. Basically, you have to fall back to SQL code recipes and write your merge/upsert logic manually.
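For illustration, this is roughly what such manual upsert logic can look like: a Snowflake MERGE statement that would normally live in a SQL script recipe, here wrapped in a Python recipe for completeness. The table names, columns, connection name, and the way the DML is executed are all assumptions; check the SQLExecutor2 documentation for your DSS version before relying on this pattern.

```python
# Sketch of manual upsert logic for a Snowflake target. Table and column names
# are illustrative; in practice this statement usually sits in a SQL script recipe.
from dataiku import SQLExecutor2

MERGE_SQL = """
MERGE INTO dwh.customers AS tgt
USING staging.customers_delta AS src
    ON tgt.customer_id = src.customer_id
WHEN MATCHED AND src.is_deleted THEN DELETE
WHEN MATCHED THEN UPDATE SET
    tgt.name       = src.name,
    tgt.segment    = src.segment,
    tgt.updated_at = src.updated_at
WHEN NOT MATCHED THEN INSERT (customer_id, name, segment, updated_at)
    VALUES (src.customer_id, src.name, src.segment, src.updated_at)
"""

# One possible way to run DML from a Python recipe: pass it as a pre-query.
# The connection name is an assumption; verify this pattern for your DSS version.
executor = SQLExecutor2(connection="snowflake_dwh")
executor.query_to_df("SELECT 1", pre_queries=[MERGE_SQL])
```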

Orchestration

If you have a large DWH with tens or hundreds of data pipelines, it becomes too hard to orchestrate everything via scenarios and triggers, because you can have a scenario that waits for two or more datasets which need to be built in advance by two different scenarios, as in the example below. Imagine you have project C with input datasets A1, shared from project A, and B2, shared from project B. Those need to be refreshed first, and only then can you run the scenario in project C. Defining a scenario trigger for this can be tricky, and that is just a simple example. We have many similar, more complicated dependencies where one scenario in project X waits for datasets from tens of other projects. We were getting lost in orchestration with just 10+ projects and the mutual dependencies between them.

Thanks to the Dataiku API, we have built a custom orchestration tool that calculates the readiness of each dataset in each scenario. If all input datasets are “ready”, meaning they have already been refreshed, then the scenario is triggered by this orchestration tool. That is the basic idea behind it (a simplified sketch follows below the diagram); of course, the solution has been improved over the years, and nowadays it handles much more.

Example of mutual dependencies between projects
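Here is a very simplified sketch of that idea using the dataikuapi client: check whether all cross-project input datasets are ready, and only then trigger the scenario. The project keys, dataset names, and scenario id are invented, and the readiness check is a placeholder; our real tool derives readiness from refresh metadata recorded by the pipelines and handles far more cases.

```python
# Simplified sketch of the orchestration idea: trigger a scenario only when all
# of its cross-project input datasets are "ready". All names are illustrative.
import dataikuapi

client = dataikuapi.DSSClient("https://dss.example.com:11200", "YOUR_API_KEY")

# (project_key, dataset_name) pairs that project C depends on
INPUTS = [("PROJECT_A", "A1"), ("PROJECT_B", "B2")]


def dataset_is_ready(project_key, dataset_name):
    """Placeholder for the real readiness check (we derive it from the last
    refresh timestamps that our pipelines record in metadata)."""
    return False  # replace with your own logic


if all(dataset_is_ready(p, d) for p, d in INPUTS):
    scenario = client.get_project("PROJECT_C").get_scenario("daily_load")
    scenario.run_and_wait()  # or scenario.run() to simply fire it
    print("PROJECT_C daily_load has been triggered")
```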

Deployments

The same issue applies here as with orchestration. Consider having tens or hundreds of projects with mutual dependencies, where datasets are shared between them. A schema change in one project means updating and deploying all dependent projects where that dataset is used, which becomes complicated in large setups. To solve this, we have developed a custom web app directly in DSS, which we use for deployments. It automatically checks whether a schema change requires deploying other projects that use the same (updated) dataset and notifies you to include them in the deployment. Then, with a single click, the app prepares the project bundles and automatically deploys them to the automation node behind the scenes. No more manual bundle creation, export, and import for each of them.
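Under the hood, the bundle part of that web app boils down to a handful of API calls, roughly like the sketch below: export a bundle on the design node, then import and activate it on the automation node. The node URLs, API keys, project key, and bundle id are placeholders, and the exact method names have shifted a bit across releases, so double-check them against the dataikuapi reference for your DSS version.

```python
# Rough sketch of automated bundle deployment from a design node to an
# automation node. URLs, keys, project key, and bundle id are placeholders;
# verify the method names against the dataikuapi docs for your DSS version.
import dataikuapi

design = dataikuapi.DSSClient("https://dss-design.example.com:11200", "DESIGN_API_KEY")
automation = dataikuapi.DSSClient("https://dss-automation.example.com:11200", "AUTOMATION_API_KEY")

PROJECT_KEY = "SALES_DWH"
BUNDLE_ID = "release_2021_06"
ARCHIVE = "/tmp/%s.zip" % BUNDLE_ID

# 1. Create and download the bundle on the design node
design_project = design.get_project(PROJECT_KEY)
design_project.export_bundle(BUNDLE_ID)
design_project.download_exported_bundle_archive_to_file(BUNDLE_ID, ARCHIVE)

# 2. Import and activate the bundle on the automation node
automation_project = automation.get_project(PROJECT_KEY)
automation_project.import_bundle_from_archive(ARCHIVE)
automation_project.activate_bundle(BUNDLE_ID)
```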

Dataiku introduced improved deployment handling in version 9.0, which is definitely a move in the right direction, but I still believe that our custom tool, which we have been improving and tweaking over the last three years, will serve us better.

Summary

Wow, this turned out longer than I expected. I’ve tried to give an overview of how well Dataiku can be used as the main or only ETL/ELT tool for your next cloud data project. The big benefit of Dataiku is that you don’t get just an ETL tool but a complete data platform which all participating roles can use. There are some limitations, which I have tried to cover. I would definitely recommend including Dataiku in your decision process and evaluating whether it can serve your needs, the same way it has served us for years already.

--

Tomáš Sobotík
Snowflake Builders Blog: Data Engineers, App Developers, AI/ML, & Data Science

Lead data engineer & architect, Snowflake Data Superhero and SME, O'Reilly instructor. Obsessed by cloud & data solutions. ☁️ ❄️