Dataiku is one of the leaders in the AI/ML segment, and their tool Data Science Studio is a real gem. But can you use it as your one and only ELT/ETL tool for your cloud data warehouse, no matter whether you build it on top of Snowflake, Cloudera or any other cloud-based technology? Sure thing! We did it, and not only once. Read about my experience building extensive data pipelines with Dataiku, based on the various use cases we have been working on for almost 3 years.
Dataiku was founded in 2013 and today it is a startup with "unicorn" status, valued at $1.4 billion. Dataiku has also been named a leader in the Gartner Magic Quadrant for Data Science platforms. Last but not least, Dataiku is one of the best employers in the US for 2021. Their tool, Data Science Studio, brings many great features together in a single tool and provides a single platform and common language for collaboration between all the roles involved in data product development, from analysts through data engineers up to data scientists and decision makers. In this post I would like to go through its features, challenges and pitfalls related to the development of data pipelines, as that is not the primary focus of the tool. But believe me, it is absolutely possible to replace tools like Informatica with Dataiku and use it as your primary, or even only, ELT/ETL tool. Let's jump into it!
For one of our customers we have been working with Dataiku since June 2018, and since day one Dataiku has been the primary ETL tool. We have built tens of different use cases over those 3 years and fine-tuned our setup to fill the feature gaps we have discovered along the way. It started like many other data projects in these cloud-positive times: a migration of the DWH from on-prem into the cloud (Cloudera). Later we did a Cloudera → Snowflake migration, and last but not least we have been working on another modern data app built on top of Snowflake with Dataiku. Nowadays Snowflake is our primary storage/computation layer. Dataiku has played a key role in all these projects and allowed us to deliver use cases within a short period of time. Why? Because of its architecture, way of working and the features it offers. Let's look at the way of working with Dataiku when building data pipelines.
Way of Working
Dataiku, like Informatica or Microsoft SSIS, is mainly a visual tool where you work with various components which you connect into a complex data flow. Dataiku calls these transformations recipes. You can use the typical transformations found in data pipelines, like Join, Window or Group. There are many more; you can find them in the image below or in the documentation. For instance, the Prepare recipe allows you to do quite extensive data transformation tasks, from different formatting options, through filtering rows and columns, up to complex transformations based on the DSS formula language.
There are not only visual components. If you need to write a piece of code and include it in your flow, you can use a code recipe. The types of available code recipes depend on which languages and engines you have installed on your DSS server, starting with the obvious SQL, through Python, up to Spark or R. This combination of visual and code components gives you a strong foundation to fulfil any data-related task.
That is just the beginning and covers only the basic building blocks of your flows. The range of available components is much wider: it includes recipes for building data science models, support for various notebooks to test your code and ideas, and the possibility to write your own web apps and extend the features of DSS. There are also APIs for Python and R, plus a public REST API, which you can use to interact with DSS through code and build whatever automations you need.
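As a small taste of that automation surface, here is a minimal sketch of talking to the public REST API with nothing but the Python standard library. The host name and API key are placeholders; as far as I recall, authentication is HTTP basic with the API key as the user name and an empty password, and `/public/api/projects/` lists the projects on the instance.

```python
import base64
import urllib.request

def build_list_projects_request(host, api_key):
    """Prepare (but do not send) a GET request for the projects endpoint."""
    token = base64.b64encode(f"{api_key}:".encode()).decode()
    return urllib.request.Request(
        f"{host}/public/api/projects/",
        headers={"Authorization": f"Basic {token}"},
    )

req = build_list_projects_request("https://dss.example.com", "MY_API_KEY")
print(req.full_url)  # https://dss.example.com/public/api/projects/
# Against a live instance you would then send it: urllib.request.urlopen(req)
```

In practice you would rather use the official `dataikuapi` Python client, but the point is that everything in DSS is scriptable over plain HTTP.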
How to get familiar with DSS?
If you want to learn how to work with DSS, I can strongly recommend the Dataiku Academy, where you can find many learning paths and even certifications for Dataiku. Feel free to go and try; it is free.
How Dataiku performs compared to Informatica
Let's have a closer look at how Dataiku performs compared to one of the leaders in the data transformation segment: Informatica. I have built many projects in Informatica, based on different DBs including Oracle, Netezza or Teradata. I was an Informatica developer for more than 8 years, so I would say I have quite extensive knowledge of the tool, its way of working, and its pros and cons. When I say Informatica I mean the PowerCenter tool. I know they also have a cloud-based ETL called Cloud Data Integration, but I have never used that tool.
The DSS client is accessible via a web browser; you do not need to install anything on your computer. The app has a modern, responsive and pixel-perfect design, and thanks to the browser-based IDE it works on any platform. Informatica has a client only for Windows, meaning you have to install it on your local computer. From a design point of view Informatica is stuck in the nineties; it has looked the same for as long as I have known it.
Both tools have version control integration, however I have the feeling that only one of them makes your work easier, turning version control into a help rather than a pain. The seamless Git integration in DSS is a real pleasure to use. It does not block you from your work; you can easily go back to any commit in the past, compare commits, and so on. My experience with version control in Informatica is anything but seamless. There have always been issues, the most typical one being "you can't modify this object because it is opened by someone else ...". Using version control in Informatica has always been a pain for me: it complicated the work and made it less productive.
Different level of abstraction
Especially for beginners, it can be complicated to understand the way of working with Informatica. You have mappings, sessions and workflows, and you need to understand which kind of task is solved on which level and how to combine them. DSS has a different (and, for me, easier) model for preparing a complete data flow. If I were to map Informatica components onto DSS ones, I think this would work:
- mapping in Informatica = data flow in DSS
- workflow in Informatica = DSS scenario
An Informatica session, which is basically an "instance" of a mapping where you set the connections, logging information and other configuration, has no equivalent in DSS as far as I can tell. And that is a good thing: without this extra layer of abstraction, development gets faster and easier.
Easier way of working
One thing that is really great for me in DSS: you connect whole tables/datasets between steps, which is much faster and easier than connecting each and every column between transformations, as you must do in an Informatica mapping. I always got mad with tables of 100+ columns, where you need to lead each column from one transformation to another. It is so ineffective.
Another example is variable handling. In DSS you can have global variables or local project variables. Informatica has parameters and variables at the workflow or mapping level: if you need the value of a workflow variable in a mapping, you have to pass it down, and vice versa, from a mapping you have to pass it up to the workflow. It works, but it just feels clumsy once you have seen a much more effective approach. As I already mentioned, DSS has quite a wide API through which the same variables are also available.
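To illustrate, here is a small sketch of working with project variables. The `dataikuapi` calls are shown as comments (host, key and project names are made up), and the helper mirrors the merged view a recipe sees, where project-local values shadow the global ones.

```python
def resolve_variables(global_vars, project_vars):
    """Merged view of variables as a recipe sees it: project-local
    values shadow instance-wide (global) ones."""
    merged = dict(global_vars)
    merged.update(project_vars)
    return merged

# Against a live DSS instance the same variables are reachable through the
# Python API, roughly like this (host, key and project are placeholders):
#   import dataikuapi
#   client = dataikuapi.DSSClient("https://dss.example.com", "MY_API_KEY")
#   project = client.get_project("MY_PROJECT")
#   v = project.get_variables()          # {"standard": {...}, "local": {...}}
#   v["standard"]["run_date"] = "2021-05-01"
#   project.set_variables(v)

print(resolve_variables({"env": "prod", "region": "eu"}, {"region": "us"}))
# {'env': 'prod', 'region': 'us'}
```

No passing values up and down between layers: any recipe or scenario in the project simply reads the same variables.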
My last point of comparison is support. I have to admit that I have never seen better support than Dataiku provides. They are so responsive and helpful! You just send an email and usually within 30 minutes you have a relevant answer to your issue. It is an answer from a real Dataiku support engineer, not an automatically generated message or a generic "Have you tried turning it off and on?". This is really great and a huge thumbs up for Dataiku support. 👍
On the other hand, have you ever tried to contact Informatica support? Well, I have, and I could end the comparison right here. Let me just say that it can take weeks, during which you work through different levels of support that try to fob you off with links from the knowledge base (and that is the better case).
Let's focus on the benefits Dataiku offers to users and developers. As the first one I would definitely mention the fast learning curve. Getting a basic understanding of how to work with the available toolset, what the main way of working is, and what kinds of components are available is really easy and straightforward. Our team has onboarded several new developers during the last 3 years, ranging from total juniors up to lead developers. All of them were able to internalize the basic principles within a couple of days and could start working on real tasks. Why is that?
- a great UI/UX which is easy to learn and understand
- a strong yet deliberately limited set of basic data transformation components, good enough to solve the majority of typical data preparation tasks
- a unified way of working across all recipes
Again, I have to compare with Informatica. From my own experience, things that would take me a week in Informatica I can manage in just two days in Dataiku. Why? It is the combination of everything I've already mentioned: the different level of abstraction, the easier way of working with complete datasets between transformations rather than single columns, the great UI, the strong API, the variable handling, and so on. All of that gives me a strong framework to build on. It has a big impact on the whole delivery, because suddenly you are able to deliver more in the same time, which saves money on one side and positively influences team velocity and trust on the other.
Dataiku's collaborative features are another piece in the puzzle of a great tool. Features like discussions attached to each recipe or dataset, with multiple threads like on a discussion forum, make the exchange of information between the various roles working on the project (data engineers, analysts, business users, data scientists) smoother, easier and faster. And because everything is part of the platform, you do not need to dig through emails to find out what colleague X said about this or that piece of logic or implementation detail. Everything is in one place, visible to everyone.
Not a single tool but robust platform
Dataiku DSS is not just a single-purpose tool (ETL or ML/AI). It is a platform providing features from data profiling and data analysis, through data pipeline development, up to building ML models on top of your datasets. Having a single tool which can handle all of that, and handle it well, might be a game changer for many companies.
Smooth and easy integration/support of modern cloud technologies
DSS is a modern tool which works smoothly with the latest technologies from the cloud space. Over the years we have used it with many execution engines (Spark, Impala, MapReduce), databases (Snowflake, Athena) and big data technologies like Hive or HDFS. Add to that coding support for Python, R, SQL and shell scripts. What else might you need for your data projects?
Dataiku continuously improves the tool. We started with version 5.0 almost 3 years ago, and version 9.0 has recently been introduced. Dataiku actively listens to feedback from users and the community and improves the tool based on it. I published a list of my pitfalls when using Snowflake in DSS more than a year ago, and some of them have already been addressed. I can clearly see that every release brings a bunch of improved or newly introduced features; for instance, the latest release contains experimental streaming support and an improved deployment process between environments.
Of course there are also limitations and drawbacks to DSS. Nothing is 100% crystal clear and perfect. The first thing I would mention is that DSS is not built to support the complicated delta loads which you can smoothly implement in Informatica. I am not saying it can't be done, but it just won't be as elegant and easy. Visual recipes allow you to append data to the target instead of overwriting it, but you can't define any complex logic for handling inserts/deletes/updates. Basically you do it via SQL code recipes, where you write your merge/upsert logic manually.
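For illustration, this is roughly what the body of such a hand-written SQL code recipe could look like on Snowflake. The schema, table and column names are invented for this sketch; I keep it here as a Python string, the way you might template it before pushing it into a recipe.

```python
# Hypothetical delta-load logic for a SQL code recipe on Snowflake:
# deletes flagged rows, updates matches, inserts genuinely new rows.
MERGE_SQL = """
MERGE INTO dwh.customers AS tgt
USING staging.customers_delta AS src
    ON tgt.customer_id = src.customer_id
WHEN MATCHED AND src.is_deleted THEN DELETE
WHEN MATCHED THEN UPDATE SET
    name = src.name,
    updated_at = src.updated_at
WHEN NOT MATCHED AND NOT src.is_deleted THEN INSERT
    (customer_id, name, updated_at)
    VALUES (src.customer_id, src.name, src.updated_at)
"""
```

It works, but you maintain this logic yourself for every delta-loaded table; a visual recipe gives you none of it.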
If you have a large DWH with tens or hundreds of data pipelines, it becomes too hard to orchestrate everything via scenarios and triggers, because you can have a scenario waiting for 2 or more datasets which need to be built in advance by two different scenarios, as in the example below. Imagine you have project C whose input datasets are A1, shared from project A, and B2, shared from project B. Those need to be refreshed first, and only then can you run the scenario in project C. Defining a scenario trigger for this is tricky, and that is just an easy example. We have much more complicated dependencies of the same kind, where one scenario in project X waits for tens of datasets from tens of other projects. We were getting lost in orchestration with just 10+ projects and the mutual dependencies between them.
Thanks to the Dataiku API we have built a custom orchestration tool which calculates the readiness of each input dataset for each scenario; once all input datasets are "ready", meaning they have already been refreshed, the scenario is triggered by the orchestration tool. That is the basic idea behind it; of course, over the years the solution has been improved and nowadays it handles much more.
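The core readiness check can be sketched in a few lines of Python. The function and data shapes here are illustrative, not our actual implementation; in reality the last build timestamps come from the DSS API rather than a hard-coded dict.

```python
from datetime import datetime

def scenario_ready(input_last_builds, window_start):
    """A scenario may run once every one of its input datasets has been
    rebuilt after the start of the current load window."""
    return all(ts is not None and ts >= window_start
               for ts in input_last_builds.values())

# Last successful build per input dataset (fetched via the API in reality):
builds = {
    "PROJECT_A.A1": datetime(2021, 5, 1, 6, 30),
    "PROJECT_B.B2": datetime(2021, 5, 1, 7, 10),
}
print(scenario_ready(builds, datetime(2021, 5, 1, 6, 0)))  # True: both refreshed
```

An orchestrator loops over all scenarios, runs this check, and triggers the ready ones; cross-project dependencies become data instead of hand-crafted triggers.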
Deployments suffer from the same issue as orchestration. Consider having tens or hundreds of projects with mutual dependencies, where datasets are shared between them. A schema change in one project means you have to update and deploy all dependent projects where that dataset is used. This becomes complicated in large setups. To solve it, we have developed a custom web app directly in DSS which we use for deployments. It automatically checks whether a schema change requires deploying other projects which use the same (updated) dataset and notifies you to include them in the deployment. Then, with a single click, the app prepares the project bundles and automatically deploys them to the automation node behind the scenes. No more manual bundle creation, export and import for each of them.
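A toy version of that impact check could look like this; the dependency map and schemas are invented, and in the real app they come from the DSS API rather than literals.

```python
def impacted_projects(changed_dataset, new_columns, dependencies, schemas):
    """Projects that consume changed_dataset and still carry a stale copy
    of its schema, i.e. the ones that must join the same deployment."""
    return sorted(
        project for project, inputs in dependencies.items()
        if changed_dataset in inputs
        and schemas.get((project, changed_dataset)) != new_columns
    )

# Which projects use which shared datasets, and the schema each project
# was last deployed with (all names invented for this sketch):
deps = {"PROJ_C": ["PROJ_A.A1", "PROJ_B.B2"], "PROJ_D": ["PROJ_A.A1"]}
schemas = {
    ("PROJ_C", "PROJ_A.A1"): ["id", "name"],
    ("PROJ_D", "PROJ_A.A1"): ["id", "name", "created_at"],
}
print(impacted_projects("PROJ_A.A1", ["id", "name", "created_at"], deps, schemas))
# ['PROJ_C']
```

PROJ_D already carries the new column, so only PROJ_C needs to be bundled and deployed alongside the changed project.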
Dataiku introduced improved deployment handling in version 9.0, which is definitely the right move, but I still believe that our custom tool, which we have been improving and tweaking over the last three years, will serve us better.
Wow, this is longer than I expected. I've tried to provide an overview of how well Dataiku can serve as the main or only ETL/ELT tool for your next cloud data project. The big benefit of Dataiku is that you don't get just an ETL tool but a complete platform for your data which can be used by all participating roles. There are some limitations, which I have tried to address. All in all, I would definitely recommend at least including Dataiku in your decision process and evaluating whether it can serve your needs, as it has served ours for years already.