Pitfalls when using Snowflake in Dataiku DSS

I have been using Snowflake DWH as a target DB for more than 7 months. We have been developing data pipelines in Dataiku DSS. During that time I have come across small pitfalls or differences compared to HDFS datasets in terms of functionality. Generally speaking those are mainly things related to handling particular options of different snowflake commands. It might be good to be aware of such things and not be surprised so let’s go through them.

Dataiku has support page which contains all the details in relation to Snowflake support in DSS. There is also info about installation of JDBC driver.

There are limitations and known issues when using Snowflake in DSS. Those are documented by Dataiku in same support page like is linked above.

For instance there are some processors in prepare recipe which are not supported if you want to use In-database (SQL) engine. You can find the list of supported ones here:

If you need to import data from Parquet files you can’t currently use S3 to Snowflake as engine. It is not supported by DSS. Currently you can import only from CSV files. There is workaround how to do it. You need Python for it. I have already written article how to import data from Parquet files into Snowflake and use DSS for it. You can check it if it is your case.

Same approach (Python) could be used also for offloading from Snowflake into Parquet in DSS because again it is not supported by DSS.

If you create new managed dataset in DSS and you want let DSS to create underlaying Snowflake table then such table is created in lower case (test_table). If you also use ${project_key} variable as part of table name then you can end up with table name like PROJECT_KEY_table_name. Meaning combination of upper & lower case.
Unfortunately DSS can’t modify this behavior right now and table is always created with lower case no matter if you use UPPER CASE for dataset name. If you want to change it you have to always go to dataset settings and manually change table name into upper case.

Modify the table name here

In July 2019 I asked about some possibility to define if table should be created in lower case or upper case and Dataiku did not have any ETA for such functionality.

If you want to query table with such name in Snowflake, you have to always enclose the table name into quotes:

SELECT * FROM “PROJECT_KEY_table_name“

Unquoted identifiers are case-insensitive so if you have table name like above and you do not use quotes then Snowflake engine is not able to find your table and you will get error like table does not exist.

I am lazy developer and I do not want to use quotes around identifiers and I also think that it is much better be consistent and use UPPER CASE for identifiers so I always go to dataset settings and manually update the table name into UPPER CASE.

More about Snowflake identifiers in their documentation

I assume Snowflake COPY command is used when you do import or offload from/to file and use S3 to Snowflake as an engine. There is no advanced settings you do not know what parameters DSS use for COPY command which has quite many options to handle different scenarios.

I have come across various situation when I have needed to modify the copy command options for example define FIELD_OPTIONALY_ENCLOSED_BY option to say what is the enclosing character in case column value contains separator as part of data. You can’t do it right now. So again you have to either use Python where you can write the COPY command by yourself or do it directly in Snowflake. Another case was situation when I have had backslash as part of value

I used to use share functionality in DSS for HDFS datasets in case I wanted to use same dataset in different projects. It helps data lineage to have objects defined on place and reused them in so many places as needed. Of course you can also share Snowflake datasets between projects but you have to use same connection in both projects otherwise you can’t use in-database (SQL) engine. All objects in each and every recipe must use same connection to be able to use in-database (SQL) engine. It might be quite logical but it is good to have it in mind when designing warehouses and connections for your project.

In my case we are using different connections with different warehouses for loading data into Snowflake and for data pipeline itself. Just for being able to scale each part differently/separately and also monitor the credit usage separately for each phase. Currently I am not using dataset sharing but define the table again in new project with required connection.

DSS internally handles all „dates“ as timestamp with time zone, so be prepared for that. If you use TIMESTAMP without time zone or TIMESTAMP with different time zone than UTC all your dates will be converted to UTC during processing so it might happen you will see different value in database than you would expect.

I used to have datetime column with Europe/Copenhagen timezone as partition key. So DSS converted my value 2020–01–01 00:00:00 into UTC equivalent: 2019–12–31 23:00:00 which caused data ended up in wrong partition.

Such behavior should be possible to adjust by going into Settings > Connection tab of dataset and uncheck „Read SQL timestamp without timezone as DSS dates“, or changing the „assumed time zone“ to UTC.

Datasets created by DSS as output of recipes will use timestamptz, unless you alter the „table creation sql“ in their Settings > Advanced tab.

I just noticed that Dataiku says in documentation that for time-based partitioning on SQL table, the partitioning column must not be of Date type but it must be string. Maybe I overlooked that when I was struggling with all those time zones. I would save a lot of time, it is really pain 🙂

That’s all for now. I think nothing serious just minor things which is good to know if you plan to use Snowflake in Dataiku DSS which is anyway great tool to work with. 🤟🏻

Lead data engineer @TietoEVRY. Currently obsessed by cloud technologies and solutions in relation to data & analytics. ☁️ ❄️