Snowflake might just be one of the most exciting unicorns in data analytics today. Marketed as ‘the only data warehouse built for the cloud’, the database-as-a-service is built on top of the Amazon Web Services APIs. It takes advantage of recent advances in high-performance analytics, with close to no overhead or configuration.

In this article, we walk through the full workflow: staging the data on Amazon Web Services, loading it into a reporting database, running the actual queries, and analyzing the output. The full process took us only a few hours.

Mark Litwintschik maintains an interesting industry benchmark of web-scale data warehousing products, but we felt Snowflake was missing from it. We decided to apply his methodology to Snowflake.

The test is based on the NYC taxi-rides dataset, a publicly available corpus containing a record of every single taxi ride in New York since 2009. That amounts to a whopping 2 billion records, perfect for putting Snowflake’s Enterprise Edition to the test.

Shortcut to prepare the data

The raw data, as provided by the taxi companies, doesn’t tell the full story. Analytics become more valuable once the geo data is annotated. Mark has an excellent article in which he describes how to process the raw data using the PostGIS extension for PostgreSQL. Feel free to read the details over at his blog.

Staging the data on AWS S3

Snowflake’s marketing tells us it’s native to the cloud, and that starts with the staging step. Our preferred way to load data into Snowflake is to store it in an S3 bucket on an AWS account you control, so you can take advantage of parallel loading. Connecting your Snowflake instance to S3 is as straightforward as it is secure: you create a dedicated IAM role in AWS and grant it cross-account access to your Snowflake instance. On the Snowflake side, you set up a stage – it looks like any regular database table but is prefixed with an @ – in the Snowflake SQL editor.
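A minimal sketch of what that stage definition looks like; the bucket path, role ARN, and stage name below are placeholders for your own setup:

    -- Stage pointing at the S3 bucket; the role ARN is the IAM role you
    -- granted cross-account access to (placeholder values).
    CREATE STAGE taxi_stage
      URL = 's3://your-bucket/nyc-taxi/'
      CREDENTIALS = (AWS_ROLE = 'arn:aws:iam::123456789012:role/snowflake-loader');

    -- Quick sanity check: list the files visible through the stage.
    LIST @taxi_stage;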

Mark’s process yields almost half a terabyte of CSV files to ingest. We have no use for row-based data in this case, so we converted the raw CSV files into Parquet, our preferred columnar format for analysis at scale. Because Parquet is a splittable file format, we broke the dataset down into 108 chunks of about 800 MB each, so Snowflake could load the data in parallel.

Set up a load cluster and run the actual data load

With the cross-account authorization in place, we were ready to load data. The first step is to set up a processing cluster – a warehouse, in Snowflake terms – that reads data from your S3 stage and loads it into the database. Just like any other maintenance task, you run this with a SQL command:
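A sketch of the warehouse definition we used; the name load_wh is a placeholder, and the size and timeout are whatever suits your budget:

    -- Extra-large warehouse that suspends itself after 5 minutes of inactivity
    -- (AUTO_SUSPEND is expressed in seconds).
    CREATE WAREHOUSE load_wh
      WITH WAREHOUSE_SIZE = 'XLARGE'
           AUTO_SUSPEND = 300
           AUTO_RESUME = TRUE
           INITIALLY_SUSPENDED = TRUE;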

An extra-large warehouse costs 16 credits per hour, billed per second. One credit corresponds to 8 cores running in parallel, so we have 128 cores at our disposal. Keep in mind the auto_suspend parameter, which shuts the cluster down after 5 minutes of inactivity.

Let’s pause here for a second – we’ve just set up an auto-scaling data warehouse on AWS with no further overhead by executing a few SQL commands!

Part of the setup is loading the Parquet data as-is into a single column of a database table, which gives us the flexibility to work with the document-style representation afterward.
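A sketch of that staging load; the file format, table, and stage names are placeholders matching the stage created earlier:

    -- File format for the Parquet chunks.
    CREATE FILE FORMAT parquet_format TYPE = PARQUET;

    -- One VARIANT column holds each Parquet record as a semi-structured document.
    CREATE TABLE taxi_raw (v VARIANT);

    -- Parallel load of all 108 chunks from the stage.
    COPY INTO taxi_raw
      FROM @taxi_stage
      FILE_FORMAT = (FORMAT_NAME = 'parquet_format')
      PATTERN = '.*\.parquet';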

Command + Enter fires off the string of queries, so you can take a quick coffee break while the cluster ingests the data and converts it to Snowflake’s internal micro-partition format. The 2.1 billion rows were processed in about 18 minutes.

[Screenshot: load history showing the 2.1 billion rows loading in about 18 minutes]

An earlier load of the same base data, using a warehouse with half the power, took about 36 minutes. That’s almost linear performance scaling, thanks in part to the chunked Parquet files.

[Screenshot: load history for the earlier 36-minute load]

The VARIANT data type is quite convenient for data explorers and data scientists to work with, as the original document structure is kept as-is. For reporting, however, we prefer the traditional, column-based table structure.

Converting from the document structure to the column structure comes down to running this query:
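An abbreviated sketch of that conversion; the column names follow the taxi dataset, and the casts shown here are only a handful of the columns in the full statement:

    -- Flatten the VARIANT documents into a conventional reporting table
    -- (abbreviated; see the full query on GitHub).
    CREATE TABLE trips AS
    SELECT
      v:trip_id::INTEGER           AS trip_id,
      v:cab_type::STRING           AS cab_type,
      v:passenger_count::INTEGER   AS passenger_count,
      v:pickup_datetime::TIMESTAMP AS pickup_datetime,
      v:trip_distance::FLOAT       AS trip_distance,
      v:total_amount::FLOAT        AS total_amount
      -- ... remaining columns omitted for brevity
    FROM taxi_raw;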

The full version of the query is available on our GitHub page.

The conversion took a few more minutes of processing, this time on a medium-sized warehouse.

[Screenshot: the conversion query completing on a medium warehouse]

After running all the benchmark queries consecutively, we could see their stats in the query history browser. These are the results of the first – and only – run of each query against the newly loaded dataset. That’s worth mentioning, because Snowflake caches query results for 24 hours by default. Running any of them again – even the largest – and retrieving the output from cache takes about 130 milliseconds per query.
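For reference, the four aggregation queries in Mark’s benchmark methodology look roughly like this, assuming the reporting table is named trips; if you want to time repeated runs yourself, the session parameter USE_CACHED_RESULT can be switched off first:

    -- Disable the result cache for honest repeat measurements.
    ALTER SESSION SET USE_CACHED_RESULT = FALSE;

    -- Query 1: ride count per cab type.
    SELECT cab_type, COUNT(*) FROM trips GROUP BY cab_type;

    -- Query 2: average fare by passenger count.
    SELECT passenger_count, AVG(total_amount) FROM trips GROUP BY passenger_count;

    -- Query 3: rides per passenger count and year.
    SELECT passenger_count, YEAR(pickup_datetime), COUNT(*) FROM trips GROUP BY 1, 2;

    -- Query 4: rides per passenger count, year and distance, most frequent first.
    SELECT passenger_count, YEAR(pickup_datetime), ROUND(trip_distance), COUNT(*)
    FROM trips GROUP BY 1, 2, 3 ORDER BY 2, 4 DESC;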

[Screenshot: query history showing the benchmark query runtimes]

Here comes the best part. If you don’t own a Snowflake instance yourself yet, don’t worry. Leave your details in the form below, and we’ll set you up with $400 USD in Snowflake credits for free, valid for one month, together with the instructions to import the database below into the instance.


About the Author:

Founder / Data Architect at Tropos.io