Snowflake might just be one of the most exciting unicorns in data analytics today. Marketed as ‘the only data warehouse built for the cloud’, the database-as-a-service is built on top of Amazon Web Services APIs. It takes advantage of recent advances in high-performance analytics, with close to no overhead or configuration.
In this article, we will explain the full workflow, from staging the data on Amazon Web Services, through loading it into a reporting database, to running the actual queries and analyzing the output. The full process took us only a few hours.
The raw data, as provided by the taxi companies, doesn’t tell the full story. Analytics would be more valuable if we could annotate the geo data. Mark has an excellent article in which he describes how to process the raw data using the PostGIS extension for PostgreSQL. Feel free to read the details over at his blog.
Staging the data on AWS S3
Snowflake’s marketing tells us they’re native to the cloud, and it’s clear to us that this starts with the staging step. Our preferred way to load data into Snowflake is to store it in an S3 bucket on an AWS account you control, so you can take advantage of parallel loading. Connecting your Snowflake instance to S3 is as straightforward as it is secure: you create a dedicated IAM role in AWS and grant it cross-account access to your Snowflake instance. On the Snowflake side, you set up a stage – it looks like any regular database table but is referenced with an @ prefix – in the Snowflake SQL editor.
create or replace stage aws_taxirides_stage
  url = 's3://tropos-taxirides-data/stage'
  credentials = (aws_role = 'arn:aws:iam::XXXXXXXXXXX:role/SnowflakeStageRole');
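Once the stage exists, it’s worth verifying that Snowflake can actually reach the bucket. A quick check – assuming the bucket and IAM role above are in place – is to list the staged files from the same SQL editor:

-- show the files Snowflake can see in the external stage
list @aws_taxirides_stage;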
Mark’s process yields almost half a terabyte of CSV files for us to ingest. We don’t have any use for the row-based data in this case, so we converted the raw CSV files into Parquet; we prefer the columnar format for analysis at scale. As Parquet is a splittable file format, we broke the dataset down into 108 chunks of about 800 MB each, so we could use parallel data loading in Snowflake.
Set up a load cluster and run the actual data load
Once the cross-account authorization was set up, we were ready to load data. The first thing to do is to set up a processing cluster that will read data from your S3 stage and load it into the database. Just like any other maintenance task, you run this with a SQL command:
ALTER WAREHOUSE "ETL_WH" SET
  WAREHOUSE_SIZE = 'XLARGE'
  AUTO_SUSPEND = 300
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 3
  SCALING_POLICY = 'STANDARD';
An extra-large warehouse costs 16 credits per hour, billed per second. One credit equals the parallel use of 8 cores, so we have 128 cores at our disposal. Keep in mind the AUTO_SUSPEND parameter – expressed in seconds – which turns off the cluster after 5 minutes of inactivity.
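The ALTER statement assumes the ETL_WH warehouse already exists. If it doesn’t, a minimal sketch of creating one with the same settings – names and sizes are simply the values used above – looks like this:

create warehouse if not exists ETL_WH
  warehouse_size = 'XLARGE'
  auto_suspend = 300           -- seconds of inactivity before the cluster suspends
  auto_resume = true           -- wake up automatically on the next query
  min_cluster_count = 1
  max_cluster_count = 3
  scaling_policy = 'STANDARD'
  initially_suspended = true;  -- don't consume credits until the first query arrives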
Let’s pause here for a second – we’ve just set up an auto-scaling data warehouse on AWS with no further overhead by executing a few SQL commands!
Part of the setup is loading the Parquet data as-is into a single VARIANT column of a database table; that gives us the flexibility to work with the document-style representation afterward.
use warehouse ETL_WH;

create table taxirides.public.TEST (
  rawdata variant
);

copy into taxirides.public.TEST
  from @aws_taxirides_stage
  FILE_FORMAT = (type = 'PARQUET');
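Before flattening anything, you can already query the semi-structured data in place. A quick sanity check – the vendor_id field name is borrowed from the typed table further down – could look like this:

-- peek into the VARIANT column using Snowflake's path notation
select rawdata:vendor_id::string as vendor_id,
       count(*) as rides
from taxirides.public.TEST
group by 1
order by rides desc;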
Command + enter fires off the sequence of queries, so you can have a quick coffee break while the cluster ingests the data and converts it to Snowflake’s internal micro-partition format. The 2.1 billion rows were processed in about 18 minutes.
An earlier load of the same base data, using a warehouse with half the compute power, finished in about 36 minutes. That’s almost linear performance scaling, thanks in part to the chunked Parquet files.
The last step is to cast the raw VARIANT column into a properly typed, relational table that is ready for reporting:

CREATE OR REPLACE TABLE ridehistory (
  vendor_id string,
  pickup_datetime timestamp,
  …
) AS
SELECT
  $1:vendor_id::string,
  $1:pickup_datetime::timestamp,
  …
FROM taxirides.public.test;
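With the typed table in place, the analysis itself is plain SQL. As a hypothetical first query – using only the two columns shown above – you could count rides per vendor and per year:

-- hypothetical example: ride volume per vendor per year
select vendor_id,
       year(pickup_datetime) as ride_year,
       count(*) as rides
from taxirides.public.ridehistory
group by vendor_id, ride_year
order by ride_year, vendor_id;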