We analyzed 2 Billion Taxi Rides In Snowflake in just a few hours

We built a data model to process large sets of data using SQL and the Snowflake service; and put its performance to a test. In just a few hours.
Joris Van den Borre

Snowflake might just be one of the most exciting unicorns in data analytics nowadays. Marketed as ‘the data cloud’, the database-as-a-service is built on top of Amazon Web Service API’s. It takes advantage of all recent evolutions in high performance analytics, with close to no overhead or configuration.

In this article, we will explain the full workflow, from staging the data on Amazon Web Services, overloading it into a reporting database, to running the actual queries and analyzing the output. The full process took us merely a few short hours.

Mark Litwintschik is keeping track of an interesting industry benchmark of web-scale data warehousing products, but we felt Snowflake was missing. We decided to use his methodology and apply it to Snowflake.

The test is based on the NYC taxi-rides dataset, a publicly-available corpus containing registrations from every single taxi ride in New York as of 2009. This accounts for a whopping 2 billion records, perfect for putting Snowflake’s Enterprise Edition to the test.

Let’s take a shortcut to prepare the data

The raw data, as provided by the taxi companies, isn’t telling the full story. Analytics would be more valuable if we could annotate the geo data. Mark has an excellent article in which he describes how to processes the raw data using the PostGIS extension for Postgresql. Feel free to read the details over at his blog.

Now, we build a stage on AWS S3

Snowflake’s marketing tells us they’re native to the cloud. It’s clear to us that this starts from the staging step. Our preferred way to load data in Snowflake is to store it in a S3 bucket on an AWS account you control, so you can take advantage of parallel loading. Connecting your Snowflake instance to S3 is as straightforward as it is secure. You create a dedicated IAM role in AWS and grant it cross-account access to your Snowflake instance. The other way around, you’d set up a stage – it looks like any regular database table but is prefixed with an @ – in the Snowflake SQL editor.

create or replace stage aws_taxirides_stage url=’s3://tropos-taxirides-data/stage’ credentials = (aws_role = ‘arn:aws:iam::XXXXXXXXXXX:role/SnowflakeStageRole’);

Mark’s process yields almost half a terabyte of CSV files for us to ingest. We don’t have any use for the row-based data in this case. One thing we did was to convert the raw CSV files at the end into Parquet. We prefer the columnar format for analysis at scale. As Parquet is a splittable file format, we broke the dataset down into 108 chunks of about 800Mb each, so we could use parallel data loading in Snowflake.

Set up a load cluster and run the actual data load

Once the cross-account authorization is set up, we were good to go and load data. The first thing to do is to set up a processing cluster that will read data from your S3 stage and load it into the database. Just like any other maintenance task, you’d run this using a SQL command:

ALTER WAREHOUSE “ETL_WH” SET WAREHOUSE_SIZE = ‘XLARGE’ AUTO_SUSPEND = 5 MIN_CLUSTER_COUNT = 1 MAX_CLUSTER_COUNT = 3 SCALING_POLICY = ‘STANDARD’;

An extra large warehouse costs 16 credits per hour, billed per second. One credit equals a parallel use of 8 cores, so we have 128 cores at our disposal. Please keep in mind the auto_suspend parameter, which turns off the cluster after 5 minutes of inactivity.

Let’s pause here for a second – we’ve just set up an auto-scaling data warehouse on AWS with no further overhead by executing a few SQL commands!

Part of the setup includes loading the parquet data as-is into one column of a database table, that’ll give us the flexibility to work with the document type representation afterward.

use warehouse ETL_WH; create table taxirides.public.TEST ( rawdata variant; ) copy into taxirides.public.TEST from @aws_taxirides_stage FILE_FORMAT = ( type = “PARQUET”);

Command + enter fires off the string of queries, so you can have a quick coffee break while the cluster ingests the data and converts it to Snowflake’s internal micro partition format. The 2.1 billion rows were processed in about 18 minutes.

Screen Shot on 2018-08-17 at 17:48:04.png

An earlier load of the same base data, using a warehouse half the power, loaded in about 36 minutes. That’s almost linear performance scaling, partly of course thanks to the chunked Parquet file.

Screen Shot on 2018-08-12 at 16:09:06.png

Convert the table to a more convenient format

The variant data type is quite a convenient format for data explorers and data scientists to work with, as the original document structure is kept as-is. However, for reporting, we prefer the traditional, column-based table structure.

Converting from the document structure to the column structure comes down to running this query:

CREATE OR REPLACE TABLE ridehistory ( vendor_id string, pickup_datetime timestamp, … ) AS SELECT $1:vendor_id::string, $1:pickup_datetime::timestamp, … FROM taxirides.public.test;

The data load took a few more minutes of processing, this time on a medium cluster.

Screen Shot on 2018-08-12 at 17:22:15.png

And now for the benchmarking

Query 1: A simple count(*)

SELECT cab_type, count(*) FROM “TAXIRIDES”.”PUBLIC”.”RIDEHISTORY” GROUP BY cab_type;

Query 2: An average by group

SELECT passenger_count, avg(total_amount) FROM “TAXIRIDES”.”PUBLIC”.”RIDEHISTORY” GROUP BY passenger_count;

Query 3: A count, grouped by an in-line date function

SELECT passenger_count, year(pickup_datetime), count(*) FROM “TAXIRIDES”.”PUBLIC”.”RIDEHISTORY” GROUP BY passenger_count, year(pickup_datetime);

Query 4: Adding all formulae together

SELECT passenger_count, year(pickup_datetime) trip_year, round(trip_distance), count(*) trips FROM “TAXIRIDES”.”PUBLIC”.”RIDEHISTORY” GROUP BY passenger_count, year(pickup_datetime), round(trip_distance) ORDER BY trip_year, trips desc;

The results

After running all the queries consecutively, we were able to see their stats in the query history browser. This is the result of the first – and only – run of the query on the newly-loaded dataset. This is worth mentioning, as Snowflake caches the output of queries ran for 24 hours by default. Running them again, even the largest, and retrieving the output from cache takes about 130 milliseconds per query.

Screen Shot on 2018-08-12 at 17:30:11.png

CONCLUSION

The major takeaway here is that we build a full data analytics platform dealing with structured and semi-structured data, but used it as-a-service.

For a running cost of less than 10 Euro’s (time to experiment included), we ploughed through 2 billion records and got representative business statistics from it.

And we did it in less than an hour.

Try the test yourself on our database

Through their Data Sharehouse feature, Snowflake allows us to share a secured view of any of our databases with anyone else’s Snowflake account. We’re opening up the integrated version of the taxi-rides dataset we used for this test for querying through your own Snowflake account.

The query below will connect your Snowflake account to our database:

create database tropos_taxirides from share tropos.taxirides_shared;

/* Grant privileges on the database to other roles (e.g. SYSADMIN) in your account. */
grant imported privileges on database tropos_taxirides to sysadmin;

If you don’t currently own a Snowflake account, we can offer a free trial account through our partnership. At the time of writing, the trial runs for 30 days during which you can spend $400 in credits.

Joris Van den Borre

Founder, CEO and solutions architect

In just 6 weeks, Jacob had the opportunity to learn and grow through a series of courses designed to equip him with the skills and knowledge necessary to succeed in the data industry.

Revisiting my 6 weeks onboarding training

If you’re working in a hands-on data role using Snowflake, Databricks, or Bigquery, chances are you’ve encountered dbt as a companion technology. 🎉 On April 3rd, 2023, dbt Labs announced that Tropos.io became one of the 5 premier partners worldwide.

Exclusive! We Are Excited To Be A Dbt Premier Partner in 2023

The how-to guide to interpreting Snowflake's usage-based pricing model.

Cookie	Duration	Description
__hssrc	session	This cookie is set by Hubspot whenever it changes the session cookie. The __hssrc cookie set to 1 indicates that the user has restarted the browser, and if the cookie does not exist, it is assumed to be a new session.
cookielawinfo-checkbox-advertisement	1 year	Set by the GDPR Cookie Consent plugin, this cookie is used to record the user consent for the cookies in the "Advertisement" category .
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
CookieLawInfoConsent	1 year	Records the default button state of the corresponding category & the status of CCPA. It works only in coordination with the primary cookie.
elementor	never	This cookie is used by the website's WordPress theme. It allows the website owner to implement or change the website's content in real-time.
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

Cookie	Duration	Description
__cf_bm	30 minutes	This cookie, set by Cloudflare, is used to support Cloudflare Bot Management.
__hssc	30 minutes	HubSpot sets this cookie to keep track of sessions and to determine if HubSpot should increment the session number and timestamps in the __hstc cookie.
bcookie	2 years	LinkedIn sets this cookie from LinkedIn share buttons and ad tags to recognize browser ID.
bscookie	2 years	LinkedIn sets this cookie to store performed actions on the website.
lang	session	LinkedIn sets this cookie to remember a user's language setting.
lidc	1 day	LinkedIn sets the lidc cookie to facilitate data center selection.
UserMatchHistory	1 month	LinkedIn sets this cookie for LinkedIn Ads ID syncing.

Cookie	Duration	Description
__hstc	5 months 27 days	This is the main cookie set by Hubspot, for tracking visitors. It contains the domain, initial timestamp (first visit), last timestamp (last visit), current timestamp (this visit), and session number (increments for each subsequent session).
_ga	2 years	The _ga cookie, installed by Google Analytics, calculates visitor, session and campaign data and also keeps track of site usage for the site's analytics report. The cookie stores information anonymously and assigns a randomly generated number to recognize unique visitors.
_ga_ZET6HEX39B	2 years	This cookie is installed by Google Analytics.
_gat_gtag_UA_75663021_2	1 minute	Set by Google to distinguish users.
_gat_UA-75663021-2	1 minute	A variation of the _gat cookie set by Google Analytics and Google Tag Manager to allow website owners to track visitor behaviour and measure site performance. The pattern element in the name contains the unique identity number of the account or website it relates to.
_gid	1 day	Installed by Google Analytics, _gid cookie stores information on how visitors use a website, while also creating an analytics report of the website's performance. Some of the data that are collected include the number of visitors, their source, and the pages they visit anonymously.
CONSENT	2 years	YouTube sets this cookie via embedded youtube-videos and registers anonymous statistical data.
hubspotutk	5 months 27 days	HubSpot sets this cookie to keep track of the visitors to the website. This cookie is passed to HubSpot on form submission and used when deduplicating contacts.
undefined	never	Wistia sets this cookie to collect data on visitor interaction with the website's video-content, to make the website's video-content more relevant for the visitor.

Cookie	Duration	Description
_fbp	3 months	This cookie is set by Facebook to display advertisements when either on Facebook or on a digital platform powered by Facebook advertising, after visiting the website.
fr	3 months	Facebook sets this cookie to show relevant advertisements to users by tracking user behaviour across the web, on sites that have Facebook pixel or Facebook social plugin.
VISITOR_INFO1_LIVE	5 months 27 days	A cookie set by YouTube to measure bandwidth that determines whether the user gets the new or old player interface.
YSC	session	YSC cookie is set by Youtube and is used to track the views of embedded videos on Youtube pages.
yt-remote-connected-devices	never	YouTube sets this cookie to store the video preferences of the user using embedded YouTube video.
yt-remote-device-id	never	YouTube sets this cookie to store the video preferences of the user using embedded YouTube video.
yt.innertube::nextId	never	This cookie, set by YouTube, registers a unique ID to store data on what videos from YouTube the user has seen.
yt.innertube::requests	never	This cookie, set by YouTube, registers a unique ID to store data on what videos from YouTube the user has seen.

Cookie	Duration	Description
AnalyticsSyncHistory	1 month	No description
li_gc	2 years	No description
loglevel	never	No description available.