Snowflake for BigQuery users – part 1

In part 1 of 3 on Snowflake for BigQuery users, we consider compute, data storage and data loading/unloading.
Koen Verschaeren

I joined Tropos.io to focus on data in the cloud with a modern data stack.
For the last 3 years, I’ve been working in Google Cloud environments with BigQuery as the main tool for data lake, data engineering, data warehousing and data sharing workloads. At Tropos.io, depending on the requirements, these workloads are often developed on Snowflake, the Data Cloud, running on the cloud platform selected by the customer.

In this series, I’m focusing on the key differences between Snowflake and BigQuery that have an impact on the data pipeline, table design, performance and costs. It is by no means a detailed product comparison or benchmark.

Part 1: compute, data storage and data loading/unloading
Part 2: data sharing and multi-cloud capabilities
Part 3: unstructured data, data masking & data security features and SQL support

Data storage & compute

Both platforms store data in native tables in proprietary formats on an internal storage layer.
From a data storage point of view, the main difference is that BigQuery bills the number of bytes stored uncompressed while Snowflake charges for the compressed amount. Pricing per TB for both solutions is close to distributed storage pricing. Depending on the data types and the data itself, the compressed size can be 60–80 % smaller than the uncompressed size.
Google BigQuery charges less for partitions or tables that have not been updated in the last 90 days, while the data remains immediately available. Snowflake charges for the current storage plus any storage used by other features such as time travel and fail-safe. Transient tables can be used to avoid fail-safe storage costs, but make sure you have the data available in another service in case a reload is required.
Snowflake and BigQuery can access data in distributed storage as external tables. For most workloads, the data storage costs are a small part of the running costs so try to avoid premature optimization.
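To make the effect of compressed versus uncompressed billing concrete, here is a minimal sketch. The price per TB, the data volume and the 4:1 compression ratio are hypothetical inputs chosen for illustration, not list prices of either platform.

```python
def monthly_storage_cost(uncompressed_tb: float, price_per_tb: float,
                         compression_ratio: float = 1.0) -> float:
    """Monthly cost when the billed volume is uncompressed_tb / compression_ratio."""
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    billed_tb = uncompressed_tb / compression_ratio
    return billed_tb * price_per_tb


# Hypothetical inputs: 10 TB of raw data, the same price per TB on both
# platforms, and data that compresses 4:1 (75 % smaller once compressed).
PRICE_PER_TB = 20.0
billed_uncompressed = monthly_storage_cost(10, PRICE_PER_TB)       # billed on raw bytes
billed_compressed = monthly_storage_cost(10, PRICE_PER_TB, 4.0)    # billed on compressed bytes
saving = 1 - billed_compressed / billed_uncompressed

print(billed_uncompressed, billed_compressed, f"{saving:.0%}")
```

With these assumed numbers the compressed bill is a quarter of the uncompressed one, which is why the compression ratio of your own data matters more than small differences in the price per TB.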

Snowflake and BigQuery fully separate storage and compute. The way compute is exposed and billed is very different.

On the Snowflake platform, each SQL query runs in a “virtual warehouse”. A virtual warehouse is a cluster of servers whose size is expressed as a “T-shirt size”. The number of servers in a cluster doubles with each T-shirt size.

Ref. https://docs.snowflake.com/en/user-guide/warehouses-overview.html#warehouse-size
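The doubling per T-shirt size can be sketched as a small lookup. The list of size names and the baseline of 1 unit for the smallest size are assumptions for illustration; check the page referenced above for the sizes and credit rates that apply to your account.

```python
# Size names in increasing order; each step doubles the capacity of the previous one.
SIZES = ["XSMALL", "SMALL", "MEDIUM", "LARGE", "XLARGE", "XXLARGE"]


def relative_capacity(size: str) -> int:
    """Relative cluster capacity, with XSMALL = 1 and each larger size doubling."""
    return 2 ** SIZES.index(size.upper())


for size in SIZES:
    print(f"{size:8s} {relative_capacity(size):3d}x")
```

Going from MEDIUM to LARGE therefore doubles both the available compute and the rate at which credits are consumed while the warehouse runs.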

The size of a virtual warehouse has an impact on the number of files that can be processed and query execution performance.
Query performance can scale close to linearly with warehouse size but, as in any distributed system, this depends a lot on the data volumes, the query operations and the number of partitions you are processing. Snowflake tends to feel a bit faster on small and mid-sized tables with queries that execute in less than a second.
If you want to increase query concurrency, Snowflake can easily scale out by adding a number of extra clusters. Billing is done per second, with a 1-minute minimum.
Virtual warehouses start almost instantly, typically within a second or two, so it is possible and recommended to use auto-resume and auto-suspend. The number of credits you burn depends on the warehouse size and the time the warehouse has been running.
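Per-second billing with a 1-minute minimum can be sketched as follows. The function names and the credits-per-hour values are hypothetical; the point is that a 10-second run after a resume is billed as 60 seconds, while anything longer is billed for its actual duration.

```python
def billed_seconds(run_seconds: int, minimum_seconds: int = 60) -> int:
    """Seconds billed for one start-to-suspend run of a warehouse."""
    if run_seconds <= 0:
        return 0
    return max(run_seconds, minimum_seconds)


def credits_used(run_seconds: int, credits_per_hour: float) -> float:
    """Credits consumed by one run, given the hourly rate of the warehouse size."""
    return billed_seconds(run_seconds) * credits_per_hour / 3600


print(billed_seconds(10))     # short run: the 1-minute minimum applies
print(billed_seconds(90))     # longer run: billed per second
print(credits_used(1800, 4))  # 30 minutes on a warehouse that burns 4 credits/hour
```

Because the minimum applies every time a warehouse resumes, a very aggressive auto-suspend setting on a warehouse that is woken up constantly can cost more than letting it run slightly longer.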

With the default on-demand pricing, BigQuery bills purely on the number of uncompressed bytes each query processes. The available compute is allocated by a fair scheduling algorithm. Each GCP project has access, by default, to about 2000 query slots. A slot is a unit of compute and memory available for query processing.

Ref. https://cloud.google.com/bigquery/docs/slots#fair_scheduling_in_bigquery
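On-demand billing by bytes processed comes down to simple arithmetic. In the sketch below the price per TiB is a hypothetical parameter, and measuring in TiB (2^40 bytes) is an assumption; take the current rate and unit from the BigQuery pricing page.

```python
TIB = 2 ** 40  # bytes in one tebibyte


def on_demand_cost(bytes_processed: int, price_per_tib: float) -> float:
    """Cost of a single query billed on the uncompressed bytes it processes."""
    if bytes_processed < 0:
        raise ValueError("bytes_processed must not be negative")
    return bytes_processed / TIB * price_per_tib


# Hypothetical rate of 5.0 per TiB: a query that scans half a TiB.
print(on_demand_cost(TIB // 2, 5.0))
```

The cost depends only on how much data the query touches, not on how long it runs, which is why table design matters so much more for cost in BigQuery than warehouse sizing does.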

If your BigQuery workloads are too expensive with the on-demand pricing model or you want to have predictable costs, it’s possible to switch to slot-based flat-rate pricing. You can reserve and assign a number of slots to a number of projects permanently or assign flexible slots for a shorter duration, with a minimum of 1 minute. Check the flat-rate pricing for more information.

Snowflake forces you to manually size and assign a virtual warehouse.
This capability is a game-changer in real life because you can easily separate different workloads, for example, a long-running data transformation job versus ad-hoc light dashboard queries, by assigning different warehouse sizes and auto-scaling multi-cluster warehouses for each connection. Warehouses can be started and suspended using SQL so it’s easy to integrate into any data pipeline. Check out our blog post about right-sizing your virtual warehouse size.
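As a sketch of what starting and suspending warehouses using SQL can look like from a pipeline, the helpers below only build the statements as strings. The warehouse names and parameter values are hypothetical, and executing the statements through your own Snowflake connection is left out; multi-cluster settings also depend on your Snowflake edition.

```python
import re

_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_$]*")


def _ident(name: str) -> str:
    """Accept only simple unquoted identifiers to keep the generated SQL safe."""
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return name


def create_warehouse_sql(name: str, size: str = "XSMALL", auto_suspend_s: int = 60,
                         min_clusters: int = 1, max_clusters: int = 1) -> str:
    """CREATE WAREHOUSE statement with auto-suspend, auto-resume and cluster limits."""
    return (
        f"CREATE WAREHOUSE IF NOT EXISTS {_ident(name)} WITH "
        f"WAREHOUSE_SIZE = '{_ident(size)}' "
        f"AUTO_SUSPEND = {int(auto_suspend_s)} AUTO_RESUME = TRUE "
        f"MIN_CLUSTER_COUNT = {int(min_clusters)} MAX_CLUSTER_COUNT = {int(max_clusters)}"
    )


def resume_sql(name: str) -> str:
    return f"ALTER WAREHOUSE {_ident(name)} RESUME IF SUSPENDED"


def suspend_sql(name: str) -> str:
    return f"ALTER WAREHOUSE {_ident(name)} SUSPEND"


# Hypothetical split: a small multi-cluster warehouse for dashboards,
# a larger single-cluster one for the long-running transformation job.
print(create_warehouse_sql("reporting_wh", "XSMALL", 60, 1, 3))
print(create_warehouse_sql("transform_wh", "LARGE", 120))
print(resume_sql("transform_wh"))
print(suspend_sql("transform_wh"))
```

Each connection or job then selects its own warehouse, so a heavy transformation never competes with dashboard queries for the same compute.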

In BigQuery, you can improve query concurrency by setting up separate projects, and you can get predictable costs and performance by provisioning a specific number of slots. Committing to 2000 slots for each project, to match on-demand performance and query concurrency, can be an issue from a cost point of view.
Running a certain workload in fixed-price mode is possible with flexible slots, but I’ve not seen this used for operational/management reporting: the current reporting tools don’t support calling the REST API before running a set of SQL queries. It is feasible for batch data pipelines, where a job orchestrator can call the reservation API before and after the query.
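The orchestrator pattern for batch pipelines can be sketched as a context manager that reserves slots before the work and always releases them afterwards. The `acquire` and `release` callables are hypothetical placeholders for whatever your orchestrator uses to call the reservation API; no real API call is shown here.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, List


@contextmanager
def temporary_slots(acquire: Callable[[int], None],
                    release: Callable[[], None],
                    slots: int) -> Iterator[None]:
    """Reserve slots for the duration of a block and release them even on failure."""
    acquire(slots)
    try:
        yield
    finally:
        release()


# Stand-ins that only record what would happen, in order.
events: List[str] = []
with temporary_slots(lambda n: events.append(f"reserve {n}"),
                     lambda: events.append("release"),
                     500):
    events.append("run query")

print(events)
```

Releasing in a `finally` block matters here: a failed query should not leave paid-for slots reserved until someone notices.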

Query optimization & caching

The differences in pricing, probably triggered by architecture choices in the storage and query layer, have an impact on the performance optimization features.

Snowflake aims to be as hands-off as possible. In most use cases, on tables smaller than a few TBs, no manual optimization is required. For large tables, clustering keys can be specified if the default way of working doesn’t match the type of queries. Filters on columns with high cardinality can be optimized using the search optimization service. Reclustering is done automatically in the background.

BigQuery requires a bit more upfront design for large tables. It’s essential to adapt the table partitioning and clustering to the most popular queries to keep the number of bytes that are scanned under control. Additional bytes can be avoided by nesting repeated data. The pitfall is that the majority of the reporting tools, except Looker, don’t support nested data structures and the SQL statements can be complex.
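Why partitioning keeps the scanned bytes under control can be shown with a toy model. The table of 30 daily partitions of 1 GB each is made up for illustration, and the function only mimics partition pruning on a date filter; it is not how BigQuery computes bytes internally.

```python
from datetime import date, timedelta
from typing import Dict


def bytes_scanned(partitions: Dict[date, int], start: date, end: date) -> int:
    """Bytes read when a date filter lets the engine skip non-matching partitions."""
    return sum(size for day, size in partitions.items() if start <= day <= end)


# Hypothetical table: 30 daily partitions of 1 GB each (1 to 30 January 2023).
partitions = {date(2023, 1, 1) + timedelta(days=i): 10 ** 9 for i in range(30)}

full_scan = sum(partitions.values())                                    # no usable filter
last_week = bytes_scanned(partitions, date(2023, 1, 24), date(2023, 1, 30))

print(full_scan, last_week)
```

A dashboard query that filters on the partition column for the last week reads 7 of the 30 partitions, and under on-demand pricing it is billed accordingly.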

Caching in Snowflake is different from BigQuery.
Data read by queries is cached in the virtual warehouse layer as long as the warehouse is running. Tuning the auto-suspend parameter of the virtual warehouse can therefore speed up queries that access similar data. More important is the fact that Snowflake manages the query result cache at the account level. If the underlying data has not changed, a repeated query can be answered from this cache, without running a virtual warehouse, for up to 24 hours.
In BigQuery, caching is done at the user level. More advanced caching can be configured using BI Engine.
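The behaviour of a result cache with a 24-hour window can be illustrated with a toy model. This is a deliberate simplification and not Snowflake's implementation: the class, the `data_version` counter and the time values are all hypothetical, and the real service applies more conditions before it reuses a result.

```python
from typing import Any, Dict, Optional, Tuple

TTL_SECONDS = 24 * 60 * 60  # the 24-hour window


class ResultCache:
    """Toy result cache keyed by query text and a version of the underlying data."""

    def __init__(self) -> None:
        self._entries: Dict[Tuple[str, int], Tuple[float, Any]] = {}

    def put(self, sql: str, data_version: int, result: Any, now: float) -> None:
        self._entries[(sql, data_version)] = (now, result)

    def get(self, sql: str, data_version: int, now: float) -> Optional[Any]:
        entry = self._entries.get((sql, data_version))
        if entry is None:
            return None  # never cached, or the data changed since it was cached
        stored_at, result = entry
        if now - stored_at > TTL_SECONDS:
            return None  # older than the 24-hour window
        return result


cache = ResultCache()
query = "SELECT COUNT(*) FROM orders"
cache.put(query, data_version=1, result=42, now=0)

hit = cache.get(query, data_version=1, now=3600)           # same data, 1 hour later
stale = cache.get(query, data_version=2, now=3600)         # data changed: no reuse
expired = cache.get(query, data_version=1, now=25 * 3600)  # past the 24-hour window

print(hit, stale, expired)
```

Only the first lookup is served from the cache; in the other two cases the query would have to run on a virtual warehouse again.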

Data loading and unloading

Both solutions offer similar ways to load data, the main difference being that loading data into BigQuery in batch is free. Streaming ingestion is possible out of the box using a REST API or with the upcoming BigQuery Storage Write API, and is billed at a rate per MB or GB.

In Snowflake, data ingestion is not free, but plenty of options are available:

Snowpipe: micro-batching with billing per second instead of virtual warehouse pricing
COPY: batch data ingestion from internal and external stages, which requires a virtual warehouse
Apache Kafka connector

Data unloading in BigQuery is free when using the shared compute pool, or costs $1.1 per TB via the BigQuery Storage API.
In Snowflake, data unloading is done using the COPY statement, so a virtual warehouse is required.
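Loading and unloading with the COPY statement are mirror images of each other, which the sketch below shows by building both statements as strings. The table name, stage path and file format are hypothetical, and running the statements still requires a connection and a running virtual warehouse.

```python
def load_sql(table: str, stage_path: str, file_type: str = "CSV") -> str:
    """COPY statement that loads staged files into a table."""
    return f"COPY INTO {table} FROM @{stage_path} FILE_FORMAT = (TYPE = '{file_type}')"


def unload_sql(table: str, stage_path: str, file_type: str = "CSV") -> str:
    """COPY statement that unloads a table into files on a stage."""
    return f"COPY INTO @{stage_path} FROM {table} FILE_FORMAT = (TYPE = '{file_type}')"


# Hypothetical names: an orders table and a stage called my_stage.
print(load_sql("orders", "my_stage/incoming/"))
print(unload_sql("orders", "my_stage/exports/"))
```

The only difference is which side of the statement is the stage and which is the table; in both directions the work is billed through the warehouse that executes it.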

Since the latest Snowflake release, support for data formats and automatic schema detection for ORC, Parquet and Avro is very similar on both platforms, except that BigQuery has no out-of-the-box XML support.

Other costs & cost controls

Next to data storage, virtual warehouses and serverless features such as Snowpipe, Snowflake invoices the use of cloud services if these exceed 10 % of the daily compute costs, as well as any egress costs (if data travels between regions or cloud providers). A detailed view of the costs is available in the web UI and in the ACCOUNT_USAGE schema of the shared SNOWFLAKE database.
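The 10 % rule for cloud services can be worked through with a small function. The daily credit figures are made up for illustration; the function only encodes the rule as described above, where cloud services are billed for the part that exceeds 10 % of that day's compute.

```python
def billed_cloud_services(compute_credits: float, cloud_services_credits: float,
                          threshold: float = 0.10) -> float:
    """Cloud services credits billed for one day under the 10 % rule."""
    free_allowance = threshold * compute_credits
    return max(0.0, cloud_services_credits - free_allowance)


# Hypothetical days with 100 compute credits each.
quiet_day = billed_cloud_services(100, 8)    # below the 10 % allowance: nothing billed
busy_day = billed_cloud_services(100, 15)    # only the excess above 10 credits is billed

print(quiet_day, busy_day)
```

The comparison is made per day, so one day with heavy metadata activity and little warehouse usage can produce a charge even when the monthly average stays well below 10 %.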

BigQuery invoicing is slightly less complex because optimizations that automatically run in the background, time travel storage, the use of the metadata and other platform costs are not charged. A detailed view of the costs per query and the slot reservations is available in the audit tables and billing exports. The UI, API and command-line tools can estimate the costs of a query before it’s executed.

Both Snowflake and BigQuery offer enough measures, such as resource monitors in Snowflake, to monitor costs and set maximum budgets.

Conclusion

Coming from BigQuery, it’s essential to understand the way virtual warehouses can be used in Snowflake to optimize the costs and performance of different workloads. If the data size and the number of queries are small, this is a bit more work compared to BigQuery in the on-demand pricing model.

In environments that combine a lot of queries generated by reporting tools, a number of frequently running data engineering tasks, and less frequent but unplanned data discovery and data preparation workloads for AI, the separation into separate virtual warehouses is a very welcome feature with an immediate impact on user experience and cost management.

Data ingestion in Snowflake is not as real-time and API-driven as BigQuery but enough features, such as Snowpipe, are available to cover most workloads at an affordable cost. At Tropos.io, as a Snowflake partner, we can help your team with designing a cost-efficient, secure and flexible data pipeline.

In the next blog post, I’ll focus on data sharing and multi-cloud capabilities.

Koen Verschaeren

Solutions Architect
