Snowflake for BigQuery users - Part 2

In part 2/3 on Snowflake for BigQuery users, we consider data sharing and multi-cloud capabilities
Koen Verschaeren

In this series, I focus on the critical differences between Snowflake and BigQuery that impact the data pipeline, table design, performance, and costs. Hence, it is by no means a detailed product comparison or benchmark.

Part 1: virtual warehouses, data storage and data loading/unloading
Part 2: data sharing and multi-cloud capabilities
Part 3: unstructured data, data masking & data security features and SQL support

Account hierarchy

Both platforms organize data in a very similar way.
In a Snowflake environment, you create one or more Snowflake accounts. A Snowflake account is located in one of the major cloud providers (AWS, Azure, or GCP) in a particular region.
A Snowflake account is a container for one or more databases, each with one or more schemas containing tables, views… This is very similar to OLTP databases. A Snowflake organization sits above all the Snowflake accounts.

Ref. https://docs.snowflake.com/en/user-guide/data-lifecycle.html#lifecycle-diagram

BigQuery on GCP is a fully managed service that integrates seamlessly into the GCP resource hierarchy of your GCP account.
As soon as the BigQuery API has been enabled for your project and you have been granted BigQuery roles and permissions in Cloud IAM, you can create one or more BigQuery datasets. A dataset is located either in a multi-region, for example the EU, or in a more specific region. BigQuery has no separate schema layer: tables and views live directly in a dataset.
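
To make the two hierarchies concrete, here is a minimal Python sketch that builds fully qualified table names for both platforms: database.schema.table within a Snowflake account, and project.dataset.table in BigQuery. The helper functions and all object names (SALES_DB, my-project, and so on) are hypothetical and only serve as illustration.

```python
def snowflake_name(database: str, schema: str, table: str) -> str:
    """Fully qualified Snowflake name: database.schema.table."""
    return f"{database}.{schema}.{table}"


def bigquery_name(project: str, dataset: str, table: str) -> str:
    """Fully qualified BigQuery name: project.dataset.table.

    Backticks are needed because project IDs may contain hyphens.
    """
    return f"`{project}.{dataset}.{table}`"


# Hypothetical object names, for illustration only.
print(snowflake_name("SALES_DB", "PUBLIC", "ORDERS"))
print(bigquery_name("my-project", "sales", "orders"))
```

In both cases a table is addressed with a three-part name; the difference is that the first part is a database inside an account for Snowflake and a GCP project for BigQuery.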

Having multiple datasets in different regions in one BigQuery project can be an advantage. Still, it can be a bit tricky because a BigQuery job, such as a query, has to run in the same location as the datasets it references. Furthermore, in the EU, it's very likely that a policy will be in place stating that data can only be located in the EU multi-region or in EU regions.

Data federation & multi-cloud data sharing capabilities

BigQuery can access data in Google Cloud Storage buckets, in MySQL and Postgres databases running on Cloud SQL, and in Bigtable, a Google-managed, HBase-compatible wide-column store. Reading from Cloud SQL and Bigtable is a powerful feature because it makes it possible to write a large part of the data pipeline in SQL.
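
As an illustration of such a federated query, the sketch below uses Python to assemble a BigQuery statement around the EXTERNAL_QUERY table function, which pushes a query down to a Cloud SQL database through a connection. The helper function, the connection ID, and the remote table are hypothetical; a real connection first has to be created in BigQuery.

```python
def federated_query(connection_id: str, remote_sql: str) -> str:
    """Wrap a Cloud SQL query in BigQuery's EXTERNAL_QUERY table function."""
    # A triple-quoted SQL literal lets the remote query contain single quotes.
    return (
        "SELECT *\n"
        f"FROM EXTERNAL_QUERY('{connection_id}',\n"
        f"  '''{remote_sql}''');"
    )


# Hypothetical connection ID and remote table, for illustration only.
sql = federated_query(
    "my-project.eu.orders-db",
    "SELECT id, status FROM orders WHERE status = 'open'",
)
print(sql)
```

The result of EXTERNAL_QUERY can be joined with native BigQuery tables, which is what allows the pipeline logic to stay in SQL.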

Google Cloud IAM and BigQuery are global services, so it’s possible to grant access, full access or read-only, to datasets or specific tables/views to Cloud Identities outside of your organization. This can be prevented by configuring a policy. It is recommended to use authorized views to limit access to the underlying tables.
The data storage costs are billed to the project the datasets belong to. The query costs will be billed to the project that queries the data.

Since mid-2020, BigQuery Omni has been in preview; it brings a managed version of the BigQuery query engine to Azure and AWS, so the data doesn't need to move to GCP. Public pricing and detailed features are not yet available.
Recently Google announced several upcoming new services, such as Dataplex and Analytics Hub, with additional data sharing and monitoring capabilities.

Ref. https://cloud.google.com/blog/products/data-analytics/introducing-bigquery-omni

Snowflake’s platform has been natively designed with multi-cloud and data sharing in mind, and this clearly shows as soon as you go into the details.

Data needs to be pushed to Snowflake using internal or external stages.
A stage is Snowflake terminology for a distributed storage bucket. The difference with BigQuery is that a Snowflake account can load and unload data, and create external tables, on external stages located in Azure, AWS, or GCP. The IAM of the cloud provider handles the access to external stages, so fine-grained control and customer-managed encryption are available.
This functionality enables many application scenarios because large companies use multiple cloud platforms. Hence, accessing data from other internal departments or external parties in other clouds with fewer error-prone data movements is a game-changer.

Ref. https://docs.snowflake.com/en/user-guide/data-load-overview.html#supported-file-locations
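
The external stages described above could be set up roughly as follows. This is a minimal Python sketch that only generates Snowflake SQL text: a CREATE STAGE statement pointing at a bucket in one of the three clouds, and a COPY INTO statement that loads from it. The stage, bucket, table, and storage integration names are hypothetical, and the storage integration is assumed to exist already.

```python
# URL prefixes Snowflake uses for each cloud provider's object storage.
URL_PREFIX = {"aws": "s3://", "gcp": "gcs://", "azure": "azure://"}


def create_stage_sql(stage: str, cloud: str, location: str, integration: str) -> str:
    """Build a CREATE STAGE statement for an external stage."""
    return (
        f"CREATE STAGE {stage}\n"
        f"  URL = '{URL_PREFIX[cloud]}{location}'\n"
        f"  STORAGE_INTEGRATION = {integration};"
    )


def copy_into_sql(table: str, stage: str) -> str:
    """Build a COPY INTO statement that loads CSV files from a stage."""
    return f"COPY INTO {table} FROM @{stage} FILE_FORMAT = (TYPE = CSV);"


# Hypothetical names, for illustration only.
print(create_stage_sql("partner_stage", "aws", "partner-bucket/exports/", "partner_int"))
print(copy_into_sql("raw.orders", "partner_stage"))
```

Only the URL prefix changes between clouds; the Snowflake account that runs the statements can itself be hosted on any of the three providers.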

Secure data sharing is another core functionality of Snowflake.
An account admin, or anybody with the proper permissions, can act as a data provider and share tables, secure views, and secure functions.
Then it's up to the data consumer to configure the share so that the data appears as a read-only database. If you want to buy or monetize data, this can be done as a secure direct share, via a configurable data exchange portal, or through the public Snowflake Data Marketplace. Data sharing leverages the cloud metadata layer to grant access to the data in storage, so no data is copied.
If the data needs to be shared cross-region or cross-cloud provider, it’s only a matter of configuring automatic data replication.
The data provider covers the data storage costs, while data consumption, that is, the warehouse credits, is billed to the data consumer. Reader accounts can be created to provide access to other parties that don't have a Snowflake account.

Ref. https://docs.snowflake.com/en/user-guide/data-sharing-intro.html
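
A direct share between two accounts could look roughly like the sketch below. The Python helpers only generate the Snowflake SQL that the provider and the consumer would each run; the share, database, table, and account names are hypothetical.

```python
def provider_statements(share: str, database: str, schema: str,
                        table: str, consumer_account: str) -> list[str]:
    """SQL the data provider runs to publish one table through a share."""
    return [
        f"CREATE SHARE {share};",
        f"GRANT USAGE ON DATABASE {database} TO SHARE {share};",
        f"GRANT USAGE ON SCHEMA {database}.{schema} TO SHARE {share};",
        f"GRANT SELECT ON TABLE {database}.{schema}.{table} TO SHARE {share};",
        f"ALTER SHARE {share} ADD ACCOUNTS = {consumer_account};",
    ]


def consumer_statement(local_db: str, provider_account: str, share: str) -> str:
    """SQL the data consumer runs to mount the share as a read-only database."""
    return f"CREATE DATABASE {local_db} FROM SHARE {provider_account}.{share};"


# Hypothetical names, for illustration only.
for stmt in provider_statements("sales_share", "sales_db", "public", "orders", "consumer_acct"):
    print(stmt)
print(consumer_statement("shared_sales", "provider_acct", "sales_share"))
```

None of these statements moves data: the provider grants access to existing objects, and the consumer creates a database that points at the share.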

The secure data sharing capabilities of Snowflake, the Data Cloud, enable plenty of internal and external use cases.
In true "data mesh" fashion, domains can offer fully documented, up-to-date, and governed data products to other internal and external domains or stakeholders.

Ref. https://docs.snowflake.com/en/user-guide/data-exchange-benefits.html

Data Cloning

In BigQuery, it’s possible to use CREATE TABLE AS SELECT … to copy data into a new table quickly.
In Snowflake’s platform, this is possible, but often a better alternative is “zero-copy cloning.” Instead of duplicating the data, the Snowflake metadata will point to the existing micro-partitions.
Further updates to the cloned data will generate new micro-partitions.
In the end, cloning is faster and less costly because less data is stored and needs to be manipulated.
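
The idea behind zero-copy cloning can be illustrated with a toy model in Python. This is a conceptual sketch, not Snowflake's actual storage engine: a table is modeled as a list of references to immutable partitions, a clone copies only that list, and later writes to the clone add new partitions without touching the original. In Snowflake itself, a clone is created with a statement such as CREATE TABLE dev_orders CLONE orders; (table names hypothetical).

```python
from dataclasses import dataclass, field
from itertools import count

_partition_ids = count(1)  # stand-in for micro-partition identifiers


@dataclass
class Table:
    # Each entry is (partition_id, rows); partitions are never modified in place.
    partitions: list = field(default_factory=list)

    def insert(self, rows):
        """New data always lands in a new partition."""
        self.partitions.append((next(_partition_ids), tuple(rows)))

    def clone(self):
        """Copy only the list of references, not the partitions themselves."""
        return Table(list(self.partitions))


def stored_partitions(*tables):
    """Count the distinct partitions that actually occupy storage."""
    return len({pid for t in tables for pid, _ in t.partitions})


prod = Table()
prod.insert(["a", "b"])
prod.insert(["c"])

dev = prod.clone()                          # no new partitions are written
shared_after_clone = stored_partitions(prod, dev)

dev.insert(["d"])                           # only the clone gets a new partition
print(shared_after_clone, stored_partitions(prod, dev), len(prod.partitions))
```

Right after cloning, both tables reference the same partitions, so storage does not grow; it grows only by what is written to either table afterwards.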

This is powerful functionality if you need to create development environments or quickly troubleshoot or fix data without any impact on the base tables. Depending on what you clone (a database, a schema, or a table), grants are not copied by default. Hence, it's also easy to mask the data in development environments by specifying the most appropriate masking policy.

Conclusion

From a data sharing point of view, Snowflake is a proven Data Cloud available on three major cloud providers with automatic, global-scale secure data sharing.
Google BigQuery and other competitors are closing the gaps by introducing multi-cloud capabilities and data transfer services to keep distributed storage and tables in sync between regions and clouds. These are less automatic than the data sharing and cloning capabilities of Snowflake's platform.

The growth of the Snowflake and BigQuery data marketplaces and the number of data shares mentioned at the latest Snowflake Summit indicate that specific industries are ready to embrace this. Imagine a world without fragile data pipelines that do barely more than copy stale data between distributed storage accounts or, even worse, SFTP servers 😃
This will free up resources in data teams to focus on helping the business to generate more business value instead of spending a large chunk of the budget on only moving data around. Easier ways to offer data monetization at scale will be a new opportunity for industries.

In the next and last blog post of this series, we will look into how BigQuery and Snowflake handle semi-structured data, data security, and SQL.

Get in touch if you have any questions or suggestions! At Tropos.io, as a Snowflake partner, we can help your team design a cost-efficient, secure, and flexible data pipeline.

Koen Verschaeren

Solutions Architect

