Snowflake for BigQuery users - part 3

In part 3/3 on Snowflake for BigQuery users, we consider unstructured data, data masking & data security features and SQL support

In this series, I’m focussing on the key differences between Snowflake and BigQuery that have an impact on the data pipeline, table design, performance and costs, so it is by no means a detailed product comparison or benchmark.

  1. Part 1: virtual warehouses, data storage and data loading/unloading
  2. Part 2: data sharing and multi-cloud capabilities
  3. Part 3: unstructured data, data masking & data security features and SQL support

Semi-structured data

Relational databases and data warehouses were traditionally used for structured data: basically, anything that fits into rows and columns.
This is not the case with modern cloud data warehouses and the latest versions of relational databases: support is available for JSON, XML and data stored in modern file formats such as Avro, Parquet and ORC.
This is important because semi-structured data generated by APIs, web tracking and similar sources can be stored and processed in the cloud data warehouse instead of having to be transformed into rows with a fixed schema first. Parsing semi-structured data on the fly is often essential for a very fast time to value because fewer code changes, releases and data reloads are needed to support new columns.

BigQuery approaches semi-structured data in two ways.
It has native support for nested data structures by providing a record-like data type that can be repeated. Using UNNEST and ARRAY_AGG, it’s possible to flatten and re-create nested data structures.

Ref. https://cloud.google.com/bigquery/docs/nested-repeated
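
To make this concrete, here is a minimal sketch of both directions, assuming a hypothetical orders table with a repeated items RECORD column; all table and column names are illustrative.

  -- Flatten a repeated RECORD column into one row per item (BigQuery)
  SELECT
    o.order_id,
    item.sku,
    item.quantity
  FROM `my_project.shop.orders` AS o,
       UNNEST(o.items) AS item;

  -- Re-create a nested structure by aggregating rows back into an array
  SELECT
    order_id,
    ARRAY_AGG(STRUCT(sku, quantity)) AS items
  FROM `my_project.shop.order_lines`
  GROUP BY order_id;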

The second approach is storing JSON data in a STRING column and using the JSON functions, together with UNNEST, to extract and manipulate the content. JSONPath is the only supported way to retrieve the keys/values.
BigQuery has no out-of-the-box support for XML, but simple XML use-cases can be handled with JavaScript user-defined functions.
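
A short sketch of the JSON-in-a-STRING approach, assuming a hypothetical events table with a payload STRING column holding JSON; names are illustrative.

  -- Extract scalar values from a JSON string and explode a JSON array (BigQuery)
  SELECT
    JSON_VALUE(payload, '$.customer.id') AS customer_id,
    CAST(JSON_VALUE(payload, '$.amount') AS NUMERIC) AS amount,
    item
  FROM `my_project.tracking.events`,
       UNNEST(JSON_EXTRACT_ARRAY(payload, '$.items')) AS item;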

Snowflake’s platform currently has no support for nested data structures in tables. Semi-structured data is stored and processed in VARIANT, OBJECT and ARRAY data types, and these columns are automatically stored in an efficient, optimised format. The FLATTEN table function is used to transform the semi-structured data into rows. Compared to UNNEST in BigQuery, FLATTEN has more functionality; the RECURSIVE option, for example, can be useful in certain use-cases.
Keys and values can be retrieved using dot or bracket notation and cast to the right data type using ::<Snowflake data type>. Snowflake also has more SQL functions to manipulate arrays and semi-structured data.

Ref. https://docs.snowflake.com/en/user-guide/json-basics-tutorial-flatten.html
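
For comparison, a minimal Snowflake sketch, assuming a hypothetical raw_events table with a VARIANT column payload; names are illustrative.

  -- Flatten a JSON array inside a VARIANT column and cast the values (Snowflake)
  SELECT
    e.payload:customer.id::STRING AS customer_id,
    f.value:sku::STRING           AS sku,
    f.value:quantity::NUMBER      AS quantity
  FROM raw_events e,
       LATERAL FLATTEN(input => e.payload:items) f;
  -- FLATTEN(input => ..., RECURSIVE => TRUE) expands nested elements at any depth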

Recently, Snowflake introduced support for unstructured data, i.e. files on distributed storage, in private preview with the GET_PRESIGNED_URL(…) function. Together with UDFs or external functions, this enables plenty of exciting ML and data integration options.
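
A rough sketch of what this looks like, assuming an illustrative internal stage @documents; as the feature was in private preview at the time of writing, the exact syntax may still change.

  -- Generate a temporary, signed URL for a file on a stage (Snowflake)
  SELECT GET_PRESIGNED_URL(@documents, 'invoices/2021/invoice_001.pdf', 3600) AS file_url;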

SQL Support 


As soon as you start working with Snowflake’s platform, you notice that it has been designed with SQL in mind. All the functionality, including configuring external functions, integrating with cloud-specific distributed storage and user administration, can be done with SQL. It’s a very familiar environment for people with a background in relational databases or data warehouse appliances.
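
User and role administration, for example, is plain SQL. A minimal sketch with illustrative names:

  -- Create a role and a user and grant access, all from a worksheet (Snowflake)
  CREATE ROLE analyst;
  GRANT USAGE ON DATABASE sales TO ROLE analyst;
  GRANT USAGE ON SCHEMA sales.public TO ROLE analyst;
  GRANT SELECT ON ALL TABLES IN SCHEMA sales.public TO ROLE analyst;
  CREATE USER jane PASSWORD = '********' DEFAULT_ROLE = analyst;
  GRANT ROLE analyst TO USER jane;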

Snowflake recently introduced a REST API and Snowpark to enable integration from the most common programming languages. This closes the gap for users with experience in data science and big data processing frameworks.

The number of SQL functions is larger than in BigQuery and includes a number of useful extensions, such as MATCH_RECOGNIZE to work with patterns in rows and QUALIFY to filter on the output of window functions.
User-defined functions can be written in JavaScript, SQL and Java. Snowflake can also call external functions, i.e. REST APIs, so external services can be integrated directly from SQL.
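
QUALIFY, for instance, removes the need for a wrapping subquery when filtering on a window function. A small sketch with illustrative names:

  -- Keep only the most recent order per customer (Snowflake)
  SELECT customer_id, order_id, order_date
  FROM orders
  QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) = 1;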

Google BigQuery also has a wide range of SQL functions, and JavaScript and SQL user-defined functions are supported. In the last two years, BigQuery has implemented a lot of DDL/DML functionality.
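
A minimal sketch of a temporary JavaScript UDF in BigQuery, with illustrative names:

  -- Define and use a JavaScript UDF within a single query (BigQuery)
  CREATE TEMP FUNCTION clean_label(s STRING)
  RETURNS STRING
  LANGUAGE js AS r"""
    return s ? s.trim().toLowerCase() : null;
  """;

  SELECT clean_label(product_label) AS label
  FROM `my_project.shop.products`;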

Both solutions have support for GIS analysis.

Data masking & data security

Snowflake and BigQuery offer fine-grained security controls. 
It’s easy to grant access to databases or datasets, schemas, tables and views. 
Row-level data security (Snowflake row access policies, BigQuery row-level access policies) is available in both solutions with a similar approach.
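
In Snowflake, for example, this is done with a row access policy; BigQuery applies row-level access policies to tables in a similar spirit. A sketch with illustrative names, assuming a region_mapping table that maps roles to the regions they may see:

  -- Only return rows of regions the current role is mapped to (Snowflake)
  CREATE ROW ACCESS POLICY region_policy AS (sales_region STRING) RETURNS BOOLEAN ->
    CURRENT_ROLE() = 'ADMIN'
    OR EXISTS (
      SELECT 1
      FROM region_mapping m
      WHERE m.role_name = CURRENT_ROLE()
        AND m.region = sales_region
    );

  ALTER TABLE sales ADD ROW ACCESS POLICY region_policy ON (region);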

Google currently offers column-level security: access to columns, for example columns containing PII data, can be denied for specific users or user groups. This is managed using policy tags in the Data Catalog. Data masking using policies is currently not available.

Ref. https://cloud.google.com/bigquery/docs/column-level-security-intro

Snowflake’s approach is different. Instead of focussing on access to columns, the data itself can be hidden by applying role-based data masking or external tokenisation. A masking policy uses SQL, so it is possible to mask both structured and semi-structured data.

Ref. https://docs.snowflake.com/en/user-guide/security-column-intro.html#choosing-dynamic-data-masking-or-external-tokenization
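
A minimal sketch of a dynamic data masking policy, with illustrative names:

  -- Show the real e-mail address only to privileged roles (Snowflake)
  CREATE MASKING POLICY email_mask AS (val STRING) RETURNS STRING ->
    CASE
      WHEN CURRENT_ROLE() IN ('FULL_PII_ROLE') THEN val
      ELSE '***MASKED***'
    END;

  ALTER TABLE customers MODIFY COLUMN email SET MASKING POLICY email_mask;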

Conclusion — What’s the “best” solution?

This is the end of the series. I hope I’ve managed to highlight some of the key differences and similarities between Snowflake and BigQuery. 

I guess some of you would like to know what the best solution is. This is not easy to answer: it depends on your cloud strategy, data strategy, budget, the experience of your team, data volumes, types of data, integration with source systems, integration with data visualisation tools, the need for large-scale AI, and so on. Both are excellent, fully managed solutions that can support the majority of small and large-scale data platform use-cases.

In my opinion, Snowflake is currently ahead of BigQuery because it’s natively designed with multi-cloud deployment and multi-cloud data sharing in mind. Data is essential in modern enterprises and a key ingredient for more advanced ML models, new business models and faster cross-company collaboration. So seamless, secure, no-fuss data sharing using Snowflake is and will be a game-changer as soon as people notice they can avoid plenty of additional infrastructure, tools and API layers while maintaining security and auditability.

Ref. https://docs.snowflake.com/en/user-guide/data-exchange-benefits.html

Snowflake is a great fit for teams with a background in “old skool” data warehouse appliances and for modern data teams that want to run data transformations and analytics in the data layer instead of relying on distributed storage and distributed data processing frameworks.

In certain environments, I wouldn’t mind Snowflake and BigQuery coexisting. It would make absolute sense to keep detailed Google Analytics data in BigQuery and sync aggregates to Snowflake to combine them with CRM/ERP and other data sources.
In a data mesh architecture, domains can run on different services as long as the data is accessible, quality-controlled and documented. Modern data tools tend to support both platforms.

Get in touch if you have any questions or suggestions! At Tropos.io, as a Snowflake partner, we can help your team design a cost-efficient, secure and flexible data pipeline.

Koen Verschaeren

Solutions Architect
