DataOps basics for Snowflake — part 1

Job automation is key to efficient data engineering. Here are a few tips for using Snowflake features to build user acceptance test environments.
Joris Van den Borre

Automate testing and releasing for data products

Once you get more serious about Snowflake, you may want to optimize how you and your team organize your work. In our engineering practice, we have noticed that up to 20% of an engineer’s time is spent on release management. DataOps, which combines data engineering and software operations work in one role, aims to address this productivity challenge. In particular, if you want to deploy models to UAT and production environments, you may encounter some Snowflake concepts for the first time.

“SDLC”
Software development life cycle: the practice of building, testing, and releasing software projects in separate environments.

Operationalization used to be slow and manual

We saw the same pattern with every other data technology we worked with before:

The BI team wants a data refresh of their dev environment;
The sysops team restores a backup of the production database;
The sysops team runs some scripts to remove sensitive data from the restored copy;
The BI team starts developing.

And the same pattern repeats — usually with some minor variations — for the user acceptance environment.

Borrowing software engineering’s best practices

One big challenge with the scenario above (let’s call it traditional) is that it is hard to govern and an inefficient use of a team’s time. After all, copying data between servers doesn’t add much business value.

Snowflake — the data cloud — offers a new perspective on this practice. As we are working with fully cloud-native technology, we’re not blocked by the constraints — such as expensive storage and physically constrained technology capabilities — that made our SDLC process less efficient in the early days.

Snowflake’s zero-copy-cloning personalizes dev environments

In its underlying data model, Snowflake doesn’t store data as complete sets but rather as assemblies of changes that occurred over time. This is meaningful to us for multiple reasons, of which SDLC management is just one.

Time Travel, as this timeline-based approach is branded, lets us query data as it existed at an earlier point in time. Zero-copy cloning builds on the same storage model and allows us to create a virtual copy of a database, schema, or table without making a physical copy of the data set.

Neat. With this feature, we can create development and UAT environments using a simple SQL statement that replaces the whole backup/restore workflow outlined above.

create database my_dev_db clone my_prd_db;
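
The same statement can also be generated from a script, which becomes useful once clone creation is automated. The following is a minimal Python sketch rather than a definitive implementation: the function name `clone_statement` and the identifier check are assumptions made for illustration. The optional `at (offset => ...)` clause uses Snowflake's Time Travel syntax to clone the source as it existed a number of seconds ago.

```python
import re
from typing import Optional

# Conservative check for unquoted identifiers: a letter or underscore
# followed by letters, digits, underscores, or dollar signs.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_$]*$")


def clone_statement(target: str, source: str, seconds_ago: Optional[int] = None) -> str:
    """Build a zero-copy clone statement for a whole database.

    `clone_statement` is a hypothetical helper, not part of any Snowflake library.
    """
    for name in (target, source):
        if not _IDENTIFIER.match(name):
            raise ValueError(f"not a plain identifier: {name!r}")
    sql = f"create database {target} clone {source}"
    if seconds_ago is not None:
        if seconds_ago <= 0:
            raise ValueError("seconds_ago must be positive")
        # Time Travel: clone the source as it was `seconds_ago` seconds in the past.
        sql += f" at (offset => -{seconds_ago})"
    return sql + ";"


print(clone_statement("my_dev_db", "my_prd_db"))
print(clone_statement("my_uat_db", "my_prd_db", seconds_ago=3600))
```

Rejecting anything that is not a plain identifier keeps branch names or user input from being pasted into the statement unchecked.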

Using Git to manage all lifecycle processes

There is ample opportunity in automating clone creation at key moments during project development. Examples of such moments are:

An engineer picks up a new chunk of work (an agile “story”) and wants to develop the functionality separately from anything anyone else is doing;
The team prepares for a demo and wants to merge all of their functionality into a “release candidate”;
The product owner or client approves the work that has been delivered and wants to promote the changes to a production environment.

All of these moments can be defined as events in a code versioning system, such as GitHub, GitLab, or Bitbucket.

More specifically, every level of code isolation (“my sandbox, our sandbox, my organization’s Snowflake tenant”) can be its own branch — or separately stored and managed code version — in Git’s terminology.

And here’s where everything merges. Our CI (continuous integration) product defines which script to execute when a branch is created, when a branch is deleted, or when a code change is submitted to one. You don’t need to pick a fancy, expensive CI product either, as GitHub, GitLab, and Bitbucket include their own.
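
As a rough illustration of that event-to-script mapping, the sketch below shows a Python function a CI job might call. Everything in it is an assumption made for the example: the event names (`branch_created`, `branch_deleted`, `push`), the `statements_for_event` function, and the database names. A real pipeline would read the event and branch from the CI platform's own variables and send the resulting statements to Snowflake.

```python
from typing import List


def statements_for_event(event: str, database: str, source: str = "my_prd_db") -> List[str]:
    """Return the SQL a CI job could run for one Git event (hypothetical mapping)."""
    if event == "branch_created":
        # A new branch gets its own zero-copy clone of the source database.
        return [f"create database {database} clone {source};"]
    if event == "branch_deleted":
        # The branch is gone, so its environment can be removed as well.
        return [f"drop database if exists {database};"]
    if event == "push":
        # New commits: nothing to clone; the CI job would deploy and test
        # the changed code against the existing database instead.
        return []
    raise ValueError(f"unknown event: {event!r}")


for event in ("branch_created", "push", "branch_deleted"):
    print(event, statements_for_event(event, "feature_login_db"))
```

Returning the statements instead of executing them keeps the mapping easy to test without a Snowflake connection.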

By using the Git platform in a structured way and setting clear standards for a team’s workflow, we’re able to run the full release management process for even large Snowflake projects right out of our Git platform.

Depending on the branch type, we will use the zero-copy-clone feature to create an appropriate database object in Snowflake:

Creating a new feature database, accessible only to the engineer who picked up a task and isolated from anyone else’s development activities. Unit testing (validating the quality of a single unit of work) can happen here;
Creating a new release candidate database, where all of the individual contributions land together and can be validated as if it were a release to production;
Releasing tested and validated changes to production without downtime for your data application, even if you’re dealing with petabytes of data and complex transformations during release.
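
The branch-to-database mapping above could be sketched as follows. The branch prefixes (`feature/`, `release/`, `main`), the database naming scheme, and the `plan_for_branch` function are assumptions chosen for illustration, not conventions prescribed by Snowflake or Git.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DatabasePlan:
    database: str              # database the branch works against
    clone_from: Optional[str]  # source to clone, or None when nothing is cloned


def _slug(text: str) -> str:
    """Turn a branch suffix into a safe, uppercase identifier fragment."""
    return re.sub(r"[^A-Za-z0-9]+", "_", text).strip("_").upper()


def plan_for_branch(branch: str, production: str = "MY_PRD_DB") -> DatabasePlan:
    """Pick a database for a branch (hypothetical naming convention)."""
    if branch == "main":
        # Production itself: validated changes are released here, no clone.
        return DatabasePlan(production, None)
    kind, _, rest = branch.partition("/")
    if kind == "feature" and _slug(rest):
        # Private sandbox for one engineer, cloned from production.
        return DatabasePlan(f"FEATURE_{_slug(rest)}", production)
    if kind == "release" and _slug(rest):
        # Shared release candidate where all contributions come together.
        return DatabasePlan(f"RC_{_slug(rest)}", production)
    raise ValueError(f"unsupported branch name: {branch!r}")


print(plan_for_branch("feature/PROJ-42-login"))
print(plan_for_branch("release/1.2.0"))
print(plan_for_branch("main"))
```

Deriving the database name from the branch name means nobody has to remember which environment belongs to which piece of work. Restricting a feature database to one engineer would additionally need role grants, which this sketch leaves out.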

And this is just the starting point. In a real-life scenario, one would maybe link Jira to Snowflake so that every time an engineer starts working on a new agile story,

So how to take it from here?

This article is just a short introduction to how a proper release strategy, team setup, Snowflake, and just the right amount of automation can supercharge an analytics engineering team’s velocity and output quality.

A real-life scenario would build on this but add potentially many layers of quality control and data validation on top.

One of our most essential learnings in this domain is to keep things simple and structured, which might not be easy if you are just starting to wander through the forests of a modern data stack. Just get in touch if you’d like to share some thoughts on the concepts.

Stay tuned for parts 2 and 3, where we’ll dig deeper into the dynamics of automating tedious but critical jobs in our release cycles.

Joris Van den Borre

Founder, CEO and solutions architect
