Apache Superset: our review of the popular open-source data visualization platform

Our current take on Apache Superset, and why we think it is a viable alternative to Power BI or Tableau at scale.
Koen Verschaeren

With the Tropos.io data stack radar, we keep track of several open-source projects and commercial tools. For each category, we like to have multiple options so that we can adapt the data stack to the customer’s use cases, scale, and budget. Before we add a solution to our tech radar, we assess it to ensure we can successfully scale and secure it in production with minimal risk. For this blog post, we’ll dive into Apache Superset, a popular open-source data visualization project that competes directly with Tableau.

Technical Architecture

Since Apache Superset is already in production with many users, including at Airbnb, where the project started in 2016, the architecture is well thought out and scalable.
The web servers that serve the UI can scale independently of the workers that execute the SQL queries, using a load balancer and a scaling group or policy. Thanks to a configurable Redis-based cache, not every query triggered by the charts on a dashboard has to hit the (cloud) data warehouse.
It’s relatively easy to spin all this up using Docker Compose or the Helm chart. The documentation could be improved with a few Terraform or Pulumi examples of the reference architecture on Azure, AWS, or GCP, with the metadata running on a managed database. OAuth2 configuration documentation, often a requirement in large enterprises, is available.
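
To make the caching and worker split above more concrete, here is a minimal sketch of what the relevant part of superset_config.py could look like. CACHE_CONFIG, DATA_CACHE_CONFIG, and CELERY_CONFIG are Superset setting names; the Redis host, database numbers, key prefixes, and timeouts are placeholder assumptions on our side, so adjust them to your own deployment.

```python
# superset_config.py (sketch)
# Assumption: a Redis instance is reachable at redis:6379. The database
# numbers, prefixes, and timeouts below are illustrative placeholders.
REDIS_URL = "redis://redis:6379"

# Cache for Superset's own metadata (Flask-Caching settings).
CACHE_CONFIG = {
    "CACHE_TYPE": "RedisCache",
    "CACHE_DEFAULT_TIMEOUT": 300,  # seconds
    "CACHE_KEY_PREFIX": "superset_meta_",
    "CACHE_REDIS_URL": f"{REDIS_URL}/0",
}

# Cache for chart query results, so repeated dashboard loads
# do not have to hit the data warehouse again.
DATA_CACHE_CONFIG = {
    **CACHE_CONFIG,
    "CACHE_DEFAULT_TIMEOUT": 3600,  # seconds
    "CACHE_KEY_PREFIX": "superset_data_",
    "CACHE_REDIS_URL": f"{REDIS_URL}/1",
}


class CeleryConfig:
    """Settings for the workers that run queries outside the web servers."""

    broker_url = f"{REDIS_URL}/2"
    result_backend = f"{REDIS_URL}/3"


CELERY_CONFIG = CeleryConfig
```

A longer timeout on the data cache trades freshness for fewer warehouse queries, so the right value depends on how often the underlying data is refreshed.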

Apache Superset systems architecture — Ref. https://www.startdataengineering.com/post/apache-superset-tutorial/

SQLAlchemy, the framework used by Superset to connect to the data sources, has battle-tested connectors to all major databases and cloud data warehouses.
From a Snowflake point of view, it’s possible to set a default warehouse and default role in the connection string.
What we would still like to see is the ability to authenticate with OAuth, so that we can fully leverage role-based security and data masking.
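
As an illustration of the default warehouse and role mentioned above, the sketch below builds a Snowflake SQLAlchemy URI with both set as query parameters. The helper name snowflake_uri and all credentials, account, and object names are made up for this example; only the URI shape follows the Snowflake SQLAlchemy connector.

```python
from urllib.parse import quote_plus, urlencode


def snowflake_uri(user, password, account, database, schema, warehouse, role):
    """Build a SQLAlchemy URI for Snowflake with a default warehouse and role.

    Hypothetical helper for illustration; all values passed in are placeholders.
    """
    # URL-encode the credentials so characters like "@" or "/" stay valid.
    credentials = f"{quote_plus(user)}:{quote_plus(password)}"
    query = urlencode({"warehouse": warehouse, "role": role})
    return f"snowflake://{credentials}@{account}/{database}/{schema}?{query}"


uri = snowflake_uri(
    user="superset_svc",
    password="p@ss/word",
    account="xy12345.eu-west-1",
    database="ANALYTICS",
    schema="PUBLIC",
    warehouse="BI_WH",
    role="BI_READER",
)
print(uri)
```

The resulting string is what you would paste into the SQLAlchemy URI field when adding the database in Superset.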

Apache Superset’s visualization capabilities

Since the move to ECharts, the number of great-looking chart types keeps expanding, and more tutorials on designing your charts appear online and in the Slack community.
From a GIS point of view, it’s great that deck.gl visualizations are included. You need to add MAPBOX_API_KEY = <your key> to your configuration file (config.py or superset_config.py).
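
A small sketch of that setting, assuming you would rather read the key from an environment variable than hard-code it in the configuration file. The setting name comes from the text above; reusing the same name for the environment variable is our own choice.

```python
import os

# superset_config.py (sketch)
# Assumption: the Mapbox token is provided through an environment variable
# of the same name, so it stays out of version control.
MAPBOX_API_KEY = os.environ.get("MAPBOX_API_KEY", "")
```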

The UI is not yet drag-and-drop, but this is not a show-stopper because the drop-downs are intuitive. For the most common chart types, tweaking JSON is not necessary.

Data sources are transformed into “datasets” that act as a basic semantic layer, so it’s easy to add friendly field names, descriptions, calculated columns, and even some data governance-related fields.

Security

In several open-source data visualization frameworks, such as Plotly Dash, it’s feasible to create a great-looking dashboard, but configurable role-based access and row-based security are not available. These are features that are not that easy to develop from scratch.
Apache Superset supports these security features out of the box.
The names of the default roles, Gamma and Alpha, are a bit odd.
Users with the Gamma role only have access to specific data sources and to the charts and dashboards based on those sources. These are the dashboard viewers or readers. Alpha users are the dataset, chart, and dashboard creators. They can create new artifacts, but can only edit the artifacts they own.
Let’s look into an example.
I’ve created a role to limit access to a specific data source.

Then I’ve created a dashboard user and dashboard admin.

The dashboard user only has access to the covid dashboard and dataset.

The admin user can edit the Covid dataset.

It’s not possible to edit other datasets.

Custom roles offer very fine-grained control over the functionality exposed to users and the API layer.
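
To make the example above a bit more tangible, here is a purely conceptual sketch of how a custom role restricts a user to one dataset. This is not Superset's implementation; the class names, the role name covid_reader, and the dataset names are assumptions for illustration only.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Permission:
    """A single grant: an action on a named resource (illustrative model)."""

    action: str
    resource: str


@dataclass
class Role:
    name: str
    permissions: set = field(default_factory=set)


@dataclass
class User:
    username: str
    roles: list = field(default_factory=list)

    def can(self, action, resource):
        """Return True if any of the user's roles grants the permission."""
        wanted = Permission(action, resource)
        return any(wanted in role.permissions for role in self.roles)


# A custom role that only grants access to the covid dataset.
covid_reader = Role("covid_reader", {Permission("datasource_access", "covid")})
viewer = User("dashboard_user", [covid_reader])

print(viewer.can("datasource_access", "covid"))  # True
print(viewer.can("datasource_access", "sales"))  # False
```

In Superset itself you assign such permissions to a role in the security UI, and a user typically combines a base role with one or more of these data-source-specific roles.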

Row-level security is available, but in versions before 1.2.0 you have to enable the feature explicitly.
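
A minimal sketch of enabling it, assuming the feature is controlled through the ROW_LEVEL_SECURITY feature flag in superset_config.py. Check the documentation for your exact version before relying on this.

```python
# superset_config.py (sketch)
# Assumption: row-level security is switched on through this feature flag
# in the versions where it is not enabled by default.
FEATURE_FLAGS = {
    "ROW_LEVEL_SECURITY": True,
}
```

Once enabled, a row-level security rule ties a role and a dataset to a filter clause that gets added to the queries for that dataset.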

Apache Superset lacks several features available in commercial visualization tools, such as folders and groups to organize datasets, charts, and dashboards. In projects with a manageable number of users, this is not a show-stopper, and because the metadata database is accessible, a workaround is feasible.

Preset.io

As with other tools on our data stack radar, a fully managed service from the founders of the open-source project will be available.

We’ve been in touch with Preset.io, and a fully managed Apache Superset with a free tier will be available publicly in the coming months.
We can’t confirm the pricing yet, but the estimate we got positions a Preset.io user (with full access to all features) between the reader account and dashboard creator account pricing seen at several competitors 🙂
The launch pricing might be an issue for use cases with a few dashboard creators and a large number of dashboard viewers. As with any SaaS solution, negotiating a better customer-specific deal is an option.

SSO/IdP integration will be part of the entry-level tier, and we were pleasantly surprised that it will be possible to spin up the service in European cloud locations on AWS/Azure/GCP. More advanced tiers with other features, such as AWS PrivateLink, will be available. A migration path from and to the open-source version will be provided.

Besides renaming the roles, Preset.io adds the concept of separate workspaces to manage groups of users and dashboards.

Conclusion

We hope this blog post gives you a view of why we “upgraded” Apache Superset in our data stack radar. With this progress and the great community behind it, we are confident about using Apache Superset in several projects.
The free Preset.io tier will be a valuable deployment option if you don’t want to test, secure, and upgrade a self-managed Apache Superset yourself on a regular basis.

So spin up Apache Superset if you are looking for a cloud-native, customizable, and user-friendly data visualization solution, and get in touch if you need support.

Koen Verschaeren

Solutions Architect

