Key Machine Learning announcements at AWS re:Invent ’17

Amazon announced a heap of new AWS features that will likely impact the way we design and operate machine learning pipelines. These are our 5 key takeaways.
Joris Van den Borre

re:Invent is Amazon Web Services’ annual event held in Las Vegas, USA. The event keeps selling out, and the 2017 edition already spanned five hotels (The Venetian, Mirage, MGM Grand, Aria and Encore). Amazon typically uses the event to announce new products hosted on its cloud platform, and we had high expectations for their machine learning lineup.

I’m summarising my personal notes on what I feel are the important takeaways for data engineers, machine learning engineers and data scientists from the past event week. I’m assuming you have prior experience with the Amazon cloud and are somewhat familiar with its broad component ecosystem and acronyms.

The TL;DR summary:

Serverless continues trending upwards. The role of serverless microservices is confirmed, and AWS extends its offering with serverless container services and a serverless relational database.
End-to-end development processes seem to be moving to the cloud. Amazon supports data scientists and data engineers with native options for quality development environments, tightly integrated with code repositories and DevOps practices.
Amazon offers managed deep learning frameworks such as TensorFlow and MXNet as native components in a development pipeline, with automated hyperparameter tuning and managed deployments out of the box.

1. CODE DEVELOPMENT: THE RETURN OF CLOUD9.

Cloud9.io was quite a revolutionary coding platform back in its day. I appreciated its virtualized coding environment as much as its clean and straightforward UI. Personally, I felt let down after the announcement in July 2016 that Amazon was acquiring them, and by the subsequent product takedown.

Much to my surprise, re:Invent brought us the resurrection of Cloud9.io, rebranded as AWS Cloud9. It appears to be deeply integrated with Amazon EC2, AWS CloudFormation and AWS Identity and Access Management (IAM). What’s great here is that the former sandbox environment that came with Cloud9 is now a fully managed EC2 instance. Because we have full control over that instance, we have the flexibility to install additional packages. It comes with out-of-the-box support for programming languages we use and love, such as Python, Java and Go.

Integration with serverless

Furthermore, Cloud9 is tightly integrated with serverless microservices through AWS Lambda. The IDE is capable of writing, running and debugging AWS Lambda functions from within the browser. Engineers push locally developed code to a live environment through a continuous integration pipeline. The current release supports AWS’s own and relatively new CodeStar as a CI/CD platform. Being more of an Atlassian/Semaphore person myself, I am looking forward to broader integration in the near future.
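To make the Lambda workflow a bit more concrete, here is a minimal sketch of the kind of Python function you could write and run from such an IDE. The `lambda_handler(event, context)` signature is the standard shape of a Python Lambda entry point; the `name` payload key and the greeting are assumptions made up for this illustration.

```python
import json


def lambda_handler(event, context):
    """Entry point that AWS Lambda invokes for each request.

    `event` carries the request payload as a dict; `context` holds runtime
    metadata and is unused in this sketch.
    """
    # "name" is a hypothetical payload key chosen for this example.
    name = event.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }


if __name__ == "__main__":
    # Local smoke test: call the handler directly with a sample event.
    print(lambda_handler({"name": "Cloud9"}, None))
```

Because the handler is a plain function, it can be exercised locally with a sample event before it goes through the CI/CD pipeline.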

Personally, I welcome this release. I think Cloud9 is a strong workbench for at least some aspects of data engineering, one that comes with the conveniences of data science notebooks such as Jupyter.

2. COMPUTING: THE “SERVERLESS” TREND CONTINUES.

Container orchestration has always been kind of a hassle on AWS. Sure, we had Elastic Container Service (ECS) running Docker. But running the Kubernetes platform, initially a project by Google, came with considerable maintenance and setup work.

In a move to catch up with the competition, Amazon announced native Kubernetes support, branded as Amazon Elastic Container Service for Kubernetes (EKS). The module is currently in preview. It basically automates the installation, upgrading and high-availability aspects of running an orchestration platform. Furthermore, the platform is built upon open-source, upstream Kubernetes, so we can safely use existing plugins and tools borrowed from other cloud platforms.

Additionally, AWS announced serverless container support, branded as AWS Fargate. The technology abstracts away the provisioning, maintenance and scaling of the current EC2-based architecture for Elastic Container Service. This allows for per-second billing during peak loads. The announcement promises a technology that scales out to thousands of instances in a matter of seconds. Fargate is currently available for ECS workloads, with support for Kubernetes (through EKS) coming up in 2018.
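As a rough sketch of what this looks like from code: with boto3, launching a container on Fargate comes down to an ECS `run_task` call with `launchType` set to `FARGATE`. The cluster name, task definition and subnet ID below are hypothetical placeholders, and the actual call needs boto3 plus valid AWS credentials, so the sketch separates building the request from sending it.

```python
def build_fargate_run_task_request(cluster, task_definition, subnets, count=1):
    """Build keyword arguments for the ECS RunTask API using the Fargate launch type."""
    return {
        "cluster": cluster,
        "taskDefinition": task_definition,
        "launchType": "FARGATE",  # no EC2 container instances to manage
        "count": count,
        # Fargate tasks use the awsvpc network mode, so subnets are required.
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "subnets": list(subnets),
                "assignPublicIp": "DISABLED",
            }
        },
    }


def run_on_fargate(request):
    """Submit the request to ECS; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder works without boto3 installed

    return boto3.client("ecs").run_task(**request)


if __name__ == "__main__":
    request = build_fargate_run_task_request(
        cluster="demo-cluster",         # hypothetical cluster name
        task_definition="demo-task:1",  # hypothetical task definition
        subnets=["subnet-00000000"],    # hypothetical subnet ID
    )
    print(request)
```

Note that nothing in the request refers to an instance type or an autoscaling group, which is the point of the serverless launch type.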

Hybrid cloud and multicloud setups became a whole lot easier overnight. I sure appreciate the freedom for continuous experimentation this offers. We save considerable time and offload risk by having this as a service.

3. DATABASES: INTRODUCING NATIVE GRAPHS AND SERVERLESS RDBMS.

Now here’s a surprise. The announcement of Aurora Serverless (currently in preview) promises a serverless relational database, currently based on MySQL. The database will automatically start up, shut down, and scale capacity up or down based on the application’s needs. In contrast to their standard MySQL offering on Relational Database Service (RDS), the setup, maintenance and scaling of EC2 instances will disappear. Some interesting uses for this announcement could be low-volume blogs, test and acceptance environments, and new applications.

Amazon Neptune, currently in preview, extends the platform’s capabilities with a dedicated graph database. Use cases include fraud detection, knowledge graphs, drug discovery and network security, amongst others. The database is accessible over Apache TinkerPop. Now this is a move I personally consider interesting, as we’ve been working with graphs over the past half-year and have gone off-platform to avoid maintenance and setup.
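To illustrate why a graph model suits a use case like fraud detection, here is a small in-memory sketch in plain Python. It is not Neptune or TinkerPop code: the account and device names are invented, and on a real graph database the same two-hop question (which accounts share a device with this one?) would be expressed as a graph traversal instead.

```python
from collections import defaultdict


def build_graph(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def accounts_sharing_device(graph, account):
    """Return other accounts reachable in two hops: account -> device -> account."""
    related = set()
    for device in graph[account]:
        related |= graph[device]
    related.discard(account)
    return related


if __name__ == "__main__":
    # Hypothetical sample data: two accounts logged in from the same phone.
    edges = [
        ("account:alice", "device:phone-1"),
        ("account:bob", "device:phone-1"),
        ("account:carol", "device:laptop-7"),
    ]
    graph = build_graph(edges)
    print(sorted(accounts_sharing_device(graph, "account:alice")))
```

In a relational model the same question needs a self-join per hop, which is why relationship-heavy workloads tend to be easier to express in a dedicated graph store.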

4. END-TO-END MACHINE LEARNING WITH SAGEMAKER

This might easily be my personal favorite from the event. Building, training and deploying machine learning models has always been a manual endeavour at AWS. Sure, we had off-the-shelf machine images for TensorFlow. Until recently, AWS’s strategy was to push MXNet, the deep learning framework behind the voice-controlled Amazon Alexa. However, we didn’t see a clear path towards integration with their broader platform.

re:Invent, however, brings us Amazon SageMaker, a “fully managed service that enables data scientists and developers to quickly and easily build, train, and deploy machine learning models at any scale”. The technology seems to compete with Google’s Cloud ML, though without tailored GPUs or equivalents.

The machine learning platform comes with Jupyter as a front end for quick visualisation and development. It has direct links to native AWS data stores such as S3, MySQL, PostgreSQL and Redshift through their own Apache Spark-based integration.

SageMaker supports automated hyperparameter tuning, with MXNet and TensorFlow supported out of the box and the option to install other frameworks at will. At the infrastructure level, model training scales automatically on NVIDIA GPUs. The module deploys models to a managed, autoscaling cluster of EC2 instances with A/B testing support.
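For readers new to the idea, automated hyperparameter tuning boils down to running many training jobs with different settings and keeping the best one. The sketch below shows that loop with a plain random search in Python. It does not use the SageMaker API, random search is my stand-in rather than whatever strategy SageMaker applies, and `validation_loss` is a toy function invented for the example in place of a real training run.

```python
import random


def validation_loss(learning_rate, batch_size):
    """Toy stand-in for a training run; lowest near lr=0.1 and batch_size=64."""
    return (learning_rate - 0.1) ** 2 + ((batch_size - 64) / 64) ** 2


def random_search(objective, space, trials=50, seed=0):
    """Sample `trials` configurations and return (best_loss, best_params)."""
    rng = random.Random(seed)
    best_loss, best_params = float("inf"), None
    for _ in range(trials):
        params = {
            "learning_rate": rng.uniform(*space["learning_rate"]),
            "batch_size": rng.choice(space["batch_size"]),
        }
        loss = objective(**params)
        if loss < best_loss:
            best_loss, best_params = loss, params
    return best_loss, best_params


if __name__ == "__main__":
    # Hypothetical search space: a continuous range and a list of choices.
    space = {"learning_rate": (0.001, 1.0), "batch_size": [16, 32, 64, 128]}
    print(random_search(validation_loss, space))
```

In a managed service each call to the objective is a full training job on its own hardware, which is exactly the part that is tedious to orchestrate by hand.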

This is of course quite the announcement, and highly relevant for the work we’re doing, so expect a dedicated blog post in the near future.

5. BONUS: DEEP LEARNING ON THE EDGE.

DeepLens is a standalone, smart wireless video camera marketed as an aid to educate machine learning engineers. The device processes camera input through a deep learning model, using just about all of the technologies discussed before. From what I understood from the talk, the model training phase remains in the cloud, whereas models can be deployed either in the cloud or on the device itself, with input capturing for model retraining enabled through Lambda-enabled REST APIs.

Joris Van den Borre

Founder, CEO and solutions architect
