Creating a regression model in Databricks

Curious on how to do machine learning in Databricks? Selin explains how to predict taxi fares using historical data.
Selin Kanatli

In this article, we will explain the complete workflow of creating a regression model in Databricks and starting with loading the data into Databricks itself, to actually making a regression model.

Our example is based on the NYC taxi-rides dataset, which is a publicly available dataset containing data from every taxi ride in New York since 2009. Since this dataset is ginormous, we will only be using a part of the dataset. For our example, we will be building a very simple regression model based on this subset of the taxi rides dataset. This regression model will be built to predict the fee per ride based on multiple variables.

Databricks

Databricks is a fully managed cloud service that is designed for data analytics. This includes data science, SQL analytics, and much more. It allows for a unified platform where multiple processes of data analytics can come together, while also being very secure. There are numerous partners that can be integrated with Databricks, such as Microsoft, AWS, Tableau, etc. Databricks also allows multiple languages to be used in its workbook. These include Spark SQL, Python, Java, R, and standard SQL. However, in this example, we will be working with AWS and Python.

Regression model

Before starting anything, it is essential to import all your modules. These go from NumPy, to sklearn, to MLFlow. It is also very important to define your experiment. Databricks has a fantastic feature that tracks all your machine learning runs by saving the variables you have defined, but we will come back to this later. It is essential to remember to create and set your experiment in the beginning. If the experiment does not exist, you can just run this code once to make it. You will get an error, but it will create the experiment. Then you just rerun this code to set it. If the experiment does exist, setting it once is enough.

# Set the experiment name to an experiment in the shared experiments folder
mlflow.set_experiment("/Users/[email protected]/Taxi Rides/MLFlow/regression_exp")

Loading data into Databricks

To create our regression model, we have to start with loading the data. The taxi rides dataset can easily be downloaded from the internet and is completely free. For our ease, we uploaded the dataset to an S3 bucket on AWS with partitioning by year and month. This allowed us to load a subset of the data very quickly. The subset we chose was August 2012, this had no specific reason.

The subset of the data could be loaded using the following code. Here, you define your file location (which in our case was an S3 bucket), the file type, and then you can load your data into a variable.

# Loading our august 2012 data subset
file_location = "s3://311701907485-snowflake-landingzone/deltalake/taxirides/year=2012/month=8"
file_type = "delta"
df_2012_8 = spark.read.format(file_type).load(file_location)

This loads your dataset in the same way it was saved in your S3, so it looks for the column names and imports them as well. Since Databricks has its own language called Spark SQL, it also uses its own modules. In this case, since we are using Python to do the programming, Databricks uses its PySpark adapter. Thus, the dataset we loaded will be presented in a PySpark Dataframe.

Cleaning the data

Once your dataset is loaded into Databricks and you can display it, you can start going over it. This is important since you want to know which columns are categorical and continuous variables. You can show your data frame very quickly by running the following code.

# Show dataframe
display(df_2012_8)

This presents your data frame in a very nice table showing the first 1000 rows, if not defined otherwise. You also get a list with the column names and column types.

Once we could look at the column names and the values inside them, we decided which columns would be relevant for our analysis. This was done purely on the reasoning of which columns would possibly predict the total fee. Some columns had no values at all or were just not relevant at all, so these were removed. Finally, we landed on 10 columns that would be relevant for this analysis. So, we created a new data frame that included these 10 columns (i.e., the independent variables) and our dependent variable (i.e., output variable).

# Select specific columns
df = df_2012_8.select("TOTAL_AMOUNT", "VENDOR_ID", "RATE_CODE_ID", "TRIP_DISTANCE", "EXTRA", "MTA_TAX", "TOLLS_AMOUNT", "PASSENGER_COUNT", "PICKUP_BOROCODE", "DROPOFF_BOROCODE", "PRECIPITATION")

In the next step we looked for null values. Since the dataset is so large, we decided to remove rows containing null values.

# Drop the rows with null values
df_1 = df.na.drop()

To clean the data even further, we looked at the distinct values in the columns. In the passenger count column, for example, it was important that this did not equal zero. It is not possible for a fee to be calculated when there were no passengers. So, these were removed as well. We looked at the distinct values of the other columns and the inappropriate values were removed as well.

# Cash tips are not included in the dataset, so these need to be removed from the dataset
df_2 = df_1.where('PAYMENT_TYPE == "CRD"')
# Passengers cannot be 0, since this will influence the analysis, these rows have to be removed
df_2 = df_2.where('PASSENGER_COUNT != 0')
# Trip distance also cannot be 0
df_2 = df_2.where('TRIP_DISTANCE != 0')
# Remove values rate code that seem like a mistake
df_2 = df_2.where('RATE_CODE_ID < 7')
df_2 = df_2.where('RATE_CODE_ID != 0')

Finally, we created plots of the tip amount and fare amount columns; this was done to see whether there were outliers. So, we visually decided on a cutoff for the values and then removed columns with values larger than that value. This made sure that the model wasn’t influenced by outliers too much.

# Remove the extreme tip values
df_2 = df_2.where('TIP_AMOUNT <= 50')
# Remove extreme fare values
df_2 = df_2.where('FARE_AMOUNT <= 200')

Preparing the data for the model

Now that the data is checked and cleaned, we can transform it to the variable types we need. Since vendor ID is a string value, we needed to change it to a value type that would work with linear regression. In this case, we decided to change it to a dichotomous variable, since there were only two vendor IDs. There were some other columns that also consisted of categorical variables, these also had to be dealt with. There are many ways to deal with categorical variables, for example, dummy coding, simple coding, deviation coding…

Since we were working with MLFlow, the data frame had to have a specific structure. In this example, we had to create two arrays; one that contained the independent variables, and one that included the dependent variable.

# Transform dataframe to np.array
X_array = np.array(df_2.select("CMT", "VTS", "RATE_CODE_ID", "TRIP_DISTANCE", "EXTRA", "MTA_TAX", "TOLLS_AMOUNT", "PASSENGER_COUNT", "PICKUP_BOROCODE", "DROPOFF_BOROCODE", "PRECIPITATION").collect())
# Transform dataframe to np.array
Y_array = np.array(df_2.select("TOTAL_AMOUNT").collect())

Creating the model

After transforming the dataset to the desired format and type, we can finally build the model. There are many ways of creating your model, but we decided to use the MLFlow module.

In this example, we started by defining our training model. In this model, we expressed our evaluation metrics, which were root mean squared error, mean squared error, mean average error and R-squared. These measures can be whatever you need them to be. Next, we set a random seed to try to make our results reproducible. Then we split our data into train and test data, here, you can also choose the split percentage. The following step is to create our MLFlow experiment, we put this in a ‘with’ loop to make sure the train_model finishes even if the experiment fails. Here, you define your model (in our case linear regression), and you also log your measures. You can log whatever you want. We decided to log the measures for easy comparison, but also the independent variables in case we wanted to drop or add some. We also decided to log our model. This means that our model is stored in Databricks (this can be any place you define) and this allows you to load your model whenever you want. Thus, you can load your model in a different workbook and run this model on a different dataset or subset if you would like.

# Define model for training
def train_model(X, Y):
    # Evaluate metrics
    def eval_metrics(actual, pred):
        rmse = np.sqrt(mean_squared_error(actual, pred))
        mse = mean_squared_error(actual, pred)
        mae = mean_absolute_error(actual, pred)
        r2 = r2_score(actual, pred)
        return rmse, mse, mae, r2

    np.random.seed(42)

    # Split the data into training and test sets. (0.75, 0.25) split.
    X_train, X_test, y_train, y_test = train_test_split(X, Y)

    # Start an MLflow run; the "with" keyword ensures we'll close the run even if this cell crashes
    with mlflow.start_run():
        lr = linear_model.LinearRegression()
        lr.fit(X_train, y_train)

        predicted_qualities = lr.predict(X_test)
        (rmse, mse, mae, r2) = eval_metrics(y_test, predicted_qualities)

        # Print out ElasticNet model metrics
        print(" RMSE: %s" % rmse)
        print(" MSE: %s" % mse)
        print(" MAE: %s" % mae)
        print(" R2: %s" % r2)

        # Log mlflow attributes for mlflow UI
        mlflow.log_metric("rmse", rmse)
        mlflow.log_metric("mse", mse)
        mlflow.log_metric("r2", r2)
        mlflow.log_metric("mae", mae)
        mlflow.sklearn.log_model(lr, "model")
        modelpath = "/dbfs/mlflow/taxirides/model-20"
        mlflow.sklearn.save_model(lr, modelpath)
        mlflow.set_tag('Dataset', 'VENDOR_ID, RATE_CODE_ID, TRIP_DISTANCE, EXTRA, MTA_TAX, TOLLS_AMOUNT, PASSENGER_COUNT, PICKUP_BOROCODE, DROPOFF_BOROCODE, PRECIPITATION')

Interpreting the model

The last step is to look at your results and interpret them. You can easily compare multiple runs in your experiment tab. Databricks stores all your runs with all the measurements you defined in the model.

Since we decided to save our model, we can also load it and run our model on a different subset of the dataset. You can also plot your predicted and actual values to check whether there is correlation.

Summary

To summarise, Databricks is a fantastic cloud-based tool to do machine learning on. You can connect multiple data warehouses, use various languages, and there is a lot of automatization (even more than we used in our experiment!). There is little it cannot do in terms of machine learning.

Selin Kanatli

Analytics Engineer

In just 6 weeks, Jacob had the opportunity to learn and grow through a series of courses designed to equip him with the skills and knowledge necessary to succeed in the data industry.

Revisiting my 6 weeks onboarding training

If you’re working in a hands-on data role using Snowflake, Databricks, or Bigquery, chances are you’ve encountered dbt as a companion technology. 🎉 On April 3rd, 2023, dbt Labs announced that Tropos.io became one of the 5 premier partners worldwide.

Exclusive! We Are Excited To Be A Dbt Premier Partner in 2023

The how-to guide to interpreting Snowflake's usage-based pricing model.

Cookie	Duration	Description
__hssrc	session	This cookie is set by Hubspot whenever it changes the session cookie. The __hssrc cookie set to 1 indicates that the user has restarted the browser, and if the cookie does not exist, it is assumed to be a new session.
cookielawinfo-checkbox-advertisement	1 year	Set by the GDPR Cookie Consent plugin, this cookie is used to record the user consent for the cookies in the "Advertisement" category .
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
CookieLawInfoConsent	1 year	Records the default button state of the corresponding category & the status of CCPA. It works only in coordination with the primary cookie.
elementor	never	This cookie is used by the website's WordPress theme. It allows the website owner to implement or change the website's content in real-time.
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

Cookie	Duration	Description
__cf_bm	30 minutes	This cookie, set by Cloudflare, is used to support Cloudflare Bot Management.
__hssc	30 minutes	HubSpot sets this cookie to keep track of sessions and to determine if HubSpot should increment the session number and timestamps in the __hstc cookie.
bcookie	2 years	LinkedIn sets this cookie from LinkedIn share buttons and ad tags to recognize browser ID.
bscookie	2 years	LinkedIn sets this cookie to store performed actions on the website.
lang	session	LinkedIn sets this cookie to remember a user's language setting.
lidc	1 day	LinkedIn sets the lidc cookie to facilitate data center selection.
UserMatchHistory	1 month	LinkedIn sets this cookie for LinkedIn Ads ID syncing.

Cookie	Duration	Description
__hstc	5 months 27 days	This is the main cookie set by Hubspot, for tracking visitors. It contains the domain, initial timestamp (first visit), last timestamp (last visit), current timestamp (this visit), and session number (increments for each subsequent session).
_ga	2 years	The _ga cookie, installed by Google Analytics, calculates visitor, session and campaign data and also keeps track of site usage for the site's analytics report. The cookie stores information anonymously and assigns a randomly generated number to recognize unique visitors.
_ga_ZET6HEX39B	2 years	This cookie is installed by Google Analytics.
_gat_gtag_UA_75663021_2	1 minute	Set by Google to distinguish users.
_gat_UA-75663021-2	1 minute	A variation of the _gat cookie set by Google Analytics and Google Tag Manager to allow website owners to track visitor behaviour and measure site performance. The pattern element in the name contains the unique identity number of the account or website it relates to.
_gid	1 day	Installed by Google Analytics, _gid cookie stores information on how visitors use a website, while also creating an analytics report of the website's performance. Some of the data that are collected include the number of visitors, their source, and the pages they visit anonymously.
CONSENT	2 years	YouTube sets this cookie via embedded youtube-videos and registers anonymous statistical data.
hubspotutk	5 months 27 days	HubSpot sets this cookie to keep track of the visitors to the website. This cookie is passed to HubSpot on form submission and used when deduplicating contacts.
undefined	never	Wistia sets this cookie to collect data on visitor interaction with the website's video-content, to make the website's video-content more relevant for the visitor.

Cookie	Duration	Description
_fbp	3 months	This cookie is set by Facebook to display advertisements when either on Facebook or on a digital platform powered by Facebook advertising, after visiting the website.
fr	3 months	Facebook sets this cookie to show relevant advertisements to users by tracking user behaviour across the web, on sites that have Facebook pixel or Facebook social plugin.
VISITOR_INFO1_LIVE	5 months 27 days	A cookie set by YouTube to measure bandwidth that determines whether the user gets the new or old player interface.
YSC	session	YSC cookie is set by Youtube and is used to track the views of embedded videos on Youtube pages.
yt-remote-connected-devices	never	YouTube sets this cookie to store the video preferences of the user using embedded YouTube video.
yt-remote-device-id	never	YouTube sets this cookie to store the video preferences of the user using embedded YouTube video.
yt.innertube::nextId	never	This cookie, set by YouTube, registers a unique ID to store data on what videos from YouTube the user has seen.
yt.innertube::requests	never	This cookie, set by YouTube, registers a unique ID to store data on what videos from YouTube the user has seen.

Cookie	Duration	Description
AnalyticsSyncHistory	1 month	No description
li_gc	2 years	No description
loglevel	never	No description available.