Creating a regression model in Databricks

In this article, we will explain the complete workflow of creating a regression model in Databricks and starting with loading the data into Databricks itself, to actually making a regression model.

Our example is based on the NYC taxi-rides dataset, which is a publicly available dataset containing data from every taxi ride in New York since 2009. Since this dataset is ginormous, we will only be using a part of the dataset. For our example, we will be building a very simple regression model based on this subset of the taxi rides dataset. This regression model will be built to predict the fee per ride based on multiple variables.

Databricks

Databricks is a fully managed cloud service that is designed for data analytics. This includes data science, SQL analytics, and much more. It allows for a unified platform where multiple processes of data analytics can come together, while also being very secure. There are numerous partners that can be integrated with Databricks, such as Microsoft, AWS, Tableau, etc. Databricks also allows multiple languages to be used in its workbook. These include Spark SQL, Python, Java, R, and standard SQL. However, in this example, we will be working with AWS and Python.

Regression model

Before starting anything, it is essential to import all your modules. These go from NumPy, to sklearn, to MLFlow. It is also very important to define your experiment. Databricks has a fantastic feature that tracks all your machine learning runs by saving the variables you have defined, but we will come back to this later. It is essential to remember to create and set your experiment in the beginning. If the experiment does not exist, you can just run this code once to make it. You will get an error, but it will create the experiment. Then you just rerun this code to set it. If the experiment does exist, setting it once is enough.

# Set the experiment name to an experiment in the shared experiments folder
mlflow.set_experiment("/Users/[email protected]/Taxi Rides/MLFlow/regression_exp")

Loading data into Databricks

To create our regression model, we have to start with loading the data. The taxi rides dataset can easily be downloaded from the internet and is completely free. For our ease, we uploaded the dataset to an S3 bucket on AWS with partitioning by year and month. This allowed us to load a subset of the data very quickly. The subset we chose was August 2012, this had no specific reason.

The subset of the data could be loaded using the following code. Here, you define your file location (which in our case was an S3 bucket), the file type, and then you can load your data into a variable.

# Loading our august 2012 data subset
file_location = "s3://311701907485-snowflake-landingzone/deltalake/taxirides/year=2012/month=8"
file_type = "delta"
df_2012_8 = spark.read.format(file_type).load(file_location)

This loads your dataset in the same way it was saved in your S3, so it looks for the column names and imports them as well. Since Databricks has its own language called Spark SQL, it also uses its own modules. In this case, since we are using Python to do the programming, Databricks uses its PySpark adapter. Thus, the dataset we loaded will be presented in a PySpark Dataframe.

Cleaning the data

Once your dataset is loaded into Databricks and you can display it, you can start going over it. This is important since you want to know which columns are categorical and continuous variables. You can show your data frame very quickly by running the following code.

# Show dataframe
display(df_2012_8)

This presents your data frame in a very nice table showing the first 1000 rows, if not defined otherwise. You also get a list with the column names and column types.

Once we could look at the column names and the values inside them, we decided which columns would be relevant for our analysis. This was done purely on the reasoning of which columns would possibly predict the total fee. Some columns had no values at all or were just not relevant at all, so these were removed. Finally, we landed on 10 columns that would be relevant for this analysis. So, we created a new data frame that included these 10 columns (i.e., the independent variables) and our dependent variable (i.e., output variable).

# Select specific columns
df = df_2012_8.select("TOTAL_AMOUNT", "VENDOR_ID", "RATE_CODE_ID", "TRIP_DISTANCE", "EXTRA", "MTA_TAX", "TOLLS_AMOUNT", "PASSENGER_COUNT", "PICKUP_BOROCODE", "DROPOFF_BOROCODE", "PRECIPITATION")

In the next step we looked for null values. Since the dataset is so large, we decided to remove rows containing null values.

# Drop the rows with null values
df_1 = df.na.drop()

To clean the data even further, we looked at the distinct values in the columns. In the passenger count column, for example, it was important that this did not equal zero. It is not possible for a fee to be calculated when there were no passengers. So, these were removed as well. We looked at the distinct values of the other columns and the inappropriate values were removed as well.

# Cash tips are not included in the dataset, so these need to be removed from the dataset
df_2 = df_1.where('PAYMENT_TYPE == "CRD"')
# Passengers cannot be 0, since this will influence the analysis, these rows have to be removed
df_2 = df_2.where('PASSENGER_COUNT != 0')
# Trip distance also cannot be 0
df_2 = df_2.where('TRIP_DISTANCE != 0')

# Remove values rate code that seem like a mistake
df_2 = df_2.where('RATE_CODE_ID < 7')
df_2 = df_2.where('RATE_CODE_ID != 0')

Finally, we created plots of the tip amount and fare amount columns; this was done to see whether there were outliers. So, we visually decided on a cutoff for the values and then removed columns with values larger than that value. This made sure that the model wasn’t influenced by outliers too much.

# Remove the extreme tip values
df_2 = df_2.where('TIP_AMOUNT <= 50')
# Remove extreme fare values
df_2 = df_2.where('FARE_AMOUNT <= 200')

Preparing the data for the model

Now that the data is checked and cleaned, we can transform it to the variable types we need. Since vendor ID is a string value, we needed to change it to a value type that would work with linear regression. In this case, we decided to change it to a dichotomous variable, since there were only two vendor IDs. There were some other columns that also consisted of categorical variables, these also had to be dealt with. There are many ways to deal with categorical variables, for example, dummy coding, simple coding, deviation coding…

Since we were working with MLFlow, the data frame had to have a specific structure. In this example, we had to create two arrays; one that contained the independent variables, and one that included the dependent variable.

# Transform dataframe to np.array
X_array = np.array(df_2.select("CMT", "VTS", "RATE_CODE_ID", "TRIP_DISTANCE", "EXTRA", "MTA_TAX", "TOLLS_AMOUNT", "PASSENGER_COUNT", "PICKUP_BOROCODE", "DROPOFF_BOROCODE", "PRECIPITATION").collect())
# Transform dataframe to np.array
Y_array = np.array(df_2.select("TOTAL_AMOUNT").collect())

Creating the model

After transforming the dataset to the desired format and type, we can finally build the model. There are many ways of creating your model, but we decided to use the MLFlow module.

In this example, we started by defining our training model. In this model, we expressed our evaluation metrics, which were root mean squared error, mean squared error, mean average error and R-squared. These measures can be whatever you need them to be. Next, we set a random seed to try to make our results reproducible. Then we split our data into train and test data, here, you can also choose the split percentage. The following step is to create our MLFlow experiment, we put this in a ‘with’ loop to make sure the train_model finishes even if the experiment fails. Here, you define your model (in our case linear regression), and you also log your measures. You can log whatever you want. We decided to log the measures for easy comparison, but also the independent variables in case we wanted to drop or add some. We also decided to log our model. This means that our model is stored in Databricks (this can be any place you define) and this allows you to load your model whenever you want. Thus, you can load your model in a different workbook and run this model on a different dataset or subset if you would like.

# Define model for training
def train_model(X, Y):
    # Evaluate metrics
    def eval_metrics(actual, pred):
        rmse = np.sqrt(mean_squared_error(actual, pred))
        mse = mean_squared_error(actual, pred)
        mae = mean_absolute_error(actual, pred)
        r2 = r2_score(actual, pred)
        return rmse, mse, mae, r2

    np.random.seed(42)

    # Split the data into training and test sets. (0.75, 0.25) split.
    X_train, X_test, y_train, y_test = train_test_split(X, Y)

    # Start an MLflow run; the "with" keyword ensures we'll close the run even if this cell crashes
    with mlflow.start_run():
        lr = linear_model.LinearRegression()
        lr.fit(X_train, y_train)

        predicted_qualities = lr.predict(X_test)
        (rmse, mse, mae, r2) = eval_metrics(y_test, predicted_qualities)

        # Print out ElasticNet model metrics
        print(" RMSE: %s" % rmse)
        print(" MSE: %s" % mse)
        print(" MAE: %s" % mae)
        print(" R2: %s" % r2)

        # Log mlflow attributes for mlflow UI
        mlflow.log_metric("rmse", rmse)
        mlflow.log_metric("mse", mse)
        mlflow.log_metric("r2", r2)
        mlflow.log_metric("mae", mae)
        mlflow.sklearn.log_model(lr, "model")
        modelpath = "/dbfs/mlflow/taxirides/model-20"
        mlflow.sklearn.save_model(lr, modelpath)
        mlflow.set_tag('Dataset', 'VENDOR_ID, RATE_CODE_ID, TRIP_DISTANCE, EXTRA, MTA_TAX, TOLLS_AMOUNT, PASSENGER_COUNT, PICKUP_BOROCODE, DROPOFF_BOROCODE, PRECIPITATION')

Interpreting the model

The last step is to look at your results and interpret them. You can easily compare multiple runs in your experiment tab. Databricks stores all your runs with all the measurements you defined in the model.

Since we decided to save our model, we can also load it and run our model on a different subset of the dataset. You can also plot your predicted and actual values to check whether there is correlation.

Summary

To summarise, Databricks is a fantastic cloud-based tool to do machine learning on. You can connect multiple data warehouses, use various languages, and there is a lot of automatization (even more than we used in our experiment!). There is little it cannot do in terms of machine learning.

Selin Kanatli
Selin Kanatli
Analytics Engineer

Related articles