In this lab, we'll apply regression analysis using CART (decision) trees and use hyperparameter tuning to improve the model.
In this lab you will:
- Perform the full process of cleaning data, tuning hyperparameters, creating visualizations, and evaluating decision tree models
- Determine the optimal hyperparameters for a decision tree model and evaluate the performance of decision tree models
The dataset is available in the file `ames.csv`.
- Import the dataset and examine its dimensions:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('ggplot')
%matplotlib inline
# Load the Ames housing dataset
data = None
# Print the dimensions of data
# Check out the info for the dataframe
# Show the first 5 rows
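One way this cell could be filled in is sketched below. Since `ames.csv` may not be on hand, a tiny stand-in DataFrame (values taken from a few typical Ames rows) is built in place of the real `pd.read_csv` call, which is left as a comment:

```python
import pandas as pd

# In the lab you would load the real file:
# data = pd.read_csv('ames.csv')
# Stand-in frame with the three features and target used later (values assumed)
data = pd.DataFrame({
    'LotArea': [8450, 9600, 11250],
    '1stFlrSF': [856, 1262, 920],
    'GrLivArea': [1710, 1262, 1786],
    'SalePrice': [208500, 181500, 223500],
})

# Print the dimensions of data
print(data.shape)
# Check out the info for the dataframe
data.info()
# Show the first 5 rows
data.head()
```

With the real file, `data.shape` would report all rows and columns of the Ames dataset rather than this three-row stand-in.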
In this lab, we will use 3 continuous predictive features:

- `LotArea`: Lot size in square feet
- `1stFlrSF`: Size of first floor in square feet
- `GrLivArea`: Above grade (ground) living area square feet

The target variable is `SalePrice`, the sale price of the home, in dollars.

- Create DataFrames for the features and the target variable as shown above
- Inspect the contents of both the features and the target variable
# Features and target data
target = None
features = None
- Use scatter plots to show the correlation between the chosen features and the target variable
- Comment on each scatter plot
# Your code here
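A possible sketch for the scatter plots, using the same stand-in data as above (in the notebook, `%matplotlib inline` handles rendering; the `Agg` backend here just makes the sketch runnable outside a notebook):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for running outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in data with the lab's three features and target (values assumed)
data = pd.DataFrame({
    'LotArea': [8450, 9600, 11250, 9550, 14260],
    '1stFlrSF': [856, 1262, 920, 961, 1145],
    'GrLivArea': [1710, 1262, 1786, 1717, 2198],
    'SalePrice': [208500, 181500, 223500, 140000, 250000],
})
features = data[['LotArea', '1stFlrSF', 'GrLivArea']]
target = data['SalePrice']

# One scatter plot per feature against the target
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, col in zip(axes, features.columns):
    ax.scatter(features[col], target, alpha=0.5)
    ax.set_xlabel(col)
    ax.set_ylabel('SalePrice')
plt.tight_layout()
```

On the full dataset, you would comment on how strongly each feature trends with `SalePrice` and whether the relationship looks linear.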
- Import `r2_score` and `mean_squared_error` from `sklearn.metrics`
- Create a function `performance(true, predicted)` to calculate and return both the R-squared score and Root Mean Squared Error (RMSE) for two equal-sized arrays of true and predicted values
- Depending on your version of sklearn, to get the RMSE you will need to either set `squared=False` or take the square root of the output of the `mean_squared_error` function - check out the documentation
- The benefit of calculating RMSE instead of the Mean Squared Error (MSE) is that RMSE is in the same units as the target - here, that means RMSE will be in dollars, measuring how far off, on average, our predictions are from the actual home prices
# Import metrics
# Define the function
def performance(y_true, y_predict):
"""
Calculates and returns the two performance scores between
true and predicted values - first R-Squared, then RMSE
"""
# Calculate the r2 score between 'y_true' and 'y_predict'
# Calculate the root mean squared error between 'y_true' and 'y_predict'
# Return the score
pass
# Test the function
score = performance([3, -0.5, 2, 7, 4.2], [2.5, 0.0, 2.1, 7.8, 5.3])
score
# [0.9228556485355649, 0.6870225614927066]
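One way to complete the function is sketched below. Taking `np.sqrt` of the MSE avoids depending on whether your sklearn version supports `squared=False`:

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

def performance(y_true, y_predict):
    """
    Calculates and returns the two performance scores between
    true and predicted values - first R-Squared, then RMSE
    """
    # Calculate the r2 score between 'y_true' and 'y_predict'
    r2 = r2_score(y_true, y_predict)
    # Root of the MSE works on every sklearn version
    rmse = np.sqrt(mean_squared_error(y_true, y_predict))
    # Return the scores
    return [r2, rmse]

score = performance([3, -0.5, 2, 7, 4.2], [2.5, 0.0, 2.1, 7.8, 5.3])
score  # [0.9228556485355649, 0.6870225614927066]
```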
- Split `features` and `target` datasets into training/test data (80/20)
- For reproducibility, use `random_state=42`
from sklearn.model_selection import train_test_split
# Split the data into training and test subsets
x_train, x_test, y_train, y_test = None
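The split itself is a one-liner; here is a sketch using stand-in arrays in place of the real `features`/`target` (100 rows assumed):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays in place of the real features/target DataFrames
rng = np.random.default_rng(0)
features = rng.normal(size=(100, 3))
target = rng.normal(size=100)

# 80/20 split, seeded for reproducibility
x_train, x_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42)

print(x_train.shape, x_test.shape)  # (80, 3) (20, 3)
```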
- Import the `DecisionTreeRegressor` class
- Run a baseline model for later comparison using the datasets created above
- Generate predictions for the test dataset and calculate the performance measures using the function created above
- Use `random_state=45` for the tree instance
- Record your observations
# Import DecisionTreeRegressor
# Instantiate DecisionTreeRegressor
# Set random_state=45
regressor = None
# Fit the model to training data
# Make predictions on the test data
y_pred = None
# Calculate performance using the performance() function
score = None
score
# [0.5961521990414137, 55656.48543887347] - R2, RMSE
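A sketch of the baseline step on synthetic data standing in for the Ames split (the function `y = 3*x0 + 2*x1 - x2 + noise` is an assumption used only to make the sketch runnable, so the scores will differ from the lab's):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score, mean_squared_error

# Synthetic regression data standing in for the Ames features/target
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = 3 * X[:, 0] + 2 * X[:, 1] - X[:, 2] + rng.normal(scale=0.5, size=200)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Baseline tree: no depth limit, fixed seed
regressor = DecisionTreeRegressor(random_state=45)
regressor.fit(x_train, y_train)
y_pred = regressor.predict(x_test)

score = [r2_score(y_test, y_pred),
         np.sqrt(mean_squared_error(y_test, y_pred))]
score  # [R2, RMSE]
```

An unconstrained tree tends to overfit the training data, which is why the lab's baseline test RMSE is large; the tuning steps below try to improve on it.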
- Find the best tree depth using a depth range of 1-30
- Run the regressor repeatedly in a `for` loop for each depth value
- Use `random_state=45` for reproducibility
- Calculate RMSE and R-squared for each run
- Plot both performance measures for all runs
- Comment on the output
# Your code here
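The depth search can be sketched as below; the synthetic data is again a stand-in for the lab's train/test split, and you would add a `plt.plot` of both score lists against `depths` for the visualization:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score, mean_squared_error

# Synthetic stand-in for the Ames train/test split
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = 3 * X[:, 0] + 2 * X[:, 1] - X[:, 2] + rng.normal(scale=0.5, size=200)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit one tree per depth, collecting both test-set scores
depths = range(1, 31)
r2_scores, rmse_scores = [], []
for depth in depths:
    reg = DecisionTreeRegressor(max_depth=depth, random_state=45)
    reg.fit(x_train, y_train)
    y_pred = reg.predict(x_test)
    r2_scores.append(r2_score(y_test, y_pred))
    rmse_scores.append(np.sqrt(mean_squared_error(y_test, y_pred)))

best_depth = depths[int(np.argmax(r2_scores))]  # depth with highest test R2
```

Typically both curves flatten (or test performance degrades) past some depth, which is the overfitting point to comment on.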
- Repeat the above process for `min_samples_split`
- Use a range of values from 2-10 for this hyperparameter
- Use `random_state=45` for reproducibility
- Visualize the output and comment on results as above
# Your code here
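The same loop pattern applies, swapping the hyperparameter; again sketched on stand-in data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score, mean_squared_error

# Synthetic stand-in for the Ames train/test split
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = 3 * X[:, 0] + 2 * X[:, 1] - X[:, 2] + rng.normal(scale=0.5, size=200)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# One tree per min_samples_split value in 2-10
results = {}
for s in range(2, 11):
    reg = DecisionTreeRegressor(min_samples_split=s, random_state=45)
    reg.fit(x_train, y_train)
    y_pred = reg.predict(x_test)
    results[s] = (r2_score(y_test, y_pred),
                  np.sqrt(mean_squared_error(y_test, y_pred)))

best_split = max(results, key=lambda s: results[s][0])  # highest test R2
```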
- Use the best values for `max_depth` and `min_samples_split` found in the previous runs and run an optimized model with these values
- Calculate the performance and comment on the output
# Your code here
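Combining the two results looks like the sketch below; `max_depth=7` and `min_samples_split=5` are placeholders, since the actual best values come from your own searches:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score, mean_squared_error

# Synthetic stand-in for the Ames train/test split
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = 3 * X[:, 0] + 2 * X[:, 1] - X[:, 2] + rng.normal(scale=0.5, size=200)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Plug in the best values found by the two searches (7 and 5 are placeholders)
tuned = DecisionTreeRegressor(max_depth=7, min_samples_split=5, random_state=45)
tuned.fit(x_train, y_train)
y_pred = tuned.predict(x_test)
score = [r2_score(y_test, y_pred),
         np.sqrt(mean_squared_error(y_test, y_pred))]
```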
- How about bringing in some more features from the original dataset which may be good predictors?
- Also, try tuning more hyperparameters, like `max_features`, to find a more optimal version of the model
# Your code here
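For tuning several hyperparameters at once, sklearn's `GridSearchCV` is a natural next step beyond hand-written loops; a sketch on stand-in data with two extra assumed features:

```python
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Stand-in data with 5 columns, as if two more predictors had been added
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 5))
y = 3 * X[:, 0] + 2 * X[:, 1] - X[:, 2] + rng.normal(scale=0.5, size=200)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Cross-validated search over several hyperparameters at once
param_grid = {
    'max_depth': [5, 7, 9, None],
    'min_samples_split': [2, 5, 10],
    'max_features': [None, 'sqrt', 3],
}
search = GridSearchCV(DecisionTreeRegressor(random_state=45),
                      param_grid, cv=5,
                      scoring='neg_root_mean_squared_error')
search.fit(x_train, y_train)
search.best_params_  # best combination found by cross-validation
```

Cross-validation picks the combination by average validation score rather than a single test split, which reduces the chance of tuning to noise.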
In this lab, we looked at applying a decision-tree-based regression analysis on the Ames Housing dataset. We saw how to train various models to find the optimal values for hyperparameters.