Lab 08 Machine Learning

RE 519 Data Analytics and Visualization | Autumn 2025


In Lab 8, we will briefly introduce decision trees (supervised) and k-means clustering (unsupervised), two classical machine learning methods. The due date for each lab can be found on the course website. Your submission should include the Rmd file, the html file, and any other files required to rerun the code.

As of Lab 4, the use of generative AI tools (ChatGPT, Copilot, etc.) is allowed for coding assignments, but not for write-up assignments. That said, I still encourage you to write code yourself and to use AI tools as a way to debug and explore new ideas. More information is available under Academic Integrity and the Use of AI.


Lab 08-A: Decision Trees

For Part A, we again use the King County sales data maintained by Andy Krause at Zillow. You can find the datasets and the Readme file in this repository. We are going to use decision trees to predict sale prices.

We will use tidymodels, a collection of R packages for modeling and machine learning that follows tidyverse principles. It provides a consistent framework for the whole modeling workflow.

#install.packages("tidymodels") # you only need to install once
#install.packages("devtools") # if you are using a Windows computer
#devtools::install_github('andykrause/kingCoData') # you only need to install once
#install.packages("rpart.plot") # you only need to install once
library(rpart.plot) # the package for visualizing trees
library(tidymodels)  
library(kingCoData) # load the data package
data(kingco_sales) # load the sale data
sales <- kingco_sales %>% # only select the sales in 2023
  filter(sale_date >= as.Date("2023-01-01") & sale_date <= as.Date("2023-12-31"))

Data Preparation

sales_pred <- sales %>%
  mutate(view_mountains = view_rainier + view_olympics + view_cascades + view_territorial,
         view_water = view_sound + view_lakewash + view_lakesamm + view_otherwater)
# select some features (subjective here)
sales_pred <- sales_pred %>%
  dplyr::select(sale_price, area, city, year_built, sqft_lot, sqft, beds, view_skyline, view_mountains, view_water) 

Before training the model, we need to split the dataset into a training set and a testing set. The training data will be used to fit the models, while the testing data will be reserved to evaluate how well the model generalizes to unseen observations.

set.seed(123)
split <- initial_split(sales_pred, prop = 0.8) # 80% training data
train_data <- training(split)
test_data  <- testing(split)
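As a quick sanity check (a sketch, assuming the `split`, `train_data`, and `test_data` objects created above), you can confirm that the row counts match the 80/20 proportion:

```r
# The training set should hold roughly 80% of the rows,
# and the testing set the remaining 20%.
nrow(train_data) / nrow(sales_pred)  # close to 0.8
nrow(test_data)  / nrow(sales_pred)  # close to 0.2
```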

We are going to build a recipe for data preprocessing. A recipe defines the data-preparation steps under the tidymodels framework. (For models such as lasso and ridge we would also need to normalize the features; decision trees do not require scaling.) Please note that the object rec is only a specification: it has not been trained (prepped) on any data yet.

rec <- recipe(sale_price ~ ., data = train_data) %>% # model and dataset
  step_zv(all_numeric())  # remove variables that contain only a single value
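To make the "not trained yet" point concrete, a recipe is estimated with prep() and applied to data with bake(). This sketch assumes the `rec` and `train_data` objects defined above; tidymodels handles these steps automatically inside a workflow, so you normally do not need to call them yourself:

```r
# prep() estimates the recipe's steps from the training data
rec_trained <- prep(rec, training = train_data)

# bake() applies the trained recipe; new_data = NULL returns the
# preprocessed training data
baked_train <- bake(rec_trained, new_data = NULL)
head(baked_train)
```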

Model Specification

We start by specifying the tree. In tidymodels, the model type (decision_tree) and the computational engine (rpart) are separated.

tree_spec <- decision_tree(
  cost_complexity = tune(),  # CCP alpha
  tree_depth = tune(),       # max depth
  min_n = tune()             # min samples split
) %>%
  set_mode("regression") %>%
  set_engine("rpart")

A workflow is a container that bundles together a recipe (data preprocessing) and a model specification. Use workflow() to put the recipe and the model together. Note that we still have not trained anything yet!

tree_workflow <- workflow() %>%
  add_recipe(rec) %>%
  add_model(tree_spec)
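Printing the workflow object (assuming `tree_workflow` from above) is a quick way to confirm what it bundles: it shows the preprocessor and the model specification, but no fitted parameters, since nothing has been trained yet:

```r
# Displays the recipe steps and the (untrained) model specification
tree_workflow
```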

Tuning Grid for Hyperparameters

We can define the search space for the hyperparameters. Note: for cost_complexity(), range = c(a, b) is specified on the log10 scale, i.e., values from \(10^a\) to \(10^b\); for tree_depth() and min_n(), the ranges are in the original (integer) units.

tree_grid <- grid_regular(
  cost_complexity(range = c(-10, -3)),  # log10
  tree_depth(range = c(5, 25)),
  min_n(range = c(15, 40)),
  levels = 6 # try 6 values between the previous ranges
) 
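Because grid_regular() crosses the candidate values of every parameter, levels = 6 with three hyperparameters yields \(6^3 = 216\) candidate combinations. A quick check (assuming the `tree_grid` object above):

```r
nrow(tree_grid)  # 6 * 6 * 6 = 216 candidate combinations
head(tree_grid)  # one row per combination of the three hyperparameters
```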

In this step, we have tried different ranges for hyperparameters. For example, we will see the following results when: