Let’s try to predict the median rental price per sqft of HDB flats, given information about their location, age, distance to the nearest MRT station, and so on.
I started Module 1 of SMU Academy’s Predictive Analytics and Machine Learning class in January 2024. This data set was given as homework, and I shall use it as practice to document what was learnt in class.
At this stage of my learning, the primary objective of this exercise is to familiarize myself with the complete workflow from start to finish. Repeatedly fine-tuning hyper-parameters or doing complex feature engineering to come up with the BEST predictive model is not an objective at this stage.
As some of the code takes quite a bit of time to run, I’ve loaded the RData file to speed things up.
rm(list = ls())
sessionInfo()
# Set packages and dependencies
pacman::p_load("tidyverse", #for tidy data science practice
"tidymodels", "workflows",# for tidy machine learning
"pacman", #package manager
"devtools", #developer tools
"Hmisc", "skimr", "broom", "modelr",#for EDA
"jtools", "huxtable", "interactions", # for EDA
"ggthemes", "ggstatsplot", "GGally",
"scales", "gridExtra", "patchwork", "ggalt", "vip",
"ggstance", "ggfortify", # for ggplot
"DT", "plotly", #interactive Data Viz
# Let's install some ML-related packages that will help tidymodels::
"usemodels", "poissonreg", "agua", "sparklyr", "dials", #load computational engines
"doParallel", # for parallel processing (speedy computation)
"ranger", "xgboost", "glmnet", # engines: random forest, xgboost, regularized regression
"janitor")
load("rental_data.RData")
Read in the dataset rental_price.csv, and use skim() to provide a quick summary and overview of it.
# read in data
df <- read_csv("rental_price.csv")
skim(df)
Property name and region (which are currently characters) will need to be reclassified as factors; the same goes for ref_year and district. unit_id will be assigned an id role later in the recipe.
# make factors
data <-
df %>%
mutate(across(c(district, region, ref_year, property), as.factor))
Let’s check for correlation among numeric variables.
# check correlation between numeric variables, excluding unit_id and the loc_x/loc_y coordinates
data %>%
select(-unit_id, -loc_x, -loc_y) %>%
select_if(is.numeric) %>%
as.matrix(.) %>%
rcorr() %>%
tidy() %>%
arrange(desc(abs(estimate)))
┌───────────────────────────────────────────────────────────┐
│ column1        column2        estimate       n    p.value │
├───────────────────────────────────────────────────────────┤
│ price_median   age             -0.479     3128    0       │
│ price_median   dist_to_mrt     -0.203     3128    0       │
│ dist_to_mrt    age              0.0331    3128    0.0644  │
└───────────────────────────────────────────────────────────┘
Column names: column1, column2, estimate, n, p.value
Correlation among the numeric variables appears to be fine; none seem to be highly correlated with one another.
Before we proceed with splitting the data, let’s do some basic EDA.
# price by district, colored by region
data %>%
ggplot(aes(x = district,
y = price_median)
) +
geom_boxplot(outlier.shape = NA) +
geom_point(alpha = 0.15, aes(color = region))+
theme_bw() +
labs(x = "District",
y = "Price Median $ psf")+
theme(legend.position = "bottom")
Here we see the influence of location (as specified by district and region) on price.
data %>%
ggplot(aes(x = dist_to_mrt,
y = price_median,
color = region)
) +
geom_point(alpha = 0.2) +
geom_smooth(aes(color = region),
method = "lm")+
theme_bw()+
labs(x = "Distance to MRT Station (km)",
y = "Price Median $ psf") +
theme(legend.position = "bottom")
Distance to MRT appears to have a stronger influence on price in the CCR region, and a more muted influence in the OCR and RCR regions.
Let’s proceed with splitting the data into a training and testing set. At the same time, let’s create folds for tuning/cross validation purposes.
# split data
set.seed(2024020101)
data_split <-
data %>%
initial_split(strata = region) # strata by region
data_train <-
data_split %>%
training()
data_test <-
data_split %>%
testing()
data_fold <-
data_train %>%
vfold_cv(v = 10, strata = region)
Let’s make a few recipes for modeling.
base_rec <-
recipes::recipe(formula = price_median ~ .,
data = data_train) %>%
update_role(unit_id, new_role = "unit_id") %>%
update_role(property, new_role = "name_id") %>%
step_rm(loc_x, loc_y) %>% # remove the geo locator columns
step_dummy(all_nominal_predictors())
log_rec <-
base_rec %>%
step_log(price_median, age, dist_to_mrt) # log-transform the outcome, age, and distance to MRT
Let’s make a few models to predict price_median, which is the rental price per square foot for each property. We shall set up four models: a simple linear regression, a random forest, xgboost, and a k-nearest neighbour model. For most of them, I will set up a basic model as well as a version with hyper-parameters flagged for tuning.
# linear regression
lm_spec <-
linear_reg()
# random forest
rf_spec <-
rand_forest(trees = 1000L) %>%
set_engine("ranger") %>%
set_mode("regression")
rf_spec_for_tuning <-
rf_spec %>%
set_args(mtry = tune(),
min_n = tune())
# knn
knn_spec <-
nearest_neighbor() %>%
set_engine("kknn") %>%
set_mode("regression")
knn_spec_for_tuning <-
knn_spec %>%
set_args(neighbors = tune(),
weight_func = "optimal",
dist_power = tune())
# xgboost
xgb_spec <-
boost_tree(trees = 1000L) %>%
set_engine("xgboost") %>%
set_mode("regression")
xgb_spec_for_tuning <-
xgb_spec %>%
set_args(tree_depth = tune(),
min_n = tune(),
loss_reduction = tune(),
sample_size = tune(),
mtry = tune(),
learn_rate = tune())
Let’s combine everything into a workflow set. xgboost needs a separate workflow set as it only works with dummy variables. Combine both using bind_rows() into one set, base_set, which is used below.
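Roughly, this step looks something like the sketch below (the recipe and model list names here are illustrative, so the wflow_id values they generate may differ slightly from those shown in the results further down):
# build one workflow set for the non-xgboost models and one for xgboost,
# then stack them with bind_rows(); list names are illustrative only
non_xgb_set <-
  workflow_set(preproc = list(recipe_1 = base_rec, recipe_2 = log_rec),
               models = list(linear_reg = lm_spec,
                             rand_forest_1 = rf_spec,
                             rand_forest_2 = rf_spec_for_tuning,
                             nearest_neighbor_1 = knn_spec,
                             nearest_neighbor_2 = knn_spec_for_tuning))
xgb_set <-
  workflow_set(preproc = list(recipe_1 = base_rec, recipe_2 = log_rec),
               models = list(boost_tree_1 = xgb_spec,
                             boost_tree_2 = xgb_spec_for_tuning))
base_set <-
  bind_rows(non_xgb_set, xgb_set)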
Let’s use workflow_map() to do a first fit as well as tune the hyper-parameters; workflows without any tuning parameters are simply fit with fit_resamples(). There are still lots of things I don’t know. For example, what does “grid = 11” mean? Why not 5? Why not 25?
# first fit
set.seed(2024020102)
doParallel::registerDoParallel()
base_results <-
workflow_map(base_set,
fn = "tune_grid",
resamples = data_fold,
grid = 11,
verbose = TRUE)
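From what I can tell from the tune_grid() documentation, an integer grid is simply the number of candidate hyper-parameter combinations to try for each tuning workflow (generated automatically as a space-filling design), so grid = 11 tries 11 combinations per workflow. For more control, an explicit grid could be attached to a specific workflow instead; here is a rough sketch, where the grid values and the workflow id "recipe_1_nearest_neighbor_2" are purely illustrative:
# sketch: attach an explicit tuning grid to one workflow instead of relying
# on the integer shortcut; the workflow id below is illustrative only
knn_grid <- grid_regular(neighbors(range = c(3L, 15L)),
                         dist_power(range = c(1, 2)),
                         levels = 4)
base_set_with_grid <-
  base_set %>%
  option_add(grid = knn_grid, id = "recipe_1_nearest_neighbor_2")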
Once your machine is done tuning the hyper-parameters, you can collect the results by metric; here we use rmse. My machine is rather incompetent, so this process takes a while. Fortunately, the dataset is rather small.
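The collection itself is a one-liner; a minimal sketch of how the metrics can be pulled out, assuming we keep only rmse and sort by the mean resampled estimate:
# collect resampled metrics across the whole workflow set,
# keep only rmse, and sort from best (lowest) to worst
base_results %>%
  collect_metrics() %>%
  filter(.metric == "rmse") %>%
  arrange(mean)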
# A tibble: 72 × 9
wflow_id .config preproc model .metric .estimator mean n
<chr> <chr> <chr> <chr> <chr> <chr> <dbl> <int>
1 recipe_2_boos… Prepro… recipe boos… rmse standard 0.0559 10
2 recipe_2_boos… Prepro… recipe boos… rmse standard 0.0565 10
3 recipe_2_boos… Prepro… recipe boos… rmse standard 0.0598 10
4 recipe_2_rand… Prepro… recipe rand… rmse standard 0.0670 10
5 recipe_2_boos… Prepro… recipe boos… rmse standard 0.0684 10
# ℹ 67 more rows
# ℹ 1 more variable: std_err <dbl>
We can also visualize the results using a basic autoplot(), or rank them using workflowsets::rank_results().
autoplot(base_results) +
theme_bw() +
theme(legend.position = "bottom")
# rank results
base_results %>%
workflowsets::rank_results(rank_metric = "rmse") %>%
filter(.metric == "rmse") %>%
print(n = 5)
# A tibble: 72 × 9
wflow_id .config .metric mean std_err n preprocessor model
<chr> <chr> <chr> <dbl> <dbl> <int> <chr> <chr>
1 recipe_2_bo… Prepro… rmse 0.0559 1.15e-3 10 recipe boos…
2 recipe_2_bo… Prepro… rmse 0.0565 9.01e-4 10 recipe boos…
3 recipe_2_bo… Prepro… rmse 0.0598 1.56e-3 10 recipe boos…
4 recipe_2_ra… Prepro… rmse 0.0670 1.36e-3 10 recipe rand…
5 recipe_2_bo… Prepro… rmse 0.0684 1.81e-3 10 recipe boos…
# ℹ 67 more rows
# ℹ 1 more variable: rank <int>
The best recipe-model combination is recipe 2, which is log_rec, paired with boost_tree_3. If you recall, boost_tree_3 is the basic xgboost model with no hyper-parameters flagged for tuning. Let’s extract the hyper-parameters using select_best().
tune_param <-
base_results %>%
extract_workflow_set_result("recipe_2_boost_tree_3") %>%
select_best(metric = "rmse")
Let’s apply the parameters and finalize the workflow for the best recipe/model combination.
log_rec_xbg_boost_wflow <-
workflow() %>%
add_recipe(log_rec) %>%
add_model(xgb_spec) %>%
finalize_workflow(tune_param)
Finally, we can fit the model by performing last_fit() on the split data, data_split.
log_rec_xbg_boost_final_fit <-
log_rec_xbg_boost_wflow %>%
last_fit(data_split)
We can collect metrics as well as predictions from the last fitted model.
log_rec_xbg_boost_final_fit %>%
collect_metrics()
┌──────────────────────────────────────────────────────────┐
│ .metric   .estimator   .estimate   .config               │
├──────────────────────────────────────────────────────────┤
│ rmse      standard         0.057   Preprocessor1_Model1  │
│ rsq       standard         0.959   Preprocessor1_Model1  │
└──────────────────────────────────────────────────────────┘
Column names: .metric, .estimator, .estimate, .config
log_rec_xbg_boost_final_fit %>%
collect_predictions() %>%
ggplot(aes(x = price_median,
y = .pred)) +
geom_point(alpha = 0.2)+
geom_abline(lty = 2,
color = "dodgerblue") +
labs(x = "Actual Median Price $psf",
y = "Predicted Median Price $psf")+
theme_bw()
We can use vip() to see which features were the most important.
log_rec_xbg_boost_final_fit %>%
extract_fit_parsnip() %>%
vip(geom = "point") +
theme_bw()
As we can see, it’s location, location, location: region is the most important predictor of price, followed by age and distance to the MRT station.