HDB Rental Prices

regression linear regression random forest knn xgboost

Let’s try and predict the median rental price per sqft of HDB flats, given information about their location, age, distance to the nearest MRT station, and more.

Mark Y
2024-02-09

Predicting HDB Rental Prices

I started Module 1 of SMU Academy’s Predictive Analytics and Machine Learning class in January 2024. This data set was given in class for homework. I shall use it as practice to document what was learnt in class.

At this stage of my learning, the primary objective of this exercise is to familiarize myself with the complete workflow from start to finish. Repeatedly fine-tuning hyper-parameters or doing complex feature engineering to come up with the BEST predictive model is not an objective at this stage.

Set dependencies and load necessary packages

As some of the code takes quite a bit of time to run, I’ve loaded the RData file to speed things up.

rm(list = ls())
sessionInfo()

# Set packages and dependencies
pacman::p_load("tidyverse", #for tidy data science practice
               "tidymodels", "workflows",# for tidy machine learning
               "pacman", #package manager
               "devtools", #developer tools
               "Hmisc", "skimr", "broom", "modelr",#for EDA
               "jtools", "huxtable", "interactions", # for EDA
               "ggthemes", "ggstatsplot", "GGally",
               "scales", "gridExtra", "patchwork", "ggalt", "vip",
               "ggstance", "ggfortify", # for ggplot
               "DT", "plotly", #interactive Data Viz
               # Let's install some ML-related packages that will help tidymodels
               "usemodels", "poissonreg", "agua", "sparklyr", "dials", #load computational engines
               "doParallel", # for parallel processing (speedy computation)
               "ranger", "xgboost", "glmnet", #random forest
               "janitor")
load("rental_data.RData")

Import the data

Read in the dataset rental_price.csv, and use skim() to provide a quick summary and overview of it.

# read in data
df <- read_csv("rental_price.csv")
skim(df)

Property name and region (which are currently characters) will need to be reclassified as factors. The same goes for ref_year and district. unit_id will be assigned a role of “an id” later in the recipe.

# make factors
data <-
  df %>% 
  mutate(across(c(district, region, ref_year, property), as.factor))

Correlation check

Let’s check for correlation among numeric variables.

# check correlation between numeric, exclude loc_x, loc_y

data %>% 
  select (-unit_id, -loc_x, -loc_y) %>% 
  select_if(is.numeric) %>% 
  as.matrix(.) %>% 
  rcorr() %>% 
  tidy() %>% 
  arrange(desc(abs(estimate)))
  column1        column2        estimate       n   p.value
  price_median   age             -0.479     3128    0
  price_median   dist_to_mrt     -0.203     3128    0
  dist_to_mrt    age              0.0331    3128    0.0644

Correlation among the numeric variables appears to be fine; none of them are highly correlated with one another.
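
As an optional visual check, we can draw the same correlations as a heat map with GGally (already loaded via p_load above). This is just a sketch on the side, not part of the modelling pipeline.

# Optional visual check of the same correlations
data %>% 
  select(price_median, age, dist_to_mrt) %>% 
  GGally::ggcorr(label = TRUE, label_round = 2)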

EDA

Before we proceed with splitting the data, let’s do some basic EDA.

# price by region

data %>% 
  ggplot(aes(x = district,
              y = price_median)
         ) + 
  geom_boxplot(outlier.shape = NA) +
  geom_point(alpha = 0.15, aes(color = region))+
  theme_bw() +
  labs(x = "District",
       y = "Price Median $ psf")+
  theme(legend.position = "bottom")

Here we see the influence of location (as specified by district and region) on price.

data %>% 
  ggplot(aes(x = dist_to_mrt,
             y = price_median,
             color = region)
         ) +
  geom_point(alpha = 0.2) +
  geom_smooth(aes(color = region),
              method = "lm")+
  theme_bw()+
  labs(x = "Distance to MRT Station (km)",
       y = "Price Median $ psf") +
  theme(legend.position = "bottom")

Distance to MRT appears to have a stronger influence on price in region_CCR, and a more muted influence on price in region_OCR and region_RCR.
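
To put rough numbers on that impression, we can fit a separate simple linear model of price on distance within each region and compare the slopes. Again, this is a quick side check, separate from the modelling workflow below.

# Slope of price_median ~ dist_to_mrt within each region (rough check)
data %>% 
  group_by(region) %>% 
  group_modify(~ broom::tidy(lm(price_median ~ dist_to_mrt, data = .x))) %>% 
  filter(term == "dist_to_mrt") %>% 
  select(region, estimate, std.error, p.value)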

Split the data

Let’s proceed with splitting the data into a training and testing set. At the same time, let’s create folds for tuning/cross validation purposes.

# split data
set.seed(2024020101)
data_split <-
  data %>% 
  initial_split(strata = region) # strata by region
data_train <-
  data_split %>% 
  training()
data_test <-
  data_split %>% 
  testing()
data_fold <-
  data_train %>% 
  vfold_cv(v = 10, strata = region)
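
Since we stratified by region, a quick (optional) sanity check is to confirm that the regional mix is similar in the training and testing sets.

# Check that stratified sampling kept region proportions similar across sets
bind_rows(train = count(data_train, region),
          test  = count(data_test, region),
          .id   = "set") %>% 
  group_by(set) %>% 
  mutate(prop = n / sum(n))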

Let’s make some recipes

Let’s make a few recipes for modeling.

base_rec <-
  recipes::recipe(formula = price_median ~ .,
                 data = data_train) %>% 
  update_role(unit_id, new_role = "unit_id") %>% 
  update_role(property, new_role = "name_id") %>% 
  step_rm(loc_x, loc_y) %>%  # remove geo locators
  step_dummy(all_nominal_predictors())

log_rec <-
  base_rec %>%
  step_log(price_median, age, dist_to_mrt) # log-transform price, age and distance to MRT
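
Before moving on, it can be worth peeking at what the log recipe actually produces: prep() estimates the steps on the training data and bake(new_data = NULL) returns the processed training set. This is just an optional check.

# Optional: inspect the preprocessed training data produced by log_rec
log_rec %>% 
  prep() %>% 
  bake(new_data = NULL) %>% 
  glimpse()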

Make some models

Let’s make a few models to predict price_median, the rental price per square foot for each property. We shall set up four models: a simple linear regression, a random forest, an xgboost model, and a k-nearest neighbour model. For the latter three, I will set up both a basic specification and a version with hyper-parameters marked for tuning.

# linear regression
lm_spec <-
  linear_reg()

# random forest
rf_spec <-
  rand_forest(trees = 1000L) %>% 
  set_engine("ranger") %>% 
  set_mode("regression")

rf_spec_for_tuning <-
  rf_spec %>% 
  set_args(mtry = tune(),
           min_n = tune())

# knn
knn_spec <-
  nearest_neighbor() %>% 
  set_engine("kknn") %>% 
  set_mode("regression")

knn_spec_for_tuning <-
  knn_spec %>% 
  set_args(neighbors = tune(),
           weight_func = "optimal",
           dist_power = tune())

# xgboost
xgb_spec <-
  boost_tree(trees = 1000L) %>% 
  set_engine("xgboost") %>% 
  set_mode("regression")

xgb_spec_for_tuning <-
  xgb_spec %>% 
  set_args(tree_depth = tune(),
           min_n = tune(),
           loss_reduction = tune(),                     
           sample_size = tune(),
           mtry = tune(),        
           learn_rate = tune())        
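
As a quick check that the tune() placeholders were picked up, we can list the parameters flagged for tuning in, say, the xgboost spec, using extract_parameter_set_dials() from tidymodels.

# List the hyper-parameters flagged for tuning in the xgboost spec
xgb_spec_for_tuning %>% 
  extract_parameter_set_dials()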

Let’s combine everything into a workflow set. Since both recipes already create dummy variables with step_dummy(), xgboost (which only works with numeric predictors) can share the same workflow set as the other models, so a single workflow_set() call is enough.

base_set <-
  workflow_set (
    preproc = list(base_rec, log_rec), #preprocessor
    models = list(rf_spec, knn_spec, xgb_spec,
                  rf_spec_for_tuning, knn_spec_for_tuning, xgb_spec_for_tuning), #model
    cross = TRUE) #default is cross = TRUE
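
Printing the set confirms that we get 2 recipes x 6 model specifications = 12 workflows.

# 2 recipes x 6 model specs = 12 workflows
base_set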

First fit and tuning

Let’s use workflow_map() with tune_grid() to do a first fit as well as tune the hyper-parameters. There are still lots of things I don’t know. For example, what does “grid = 11” mean? Why not 5? Why not 25?

# first fit
set.seed(2024020102)
doParallel::registerDoParallel()
base_results <-
  workflow_map(base_set,
               fn = "tune_grid",
               resamples = data_fold,
               grid = 11,
               verbose = TRUE)
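
On the grid = 11 question above, my current understanding (worth verifying against the tune documentation) is that an integer grid simply asks tune_grid() to generate a space-filling grid of that many candidate combinations of the tuned parameters for each workflow; 5 or 25 would just mean fewer or more candidates. A hand-rolled approximation for the xgboost spec might look like this:

# Roughly what grid = 11 means for the xgboost workflows (sketch; the exact design may differ)
xgb_params <-
  xgb_spec_for_tuning %>% 
  extract_parameter_set_dials() %>% 
  finalize(data_train %>% select(-price_median)) # resolve mtry's data-dependent upper bound (approximate predictor set)
grid_latin_hypercube(xgb_params, size = 11)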

Once your machine is done tuning the hyper-parameters, you can collect the results by metric; here we use rmse. My machine is rather incompetent, so this process takes a while. Fortunately, the dataset is rather small.

base_results %>% 
  collect_metrics() %>% 
  filter(.metric == "rmse") %>% 
  arrange(mean) %>% 
  print(n = 5)
# A tibble: 72 × 9
  wflow_id       .config preproc model .metric .estimator   mean     n
  <chr>          <chr>   <chr>   <chr> <chr>   <chr>       <dbl> <int>
1 recipe_2_boos… Prepro… recipe  boos… rmse    standard   0.0559    10
2 recipe_2_boos… Prepro… recipe  boos… rmse    standard   0.0565    10
3 recipe_2_boos… Prepro… recipe  boos… rmse    standard   0.0598    10
4 recipe_2_rand… Prepro… recipe  rand… rmse    standard   0.0670    10
5 recipe_2_boos… Prepro… recipe  boos… rmse    standard   0.0684    10
# ℹ 67 more rows
# ℹ 1 more variable: std_err <dbl>

Visualize results

We can also visualize the results using a basic autoplot(), or rank the results using workflowsets::rank_results().

autoplot(base_results) +
  theme_bw() +
  theme(legend.position = "bottom")

# rank results
base_results %>% 
  workflowsets::rank_results(rank_metric = "rmse") %>% 
  filter(.metric == "rmse") %>% 
  print(n = 5)
# A tibble: 72 × 9
  wflow_id     .config .metric   mean std_err     n preprocessor model
  <chr>        <chr>   <chr>    <dbl>   <dbl> <int> <chr>        <chr>
1 recipe_2_bo… Prepro… rmse    0.0559 1.15e-3    10 recipe       boos…
2 recipe_2_bo… Prepro… rmse    0.0565 9.01e-4    10 recipe       boos…
3 recipe_2_bo… Prepro… rmse    0.0598 1.56e-3    10 recipe       boos…
4 recipe_2_ra… Prepro… rmse    0.0670 1.36e-3    10 recipe       rand…
5 recipe_2_bo… Prepro… rmse    0.0684 1.81e-3    10 recipe       boos…
# ℹ 67 more rows
# ℹ 1 more variable: rank <int>

The best recipe-model combination is recipe 2, i.e. log_rec, paired with boost_tree_3. If you recall, boost_tree_3 is the basic xgboost model with no hyper-parameters marked for tuning. Let’s extract the hyper-parameters using select_best().

tune_param <-
  base_results %>% 
  extract_workflow_set_result("recipe_2_boost_tree_3") %>% 
  select_best(metric = "rmse")
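
For context, show_best() lists the top few candidates rather than just the single winner (optional).

# Top 3 candidates for the winning workflow, by rmse
base_results %>% 
  extract_workflow_set_result("recipe_2_boost_tree_3") %>% 
  show_best(metric = "rmse", n = 3)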

Finalize workflow and last fit

Let’s apply the parameters and finalize the workflow for the best recipe/model combination.

log_rec_xbg_boost_wflow <-
  workflow() %>% 
  add_recipe(log_rec) %>% 
  add_model(xgb_spec) %>% 
  finalize_workflow(tune_param)

Finally, we can fit the model by performing last_fit to the split data, data_split.

log_rec_xbg_boost_final_fit <-
  log_rec_xbg_boost_wflow %>% 
  last_fit(data_split)

Collect predictions

We can collect metrics as well as predictions from the last fitted model.

log_rec_xbg_boost_final_fit %>% 
  collect_metrics()
  .metric   .estimator   .estimate   .config
  rmse      standard         0.057   Preprocessor1_Model1
  rsq       standard         0.959   Preprocessor1_Model1

log_rec_xbg_boost_final_fit %>% 
  collect_predictions() %>% 
  ggplot(aes(x = price_median,
             y = .pred)) +
  geom_point(alpha = 0.2)+
  geom_abline(lty = 2,
              color = "dodgerblue") +
  labs(x = "Actual Median Price $psf",
       y = "Predicted Median Price $psf")+
  theme_bw()
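
We can also summarise the test-set errors numerically. Note that, if I understand the recipe handling correctly, both price_median and .pred here are on the log scale, since log_rec log-transforms the outcome.

# Numeric summary of test-set errors (on the recipe's transformed scale)
log_rec_xbg_boost_final_fit %>% 
  collect_predictions() %>% 
  mutate(resid = price_median - .pred) %>% 
  summarise(mae  = mean(abs(resid)),
            rmse = sqrt(mean(resid^2)))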

Visualize important features

We can use vip() to see which features were the most important.

log_rec_xbg_boost_final_fit %>% 
  extract_fit_parsnip() %>% 
  vip(geom = "point") +
  theme_bw()

As we can see, it’s location, location, location: in this case, region is the most important predictor of price, followed by age and distance to the MRT station.
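
If a table is preferred over the plot, vip’s companion function vi() should return the same importances as a tibble (assuming it dispatches on the extracted parsnip fit the same way vip() does).

# Variable importances as a tibble rather than a plot
log_rec_xbg_boost_final_fit %>% 
  extract_fit_parsnip() %>% 
  vip::vi()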