---
title: "Data Science for Business - Week 05: Scoring Micro-Mortgages"
author: "Oliver Mueller"
output:
  html_notebook: default
  html_document:
    toc: yes
    number_sections: yes
editor_options:
  markdown:
    wrap: 72
---
# Initialize Notebook
Load required packages.
```{r load packages, warning=FALSE, message=FALSE}
library(tidyverse)
library(ggthemr)
library(gmodels)
library(tidymodels)
library(stargazer)
```
Set up the workspace: remove all existing data from working memory, initialize the random number generator, turn off scientific notation for large numbers, and set a standard theme for plotting.
```{r setup}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE)
rm(list=ls())
set.seed(42)
options(scipen=10000)
ggthemr('fresh')
options(yardstick.event_first = FALSE)
```
# Problem Description
In India, there are about 20 million home loan (mortgage) aspirants working in the informal sector:
- Monthly income between INR 20,000-25,000 (\$ 325-400)
- Typically no formal accounts and documents (e.g., tax returns, income proofs, bank statements)
- Often use services of money lenders with interest rates between 30 and 60% per annum
Providing mortgages to this group of customers requires quickly and efficiently assessing their creditworthiness. Due to the lack of formal documents and objective data, most financial institutions rely on interview-based processes to decide on these loan requests.
Strengths of the current process:
- Interview-based field assessment
- Relaxation of document requirements
Weaknesses of the current process:
- Costly (total transaction costs as high as 30% of loan volume)
- Subjective judgments; depends on individual skills and motivations
- Low reliability across branches and credit officers
- Risk of corruption and fraud
To develop a classification model, we have historical data on approx. 1,200 applications and the corresponding decisions made by credit officers. A detailed description of the dataset can be found in the file `variables.pdf`.
# Load Data
Read data from CSV file.
```{r}
data <- read_csv("micromortgages.csv")
```
# Explore Data
Before plotting our data, we have to do some minimal preprocessing (i.e., code the response variable as a factor).
```{r}
data$Decision <- as.factor(data$Decision)
```
Now, we can plot the distribution of the response variable `Decision`.
```{r}
ggplot(data = data) +
  geom_bar(mapping = aes(x = Decision))
```
Create a boxplot for visualizing the relationship between a continuous variable (here: `TotInc`) and the response.
```{r}
ggplot(data = data) +
  geom_boxplot(mapping = aes(y = TotInc, x = Decision))
```
Create a crosstable between a categorical variable (here: `Gender`) and the response.
```{r}
CrossTable(data$Decision, data$Gender,
           prop.r = FALSE, prop.c = TRUE, prop.t = FALSE, prop.chisq = FALSE)
```
# Tidymodels
The tidymodels framework is a collection of packages for modeling and machine learning using tidy principles.
Tidymodels defines a standard interface and workflow to:
1. Build a model using different algorithms and engines
2. Preprocess your data
3. Evaluate your model with resampling
4. Tune model hyperparameters
A very good getting-started guide for tidymodels can be found at <https://www.tidymodels.org/start/>.
## 1. Build a model
Let's first look at how we can specify and fit a simple logistic regression model with `tidymodels`.
First, we specify the type of model we want (`logistic_reg`) and the engine that will (later) be used to fit the model.
```{r}
logit_mod <- logistic_reg() %>%
  set_engine("glm")
```
Now, we actually `fit` the model to data. Here, we have to define the data and the predictor and response variables (using R's usual formula interface).
```{r}
logit_fit <- logit_mod %>%
  fit(Decision ~ Gender + TotInc, data = data)
logit_fit
```
With the `predict` function, we can apply the fitted model to make predictions on data.
```{r}
preds_prob <- predict(logit_fit, new_data = data, type = "prob")
preds_label <- predict(logit_fit, new_data = data, type = "class")
data_w_preds <- data %>%
  bind_cols(preds_prob, preds_label)
```
There are multiple evaluation metrics available, such as accuracy (`accuracy`) and AUC (`roc_auc`).
```{r}
data_w_preds %>%
  accuracy(truth = Decision, .pred_class)
data_w_preds %>%
  roc_auc(truth = Decision, .pred_1)
```
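If you want to compute several metrics at once, `yardstick`'s `metric_set` function can bundle them. A small sketch using the two metrics from above (probability columns such as `.pred_1` are passed via `...`):
```{r}
# Bundle accuracy and AUC into a single metric function
class_metrics <- metric_set(accuracy, roc_auc)
data_w_preds %>%
  class_metrics(truth = Decision, estimate = .pred_class, .pred_1)
```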
## 2. Preprocess data
So far, the modeling process was not very different from the usual process. Now, we will use some more advanced features of `tidymodels`. Maybe the most important part is preprocessing the data and doing "feature engineering".
First, we will make a classic 80/20 train/test split.
```{r}
data_split <- initial_split(data, prop = .8)
train_data <- training(data_split)
test_data <- testing(data_split)
```
Next, we create a so-called `recipe` to define a series of preprocessing steps. Note that the `data` passed to `recipe` does not need to be the complete data that will be used to train the steps. The recipe only needs to know the names and types of data that will be used. For large data sets, `head` could be used to pass the recipe a smaller data set to save time and memory.
```{r}
prep_recipe <- recipe(Decision ~ Gender + TotInc, data = train_data)
```
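As a small illustration of the `head` idea mentioned above, the same recipe could be defined on just a few rows:
```{r}
# Sketch: the recipe only needs column names and types, so a small
# slice of the training data is sufficient to define it
prep_recipe_small <- recipe(Decision ~ Gender + TotInc, data = head(train_data))
```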
```{r}
summary(prep_recipe)
```
Let's add some actual steps to the recipe. We first normalize all numeric variables (`step_normalize`), then dummy-code all nominal variables except for the outcome variable (`step_dummy`), and finally remove predictors with near-zero variance (`step_nzv`).
```{r}
prep_recipe <- recipe(Decision ~ Gender + TotInc, data = train_data) %>%
  step_normalize(all_numeric()) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_nzv(all_predictors())
```
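To check what these steps will actually do, we can `prep` the recipe on the training data and `bake` it. A minimal sketch; `bake(new_data = NULL)` returns the processed training set:
```{r}
# Estimate the preprocessing steps on the training data and inspect the result
prep_recipe %>%
  prep(training = train_data) %>%
  bake(new_data = NULL) %>%
  glimpse()
```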
A `workflow` allows us to glue the model and the recipe together...
```{r}
microcredit_wflow <- workflow() %>%
  add_model(logit_mod) %>%
  add_recipe(prep_recipe)
microcredit_wflow
```
... and `fit` the result on some data.
```{r}
microcredit_fit <- microcredit_wflow %>%
  fit(data = train_data)
```
```{r}
microcredit_fit
```
We can now use the fitted model like usual.
```{r}
preds_prob <- predict(microcredit_fit, new_data = test_data, type = "prob")
preds_label <- predict(microcredit_fit, new_data = test_data, type = "class")
test_w_preds <- test_data %>%
  bind_cols(preds_prob, preds_label)
```
```{r}
test_w_preds %>%
  accuracy(truth = Decision, .pred_class)
test_w_preds %>%
  roc_auc(truth = Decision, .pred_1)
```
## 3. Evaluation with Resampling
Using a single train/test split can lead to unreliable performance estimates. Hence, we usually perform cross-validation. The figure below illustrates our overall data splitting strategy.
![](resampling.png)
We can use the `vfold_cv` function on the training data to implement the above strategy. Note that other resampling strategies are available in `tidymodels`.
```{r}
folds <- vfold_cv(train_data, v = 10)
folds$splits
```
You can also look at the actual `analysis` and `assessment` sets for each fold.
```{r}
folds$splits[[1]] %>% analysis()
folds$splits[[1]] %>% assessment()
```
We can now fit and evaluate our workflow on all resamples with a single call to the `fit_resamples` function. Note that the recipe steps, too, are "trained" on the individual folds.
```{r}
microcredit_fit_rs <- microcredit_wflow %>%
  fit_resamples(folds)
```
We can inspect the results of every single fold...
```{r}
microcredit_fit_rs$.metrics
```
... or aggregate the metrics over all folds.
```{r}
collect_metrics(microcredit_fit_rs)
```
When comparing the results of the cross-validation procedure with the results on the test set, we can see a clear difference; the resampling results are usually more conservative.
```{r}
test_w_preds %>%
  accuracy(truth = Decision, .pred_class)
test_w_preds %>%
  roc_auc(truth = Decision, .pred_1)
```
The real beauty of `tidymodels` is that we can very easily re-run the whole experiment with a different model, such as k-nearest neighbors (knn).
```{r}
knn_model <- nearest_neighbor() %>%
  set_engine("kknn") %>%
  set_mode("classification")
microcredit_wflow <- microcredit_wflow %>%
  update_model(knn_model)
microcredit_fit_rs <- microcredit_wflow %>%
  fit_resamples(folds)
collect_metrics(microcredit_fit_rs)
```
## 4. Hyperparameter Tuning
We will get back to this topic once we have learned about models with hyperparameters...
---
title: "Data Science for Business - Week 05: Scoring Micro-Mortgages"
author: "Oliver Mueller"
output: html_notebook
editor_options:
  markdown:
    wrap: 72
---
## Initialize notebook
Load required packages.
```{r load packages, warning=FALSE, message=FALSE}
library(tidyverse)
theme_set(theme_classic())
library(tidymodels)
```
Set up the workspace: remove all existing data from working memory and
initialize the random number generator with a fixed seed.
```{r setup}
rm(list=ls())
set.seed(42)
```
## Problem description
In India, there are about 20 million home loan (mortgage) aspirants
working in the informal sector:

- Monthly income between INR 20,000-25,000 (\$ 325-400)
- Typically no formal accounts and documents (e.g., tax returns, income proofs, bank statements)
- Often use services of money lenders with interest rates between 30 and 60% per annum
Providing mortgages to this group of customers requires quickly and
efficiently assessing their creditworthiness. Due to the lack of formal
documents and objective data, most financial institutions rely on
interview-based processes to decide on these loan requests:
Strengths of the current process:
- Interview-based field assessment
- Relaxation of document requirements
Weaknesses of the current process:
- Costly (total transaction costs as high as 30% of loan volume)
- Subjective judgments; depends on individual skills and motivations
- Low reliability across branches and credit officers
- Risk of corruption and fraud
## Data
Read data from CSV file.
```{r}
data <- read_csv("micromortgages.csv")
```
Make initial train/test split.
```{r}
split <- initial_split(data, strata = Decision)
train <- training(split)
test <- testing(split)
```
## Model, recipe, and workflow
Define two models, logistic regression and knn, with *tidymodels*. Note
that these are just model specifications; no actual model fitting is
performed at this point.
```{r}
model_spec_logit <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")
```
```{r}
model_spec_nn <- nearest_neighbor(neighbors = tune()) %>%
  set_engine("kknn") %>%
  set_mode("classification")
```
Next, we define a recipe for data preprocessing. The recipe is also just
a specification of steps that should be performed; they are not executed
here.
```{r}
rec <- recipe(Decision ~ ., data = train) %>%
  step_mutate(Decision = as.factor(Decision)) %>%
  step_scale(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())
```
Sometimes it's difficult to track and trace what a recipe actually
does. With the `bake` function, we can execute the recipe on some data
and peek into the results.
```{r}
train_baked <- bake(prep(rec), new_data = train)
```
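For example, we can peek at the baked training data with `glimpse`:
```{r}
# Inspect the preprocessed (baked) training data
glimpse(train_baked)
```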
Finally, we can combine the model specification and recipe to a
workflow.
```{r}
wkf_logit <- workflow() %>%
  add_recipe(rec) %>%
  add_model(model_spec_logit)
```
```{r}
wkf_nn <- workflow() %>%
  add_recipe(rec) %>%
  add_model(model_spec_nn)
```
## Tuning and resampling
We are now (almost) ready to apply our workflows to the training data.
But recall that one of our learners, the knn learner, has a
hyperparameter that can be tuned. For tuning, we will define a search
grid and perform k-fold cross-validation to systematically try out
different values for the hyperparameter.
Specify the cross-validation strategy.
```{r}
folds <- vfold_cv(train, v = 5, strata = Decision)
```
Set up a random search grid for hyperparameter tuning.
```{r}
grid_nn <- grid_random(
  neighbors(),
  size = 3
)
grid_nn
```
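Alternatively, an evenly spaced grid could be used. A quick sketch with `grid_regular` (also from the `dials` package):
```{r}
# Sketch: a regular grid over the default range of the neighbors() parameter
grid_regular(neighbors(), levels = 5)
```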
Perform the actual hyperparameter tuning.
```{r}
res_nn <- tune_grid(
  wkf_nn,
  resamples = folds,
  grid = grid_nn,
  control = control_grid(save_pred = TRUE)
)
```
Our logistic regression model does not have any hyperparameters;
nonetheless, we can fit it with cross-validation to get a sense of the
variance in its predictive performance.
```{r}
res_logit <- tune_grid(
  wkf_logit,
  resamples = folds,
  control = control_grid(save_pred = TRUE)
)
```
## Results of Hyperparameter Tuning
Show performance metrics for the different knn models.
```{r}
collect_metrics(res_nn)
```
Which hyperparameters are best?
```{r}
show_best(res_nn, metric = "roc_auc")
```
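The tuning results can also be visualized; `tune` provides an `autoplot` method for this:
```{r}
# Plot performance metrics across the tried values of `neighbors`
autoplot(res_nn)
```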
## Last Fit
Now that we know the best hyperparameters for the knn model, we refit it
on the complete training data (so that we don't waste 20% of our data
due to cross-validation).
Get parameters with best AUC.
```{r}
best_auc_nn <- select_best(res_nn, metric = "roc_auc")
best_auc_nn
```
Pass these parameters to the workflow.
```{r}
wkf_nn_final <- finalize_workflow(
  wkf_nn,
  best_auc_nn
)
wkf_nn_final
```
And do a final fit of the workflow.
```{r}
wkf_nn_final_fit <- last_fit(wkf_nn_final, split)
collect_metrics(wkf_nn_final_fit)
```
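Besides metrics, the object returned by `last_fit` also contains the test-set predictions and the workflow fitted on the complete training data. A small sketch:
```{r}
# Test-set predictions made by last_fit()
collect_predictions(wkf_nn_final_fit)
# Extract the fitted workflow, e.g., for saving or later reuse
final_wkf_nn <- extract_workflow(wkf_nn_final_fit)
```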
And let's not forget about our logistic regression model. Here are
accuracy and AUC for this model.
```{r}
collect_metrics(res_logit)
```
Wow!