---
title: "Data Science for Business - Week 05: Scoring Micro-Mortgages"
author: "Oliver Mueller"
output:
  html_notebook: default
  html_document:
    toc: yes
    number_sections: yes
editor_options:
  markdown:
    wrap: 72
---
# Initialize Notebook
Load required packages.
```{r load packages, warning=FALSE, message=FALSE}
library(tidyverse)
library(ggthemr)
library(gmodels)
library(tidymodels)
library(stargazer)
```
Set up the workspace: remove all existing data from working memory, initialize the random number generator, turn off scientific notation for large numbers, and set a standard theme for plotting.
```{r setup}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE)
rm(list=ls())
set.seed(42)
options(scipen=10000)
ggthemr('fresh')
options(yardstick.event_first = FALSE)
```
# Problem Description
In India, there are about 20 million home loan (mortgage) aspirants working in the informal sector:
- Monthly income between INR 20,000-25,000 (\$ 325-400)
- Typically no formal accounts and documents (e.g., tax returns, income proofs, bank statements)
- Often use services of money lenders with interest rates between 30 and 60% per annum
Providing mortgages to this group of customers requires quickly and efficiently assessing their creditworthiness. Due to the lack of formal documents and objective data, most financial institutions rely on interview-based processes to decide on these loan requests.
Strengths of the current process:
- Interview-based field assessment
- Relaxation of document requirements
Weaknesses of the current process:
- Costly (total transaction costs as high as 30% of loan volume)
- Subjective judgments; depends on individual skills and motivations
- Low reliability across branches and credit officers
- Risk of corruption and fraud
To develop a classification model, we have historical data on approx. 1,200 applications and the corresponding decisions made by credit officers. A detailed description of the dataset can be found in the file `variables.pdf`.
# Load Data
Read data from CSV file.
```{r}
data <- read_csv("micromortgages.csv")
```
# Explore Data
Before plotting our data, we have to do some minimal preprocessing (i.e., code the response variable as a factor).
```{r}
data$Decision <- as.factor(data$Decision)
```
Now, we can plot the distribution of the response variable `Decision`.
```{r}
ggplot(data = data) +
  geom_bar(mapping = aes(x = Decision))
```
Create a boxplot for visualizing the relationship between a continuous variable (here: `TotInc`) and the response.
```{r}
ggplot(data = data) +
  geom_boxplot(mapping = aes(y = TotInc, x = Decision))
```
Create a crosstable between a categorical variable (here: `Gender`) and the response.
```{r}
CrossTable(data$Decision, data$Gender,
           prop.r = FALSE, prop.c = TRUE, prop.t = FALSE, prop.chisq = FALSE)
```
# Tidymodels
The tidymodels framework is a collection of packages for modeling and machine learning using tidy principles.
Tidymodels defines a standard interface and workflow to:
1. Build a model using different algorithms and engines
2. Preprocess your data
3. Evaluate your model with resampling
4. Tune model hyperparameters
A very good getting-started guide for tidymodels can be found at <https://www.tidymodels.org/start/>.
## 1. Build a model
Let's first look at how we can specify and fit a simple logistic regression model with `tidymodels`.
First, we specify the type of model we want (`logistic_reg`) and the engine that will (later) be used to fit the model.
```{r}
logit_mod <- logistic_reg() %>%
  set_engine("glm")
```
Now, we actually `fit` the model to data. Here, we have to define the data and the predictor and response variables (using R's usual formula interface).
```{r}
logit_fit <- logit_mod %>%
  fit(Decision ~ Gender + TotInc, data = data)
logit_fit
```
With the `predict` function, we can apply the fitted model to make predictions on data.
```{r}
preds_prob <- predict(logit_fit, new_data = data, type = "prob")
preds_label <- predict(logit_fit, new_data = data, type = "class")
data_w_preds <- data %>%
  bind_cols(preds_prob, preds_label)
```
There are multiple evaluation metrics available, such as accuracy (`accuracy`) and AUC (`roc_auc`).
```{r}
data_w_preds %>%
  accuracy(truth = Decision, .pred_class)
data_w_preds %>%
  roc_auc(truth = Decision, .pred_1)
```
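If you want to compute several metrics at once, `yardstick`'s `metric_set` function can bundle them. A small sketch using the two metrics from above (probability columns such as `.pred_1` are passed via `...`):
```{r}
# Bundle accuracy and AUC into a single metric function
class_metrics <- metric_set(accuracy, roc_auc)
data_w_preds %>%
  class_metrics(truth = Decision, estimate = .pred_class, .pred_1)
```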
## 2. Preprocess data
So far, the modeling process was not very different from the usual process. Now, we will use some more advanced features of `tidymodels`. Maybe the most important part is preprocessing the data and doing "feature engineering".
First, we will make a classic 80/20 train/test split.
```{r}
data_split <- initial_split(data, prop = .8)
train_data <- training(data_split)
test_data <- testing(data_split)
```
Next, we create a so-called `recipe` to define a series of preprocessing steps. Note that the `data` passed to `recipe` does not need to be the complete data that will be used to train the steps. The recipe only needs to know the names and types of data that will be used. For large data sets, `head` could be used to pass the recipe a smaller data set to save time and memory.
```{r}
prep_recipe <- recipe(Decision ~ Gender + TotInc, data = train_data)
```
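As a small illustration of the `head` idea mentioned above, the same recipe could be defined on just a few rows:
```{r}
# Sketch: the recipe only needs column names and types, so a small
# slice of the training data is sufficient to define it
prep_recipe_small <- recipe(Decision ~ Gender + TotInc, data = head(train_data))
```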
```{r}
summary(prep_recipe)
```
Let's add some actual steps to the recipe. We first normalize all numeric variables (`step_normalize`), then dummy-code all nominal variables except for the outcome variable (`step_dummy`), and finally remove predictors with near-zero variance (`step_nzv`).
```{r}
prep_recipe <- recipe(Decision ~ Gender + TotInc, data = train_data) %>%
  step_normalize(all_numeric()) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_nzv(all_predictors())
```
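To check what these steps will actually do, we can `prep` the recipe on the training data and `bake` it. A minimal sketch; `bake(new_data = NULL)` returns the processed training set:
```{r}
# Estimate the preprocessing steps on the training data and inspect the result
prep_recipe %>%
  prep(training = train_data) %>%
  bake(new_data = NULL) %>%
  glimpse()
```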
A `workflow` allows us to glue the model and the recipe together...
```{r}
microcredit_wflow <- workflow() %>%
  add_model(logit_mod) %>%
  add_recipe(prep_recipe)
microcredit_wflow
```
... and `fit` the result on some data.
```{r}
microcredit_fit <- microcredit_wflow %>%
  fit(data = train_data)
```
```{r}
microcredit_fit
```
We can now use the fitted model like usual.
```{r}
preds_prob <- predict(microcredit_fit, new_data = test_data, type = "prob")
preds_label <- predict(microcredit_fit, new_data = test_data, type = "class")
test_w_preds <- test_data %>%
  bind_cols(preds_prob, preds_label)
```
```{r}
test_w_preds %>%
  accuracy(truth = Decision, .pred_class)
test_w_preds %>%
  roc_auc(truth = Decision, .pred_1)
```
## 3. Evaluation with Resampling
Using a single train/test split can lead to unreliable performance estimates. Hence, we usually perform cross-validation. The figure below illustrates our overall data splitting strategy.
![](resampling.png)
We can use the `vfold_cv` function on the training data to implement the above strategy. Note that other resampling strategies are available in `tidymodels`.
```{r}
folds <- vfold_cv(train_data, v = 10)
folds$splits
```
You can also look at the actual `analysis` and `assessment` sets for each fold.
```{r}
folds$splits[[1]] %>% analysis()
folds$splits[[1]] %>% assessment()
```
We can now fit and evaluate our workflow on all resamples with a single call to the `fit_resamples` function. Note that the recipe steps, too, are "trained" on the individual folds.
```{r}
microcredit_fit_rs <- microcredit_wflow %>%
  fit_resamples(folds)
```
We can inspect the results of every single fold...
```{r}
microcredit_fit_rs$.metrics
```
... or aggregate the metrics over all folds.
```{r}
collect_metrics(microcredit_fit_rs)
```
When comparing the results of the cross-validation procedure with the results on the test set, we can see a clear difference; the resampling results are usually more conservative.
```{r}
test_w_preds %>%
  accuracy(truth = Decision, .pred_class)
test_w_preds %>%
  roc_auc(truth = Decision, .pred_1)
```
The real beauty of `tidymodels` is that we can very easily re-run the whole experiment with a different model, such as k-nearest neighbors (knn).
```{r}
knn_model <- nearest_neighbor() %>%
  set_engine("kknn") %>%
  set_mode("classification")
microcredit_wflow <- microcredit_wflow %>%
  update_model(knn_model)
microcredit_fit_rs <- microcredit_wflow %>%
  fit_resamples(folds)
collect_metrics(microcredit_fit_rs)
```
## 4. Hyperparameter Tuning
We will get back to this topic once we have learned about models with hyperparameters...
---
title: "Data Science for Business - Week 05: Scoring Micro-Mortgages"
author: "Oliver Mueller"
output: html_notebook
editor_options:
  markdown:
    wrap: 72
---
## Initialize notebook
Load required packages.
```{r load packages, warning=FALSE, message=FALSE}
library(tidyverse)
theme_set(theme_classic())
library(tidymodels)
```
Set up the workspace: remove all existing data from working memory and
initialize the random number generator with a fixed seed.
```{r setup}
rm(list=ls())
set.seed(42)
```
## Problem description
In India, there are about 20 million home loan (mortgage) aspirants
working in the informal sector:

- Monthly income between INR 20,000-25,000 (\$ 325-400)
- Typically no formal accounts and documents (e.g., tax returns, income proofs, bank statements)
- Often use services of money lenders with interest rates between 30 and 60% per annum
Providing mortgages to this group of customers requires quickly and
efficiently assessing their creditworthiness. Due to the lack of formal
documents and objective data, most financial institutions rely on
interview-based processes to decide on these loan requests:
Strengths of the current process:
- Interview-based field assessment
- Relaxation of document requirements
Weaknesses of the current process:
- Costly (total transaction costs as high as 30% of loan volume)
- Subjective judgments; depends on individual skills and motivations
- Low reliability across branches and credit officers
- Risk of corruption and fraud
## Data
Read data from CSV file.
```{r}
data <- read_csv("micromortgages.csv")
```
Make initial train/test split.
```{r}
split <- initial_split(data, strata = Decision)
train <- training(split)
test <- testing(split)
```
## Model, recipe, and workflow
Define two models, logistic regression and knn, with *tidymodels*. Note
that these are just model specifications; no actual model fitting is
performed at this point.
```{r}
model_spec_logit <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")
```
```{r}
model_spec_nn <- nearest_neighbor(neighbors = tune()) %>%
  set_engine("kknn") %>%
  set_mode("classification")
```
Next, we define a recipe for data preprocessing. The recipe is also just
a specification of steps that should be performed; they are not executed
here.
```{r}
rec <- recipe(Decision ~ ., data = train) %>%
  step_mutate(Decision = as.factor(Decision)) %>%
  step_scale(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())
```
Sometimes it's difficult to track and trace what a recipe actually
does. With the `bake` function, we can execute the recipe on some data
and peek into the results.
```{r}
train_baked <- bake(prep(rec), new_data = train)
```
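For example, we can peek at the baked training data with `glimpse`:
```{r}
# Inspect the preprocessed (baked) training data
glimpse(train_baked)
```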
Finally, we can combine the model specification and recipe to a
workflow.
```{r}
wkf_logit <- workflow() %>%
  add_recipe(rec) %>%
  add_model(model_spec_logit)
```
```{r}
wkf_nn <- workflow() %>%
  add_recipe(rec) %>%
  add_model(model_spec_nn)
```
## Tuning and resampling
We are now (almost) ready to apply our workflows to the training data.
But recall that one of our learners, the knn learner, has a
hyperparameter that can be tuned. For tuning, we will define a search
grid and perform k-fold cross-validation to systematically try out
different values for the hyperparameter.
Specify the cross-validation strategy.
```{r}
folds <- vfold_cv(train, v = 5, strata = Decision)
```
Set up a random search grid for hyperparameter tuning.
```{r}
grid_nn <- grid_random(
  neighbors(),
  size = 3
)
grid_nn
```
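Alternatively, an evenly spaced grid could be used. A quick sketch with `grid_regular` (also from the `dials` package):
```{r}
# Sketch: a regular grid over the default range of the neighbors() parameter
grid_regular(neighbors(), levels = 5)
```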
Perform the actual hyperparameter tuning.
```{r}
res_nn <- tune_grid(
  wkf_nn,
  resamples = folds,
  grid = grid_nn,
  control = control_grid(save_pred = TRUE)
)
```
Our logistic regression model does not have any hyperparameters;
nonetheless, we can fit it with cross-validation to get a sense of the
variance in its predictive performance.
```{r}
res_logit <- tune_grid(
  wkf_logit,
  resamples = folds,
  control = control_grid(save_pred = TRUE)
)
```
## Results of Hyperparameter Tuning
Show performance metrics for the different knn models.
```{r}
collect_metrics(res_nn)
```
Which hyperparameters are best?
```{r}
show_best(res_nn, metric = "roc_auc")
```
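The tuning results can also be visualized; `tune` provides an `autoplot` method for this:
```{r}
# Plot performance metrics across the tried values of `neighbors`
autoplot(res_nn)
```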
## Last Fit
Now that we know the best hyperparameters for the knn model, we refit it
on the complete training data (so that we don't waste 20% of our data
due to cross-validation).
Get parameters with best AUC.
```{r}
best_auc_nn <- select_best(res_nn, metric = "roc_auc")
best_auc_nn
```
Pass these parameters to the workflow.
```{r}
wkf_nn_final <- finalize_workflow(
  wkf_nn,
  best_auc_nn
)
wkf_nn_final
```
And do a final fit of the workflow.
```{r}
wkf_nn_final_fit <- last_fit(wkf_nn_final, split)
collect_metrics(wkf_nn_final_fit)
```
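Besides metrics, the object returned by `last_fit` also contains the test-set predictions and the workflow fitted on the complete training data. A small sketch:
```{r}
# Test-set predictions made by last_fit()
collect_predictions(wkf_nn_final_fit)
# Extract the fitted workflow, e.g., for saving or later reuse
final_wkf_nn <- extract_workflow(wkf_nn_final_fit)
```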
And let's not forget about our logistic regression model. Here are
accuracy and AUC for this model.
```{r}
collect_metrics(res_logit)
```
Wow!