Set up the workspace: remove all existing objects from working memory, initialize the random number generator, turn off scientific notation for large numbers, and set a standard theme for plotting.
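The setup steps described above could look roughly like the following chunk. This is only a sketch: the seed value, the `scipen` value, and the choice of `theme_bw()` are assumptions, not settings taken from the original notebook.

```{r}
# Sketch of a setup chunk; seed, scipen value and theme are assumptions
library(tidyverse)   # read_csv(), dplyr verbs, ggplot2
library(tidymodels)  # parsnip, recipes, rsample, workflows, yardstick, tune

rm(list = ls())        # remove all objects from the workspace
set.seed(42)           # initialize the random number generator
options(scipen = 999)  # avoid scientific notation for large numbers
theme_set(theme_bw())  # default ggplot2 theme for all plots
```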
In India, there are about 20 million home loan (mortgage) aspirants working in the informal sector:
- Monthly income between INR 20,000-25,000 (\$ 325-400)
- Typically no formal accounts and documents (e.g., tax returns, income proofs, bank statements)
- Often use services of money lenders with interest rates between 30 and 60% per annum
Providing mortgages to this group of customers requires assessing their creditworthiness quickly and efficiently. Because formal documents and objective data are lacking, most financial institutions rely on interview-based processes to decide on these loan requests.
Strengths of the current process:
- Interview-based field assessment
- Relaxation of document requirements
Weaknesses of the current process:
- Costly (total transaction costs as high as 30% of loan volume)
- Subjective judgments; depends on individual skills and motivations
- Low reliability across branches and credit officers
- Risk of corruption and fraud
To develop a classification model, we have historical data on approx. 1,200 applications and the corresponding decisions made by credit officers. A detailed description of the dataset can be found in the file `variables.pdf`.
# Load Data
Read data from CSV file.
```{r}
data <- read_csv("micromortgages.csv")
```
# Explore Data
Before plotting our data, we have to do some minimal preprocessing (i.e., code the response variable as a factor).
```{r}
data$Decision <- as.factor(data$Decision)
```
Now, we can plot the distribution of the response variable `Decision`.
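A bar chart of class counts is one simple way to do this. The chunk below is a sketch on a small made-up data frame (`toy_decisions` is hypothetical, and the 0/1 coding of `Decision` is an assumption); with the real data, the same `ggplot()` call would be applied to `data`.

```{r}
library(ggplot2)

# Toy stand-in for the real data; values and the 0/1 coding are made up
toy_decisions <- data.frame(Decision = factor(c(0, 1, 1, 0, 1, 1, 1, 0)))

# Bar chart with one bar per class
decision_plot <- ggplot(toy_decisions, aes(x = Decision)) +
  geom_bar() +
  labs(x = "Decision", y = "Number of applications")
decision_plot
```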
The tidymodels framework is a collection of packages for modeling and machine learning using tidy principles.
Tidymodels defines a standard interface and workflow to:
1. Build a model using different algorithms and engines
2. Preprocess your data
3. Evaluate your model with resampling
4. Tune model hyperparameters
A very good getting-started guide for tidymodels is available at <https://www.tidymodels.org/start/>.
## 1. Build a model
Let's first look at how we can specify and fit a simple logistic regression model with `tidymodels`.
First, we specify the model we want (`logistic_reg`) and the engine to be used to (later) fit the model.
```{r}
logit_mod <- logistic_reg() %>%
  set_engine("glm")
```
Now, we actually `fit` the model to data. Here, we have to define the data and the predictor and response variables (using R's usual formula interface).
```{r}
logit_fit <- logit_mod %>%
  fit(Decision ~ Gender + TotInc, data = data)
logit_fit
```
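To inspect the coefficients of a fitted model as a data frame, we can use `tidy()`. The following sketch fits the same kind of model on simulated data; the object names (`toy_loans`, `toy_fit`), the seed, and the data-generating process are assumptions made purely for illustration.

```{r}
library(tidymodels)

set.seed(1)  # arbitrary seed for the simulated data
toy_n <- 200
toy_inc <- rnorm(toy_n, mean = 22000, sd = 3000)
toy_loans <- tibble(
  TotInc   = toy_inc,
  Gender   = factor(sample(c("F", "M"), toy_n, replace = TRUE)),
  # Approval probability increases with income (made-up relationship)
  Decision = factor(rbinom(toy_n, 1, plogis((toy_inc - 22000) / 3000)),
                    levels = c(0, 1))
)

# Same specification and formula as above, fitted to the toy data
toy_fit <- logistic_reg() %>%
  set_engine("glm") %>%
  fit(Decision ~ Gender + TotInc, data = toy_loans)

# One row per coefficient: estimate, standard error, z statistic, p-value
toy_coefs <- tidy(toy_fit)
toy_coefs
```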
With the `predict` function, we can apply the fitted model to make predictions on data.
```{r}
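# Predicted class probabilities and hard class labels
preds_prob <- predict(logit_fit, new_data = data, type = "prob")
preds_label <- predict(logit_fit, new_data = data, type = "class")
# Append the predictions to the original data
data_w_preds <- data %>%
  bind_cols(preds_prob, preds_label)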
```
There are multiple evaluation metrics, like `accuracy` and AUC (`roc_auc`).
```{r}
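data_w_preds %>%
  accuracy(truth = Decision, .pred_class)
# yardstick treats the first factor level as the event by default;
# .pred_1 belongs to the second level (assuming levels "0" and "1")
data_w_preds %>%
  roc_auc(truth = Decision, .pred_1, event_level = "second")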
```
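When computing AUC, it matters which factor level `yardstick` treats as the event: by default it is the first level. The following sketch uses a handful of hand-made predictions (the tibble `toy_preds` and all its values are hypothetical, and `Decision` is assumed to be coded 0/1) to show how the `event_level` argument changes the result.

```{r}
library(yardstick)
library(tibble)

# Hand-made predictions, purely illustrative
toy_preds <- tibble(
  Decision    = factor(c(0, 0, 0, 1, 1, 1), levels = c(0, 1)),
  .pred_1     = c(0.10, 0.40, 0.35, 0.80, 0.70, 0.30),
  .pred_class = factor(c(0, 0, 0, 1, 1, 0), levels = c(0, 1))
)

toy_acc <- accuracy(toy_preds, truth = Decision, estimate = .pred_class)

# Default: the first level ("0") is the event, so .pred_1 is the wrong score
toy_auc_first <- roc_auc(toy_preds, truth = Decision, .pred_1)
# Treat the second level ("1") as the event so that it matches .pred_1
toy_auc_second <- roc_auc(toy_preds, truth = Decision, .pred_1,
                          event_level = "second")

toy_acc
toy_auc_first
toy_auc_second
```

In this toy example the two AUC values add up to 1, which is a quick way to spot a mismatch between the event level and the probability column.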
## 2. Preprocess data
So far, the modeling process was not very different from the usual process. Now, we will look at some more advanced features of `tidymodels`. Perhaps the most important one is preprocessing the data and doing "feature engineering".
First, we will make a classic 80/20 train/test split.
```{r}
data_split <- initial_split(data, prop = .8)
train_data <- training(data_split)
test_data <- testing(data_split)
```
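If the classes are imbalanced, a stratified split keeps the class shares similar in both partitions. The chunk below is a sketch on made-up data; `toy_apps`, the 30/70 class ratio, and the seed are assumptions, and `strata` is an optional argument that the split above does not use.

```{r}
library(rsample)

set.seed(99)  # arbitrary seed
# Made-up data with a 30/70 class ratio
toy_apps <- data.frame(
  Decision = factor(rep(c(0, 1), times = c(30, 70))),
  TotInc   = seq(15000, 30000, length.out = 100)
)

# Sample 80% of the rows within each class
toy_split <- initial_split(toy_apps, prop = 0.8, strata = Decision)
toy_strat_train <- training(toy_split)
toy_strat_test  <- testing(toy_split)

# Class shares should be close to 30/70 in both partitions
prop.table(table(toy_strat_train$Decision))
prop.table(table(toy_strat_test$Decision))
```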
Next, we create a so-called `recipe` to define a series of preprocessing steps. Note that the `data` passed to `recipe` does not need to be the complete data that will be used to train the steps. The recipe only needs to know the names and types of data that will be used. For large data sets, `head` could be used to pass the recipe a smaller data set to save time and memory.
```{r}
prep_recipe <-
  recipe(Decision ~ Gender + TotInc, data = train_data)
```
```{r}
summary(prep_recipe)
```
Let's add some actual steps to the recipe. We first normalize all numerical variables (`step_normalize`), then dummy code (`step_dummy`) all nominal variables except the outcome variable, and finally remove predictors with almost no variance (`step_nzv`).
```{r}
prep_recipe <-
  recipe(Decision ~ Gender + TotInc, data = train_data) %>%
  step_normalize(all_numeric()) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_nzv(all_predictors())
```
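A recipe only describes the preprocessing; nothing is computed until it is trained. To see what the steps actually do, we can train a recipe with `prep()` and apply it with `bake()`. The sketch below uses a tiny made-up training set (`toy_train` and its values are hypothetical) and leaves out `step_nzv` for brevity.

```{r}
library(tidymodels)

# Toy data with the same column names as above; values are made up
toy_train <- tibble(
  Decision = factor(c(0, 1, 1, 0, 1, 0)),
  Gender   = factor(c("F", "M", "M", "F", "M", "F")),
  TotInc   = c(18000, 25000, 22000, 20000, 24000, 21000)
)

toy_recipe <- recipe(Decision ~ Gender + TotInc, data = toy_train) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())

# prep() estimates means, standard deviations and factor levels from the data;
# bake(new_data = NULL) returns the preprocessed training set
toy_baked <- toy_recipe %>%
  prep() %>%
  bake(new_data = NULL)
toy_baked
```

In the baked data, `TotInc` has mean 0 and standard deviation 1, and `Gender` has been replaced by a 0/1 indicator column.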
A `workflow` allows us to glue the model and the recipe together...
```{r}
microcredit_wflow <- workflow() %>%
  add_model(logit_mod) %>%
  add_recipe(prep_recipe)
microcredit_wflow
```
... and `fit` the result on some data.
```{r}
microcredit_fit <-
  microcredit_wflow %>%
  fit(data = train_data)
```
```{r}
microcredit_fit
```
We can now use the fitted workflow as usual, this time to make predictions on the test data.
```{r}
preds_prob <- predict(microcredit_fit, new_data = test_data, type = "prob")
preds_label <- predict(microcredit_fit, new_data = test_data, type = "class")
test_w_preds <-
  test_data %>%
  bind_cols(preds_prob, preds_label)
```
```{r}
test_w_preds %>%
  accuracy(truth = Decision, .pred_class)
# .pred_1 belongs to the second factor level (assuming levels "0" and "1")
test_w_preds %>%
  roc_auc(truth = Decision, .pred_1, event_level = "second")
```
## 3. Evaluation with Resampling
Using a single train/test split can lead to unreliable performance estimates. Hence, we usually perform cross-validation. The figure below illustrates our overall data splitting strategy.

We can use the `vfold_cv` function on the training data to implement the above strategy. Note that other resampling strategies are available in `tidymodels`.
```{r}
folds <- vfold_cv(train_data, v = 10)
folds$splits
```
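As a sketch of some of the other resampling strategies in `rsample`, the chunk below creates repeated stratified cross-validation, bootstrap, and Monte Carlo cross-validation resamples. The data frame `toy_df`, the seed, and all argument values are assumptions chosen for illustration.

```{r}
library(rsample)

set.seed(123)  # arbitrary seed
toy_df <- data.frame(id = 1:50, y = factor(rep(c(0, 1), each = 25)))

# 5-fold cross-validation, repeated twice and stratified by the outcome
toy_repeated <- vfold_cv(toy_df, v = 5, repeats = 2, strata = y)
# 20 bootstrap resamples (sampling with replacement)
toy_boot <- bootstraps(toy_df, times = 20)
# Monte Carlo cross-validation: 10 random 80/20 splits
toy_mc <- mc_cv(toy_df, prop = 0.8, times = 10)

# Each object has one row per resample
nrow(toy_repeated)
nrow(toy_boot)
nrow(toy_mc)
```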
You can also look at the actual `analysis` and `assessment` sets for each fold.
```{r}
folds$splits[[1]] %>% analysis()
folds$splits[[1]] %>% assessment()
```
We can now fit and evaluate our workflow on all resamples with a single call to the `fit_resamples` function. Note that the steps in the recipe are also "trained" separately on the analysis set of each fold, so no information from the assessment set leaks into the preprocessing.
```{r}
microcredit_fit_rs <- microcredit_wflow %>%
  fit_resamples(folds)
```
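By default, `fit_resamples` computes a standard set of metrics for classification. If we want a different set, we can pass a `metric_set` via the `metrics` argument. The following is a sketch on simulated data; the objects `toy_sim`, `toy_wflow` and `toy_rs`, the seed, and the chosen metrics are assumptions for illustration only.

```{r}
library(tidymodels)

set.seed(2024)  # arbitrary seed
toy_size <- 200
toy_income <- rnorm(toy_size, mean = 22000, sd = 3000)
toy_sim <- tibble(
  TotInc   = toy_income,
  Gender   = factor(sample(c("F", "M"), toy_size, replace = TRUE)),
  # Approval probability increases with income (made-up relationship)
  Decision = factor(rbinom(toy_size, 1, plogis((toy_income - 22000) / 3000)),
                    levels = c(0, 1))
)

# Recipe and model glued together, as in the workflow above
toy_wflow <- workflow() %>%
  add_model(logistic_reg() %>% set_engine("glm")) %>%
  add_recipe(
    recipe(Decision ~ Gender + TotInc, data = toy_sim) %>%
      step_normalize(all_numeric_predictors()) %>%
      step_dummy(all_nominal_predictors())
  )

# Fit on 5 folds and compute a custom set of metrics per fold
toy_rs <- fit_resamples(
  toy_wflow,
  resamples = vfold_cv(toy_sim, v = 5),
  metrics = metric_set(accuracy, roc_auc, sensitivity, specificity)
)
toy_metrics <- collect_metrics(toy_rs)
toy_metrics
```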
We can inspect the results of every single fold...
```{r}
microcredit_fit_rs$.metrics
```
... or aggregate the metrics over all folds.
```{r}
collect_metrics(microcredit_fit_rs)
```
When comparing the cross-validation results with the results on the test set, we can see a difference: the resampling estimates are averaged over all folds and are usually more conservative.
```{r}
test_w_preds %>%
  accuracy(truth = Decision, .pred_class)
# .pred_1 belongs to the second factor level (assuming levels "0" and "1")
test_w_preds %>%
  roc_auc(truth = Decision, .pred_1, event_level = "second")
```
The real beauty of `tidymodels` is that we can very easily re-run the whole experiment with a different model, such as k-nearest neighbors (kNN).
```{r}
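# Specify a kNN classifier (requires the kknn package to be installed)
knn_model <- nearest_neighbor() %>%
  set_engine("kknn") %>%
  set_mode("classification")
# Swap the model in the existing workflow; the recipe stays the same
microcredit_wflow <- microcredit_wflow %>%
  update_model(knn_model)
# Re-run the cross-validation on the same folds
microcredit_fit_rs <- microcredit_wflow %>%
  fit_resamples(folds)
collect_metrics(microcredit_fit_rs)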
```
## 4. Hyperparameter Tuning
We will get back to this topic once we have learned about models with hyperparameters.