Load required packages.
library(tidyverse)
library(ggthemr)
library(caret)
library(ranger)
library(Metrics)
library(iml)
Set up the workspace, i.e., remove all existing data from working memory, initialize the random number generator, turn off scientific notation for large numbers, and set a standard theme for plotting. Finally, we activate parallel processing on multiple CPU cores.
rm(list=ls())
set.seed(42)
knitr::opts_chunk$set(echo = TRUE, warning = FALSE)
options(scipen=10000)
ggthemr('fresh')
doParallel::registerDoParallel(cores = 4)
Ask a home buyer to describe their dream house, and they probably won’t begin with the height of the basement ceiling or the proximity to an east-west railroad. But this dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence. With 76 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this dataset challenges you to predict the final price of each home. More: https://www.kaggle.com/c/house-prices-advanced-regression-techniques
Load data from CSV file.
data <- read_csv("ameshousing.csv")
Rows: 2930 Columns: 77
── Column specification ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (42): MSSubClass, MSZoning, Street, Alley, LotShape, LandContour, Utilities, LotConfig, LandSlope, Neighborhood, Condition1, Condition2, Bl...
dbl (35): LotFrontage, LotArea, OverallQual, OverallCond, YearBuilt, YearRemodAdd, MasVnrArea, BsmtFinSF1, BsmtFinSF2, BsmtUnfSF, TotalBsmtSF, ...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Fix some data type issues, i.e., turn character variables into factors and remove some irrelevant predictors.
data <- data %>%
mutate_if(is.character,as.factor) %>%
select(-MSSubClass, -MSZoning, -Exterior1st, -Exterior2nd, -SaleType, -SaleCondition, -MiscVal, -MoSold)
Create train/test split.
training_rows <- createDataPartition(y=data$SalePrice, p = 0.8, list = FALSE)
data_training <- slice(data, training_rows)
data_test <- slice(data, -training_rows)
Fit a linear model on all features.
lm_fit <- lm(SalePrice ~ ., data = data_training)
Display regression coefficients.
summary(lm_fit)
Call:
lm(formula = SalePrice ~ ., data = data_training)
Residuals:
Min 1Q Median 3Q Max
-406788 -9957 82 9436 179529
Coefficients: (7 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 131577.1307 823477.1409 0.160 0.873068
LotFrontage 21.1493 19.2855 1.097 0.272921
LotArea 0.6323 0.1044 6.059 0.0000000016150362 ***
StreetPave 17669.5855 9961.6567 1.774 0.076245 .
Alleynone 664.3902 3231.7793 0.206 0.837138
AlleyPave -1009.9021 4752.0618 -0.213 0.831723
LotShapeIR2 6831.6885 3676.5826 1.858 0.063283 .
LotShapeIR3 6309.4710 7700.1506 0.819 0.412652
LotShapeReg 2303.3460 1360.6998 1.693 0.090645 .
LandContourHLS 18860.8978 4241.3481 4.447 0.0000091568778017 ***
LandContourLow -626.4191 5340.9399 -0.117 0.906644
LandContourLvl 11579.7587 3026.4122 3.826 0.000134 ***
UtilitiesNoSeWa -21033.6294 26284.6161 -0.800 0.423669
UtilitiesNoSewr 8691.6846 18899.5029 0.460 0.645642
LotConfigCulDSac 6226.8024 2762.1547 2.254 0.024276 *
LotConfigFR2 -6808.8515 3486.7761 -1.953 0.050978 .
LotConfigFR3 -2140.7264 7739.4546 -0.277 0.782115
LotConfigInside 413.6771 1465.1482 0.282 0.777706
LandSlopeMod 7804.2392 3341.1421 2.336 0.019594 *
LandSlopeSev -31034.1293 10484.7794 -2.960 0.003111 **
NeighborhoodBlueste -5485.4272 11660.0476 -0.470 0.638084
NeighborhoodBrDale 2992.7338 8921.1018 0.335 0.737306
NeighborhoodBrkSide -14142.0168 7572.8645 -1.867 0.061974 .
NeighborhoodClearCr -13686.0612 8156.3040 -1.678 0.093498 .
NeighborhoodCollgCr -13819.5256 6415.7289 -2.154 0.031351 *
NeighborhoodCrawfor -636.4693 7223.6304 -0.088 0.929798
NeighborhoodEdwards -25393.1955 6875.4032 -3.693 0.000227 ***
NeighborhoodGilbert -16353.1014 6744.9275 -2.425 0.015412 *
NeighborhoodGreens 2018.6115 12355.1098 0.163 0.870233
NeighborhoodGrnHill 135468.7260 25809.4200 5.249 0.0000001681958060 ***
NeighborhoodIDOTRR -23153.1479 7648.0992 -3.027 0.002497 **
NeighborhoodLandmrk -3438.2105 25670.0615 -0.134 0.893464
NeighborhoodMeadowV -8747.9193 8307.0341 -1.053 0.292426
NeighborhoodMitchel -19888.1334 6933.5747 -2.868 0.004166 **
NeighborhoodNAmes -19753.4041 6786.1205 -2.911 0.003642 **
NeighborhoodNoRidge 29954.3669 7335.8275 4.083 0.0000460322345043 ***
NeighborhoodNPkVill 6201.8221 8705.8592 0.712 0.476311
NeighborhoodNridgHt 22499.3828 6642.5035 3.387 0.000719 ***
NeighborhoodNWAmes -21241.2745 6937.1943 -3.062 0.002226 **
NeighborhoodOldTown -25484.0482 7312.7278 -3.485 0.000502 ***
NeighborhoodSawyer -15200.5130 7001.7400 -2.171 0.030044 *
NeighborhoodSawyerW -14440.6336 6749.3701 -2.140 0.032504 *
NeighborhoodSomerst 9993.7009 6591.3260 1.516 0.129619
NeighborhoodStoneBr 27723.7535 7379.3163 3.757 0.000177 ***
NeighborhoodSWISU -17289.0401 8333.2353 -2.075 0.038133 *
NeighborhoodTimber -11170.0960 7138.1658 -1.565 0.117768
NeighborhoodVeenker -7217.5111 8833.9136 -0.817 0.414006
Condition1Feedr 1991.8526 4013.3951 0.496 0.619733
Condition1Norm 11266.3971 3340.7666 3.372 0.000758 ***
Condition1PosA 12330.6113 8100.3956 1.522 0.128101
Condition1PosN 23102.0009 5746.0530 4.020 0.0000600849749277 ***
Condition1RRAe -3763.9418 6704.7203 -0.561 0.574593
Condition1RRAn 6165.6291 5488.3071 1.123 0.261389
Condition1RRNe 4079.6186 13297.7044 0.307 0.759032
Condition1RRNn -2623.3735 12946.6537 -0.203 0.839444
Condition2Feedr -4248.8419 15564.9445 -0.273 0.784899
Condition2Norm -3459.8301 13383.5384 -0.259 0.796035
Condition2PosA 78630.5657 23461.0517 3.352 0.000818 ***
Condition2PosN -129662.8328 20498.3421 -6.326 0.0000000003063132 ***
Condition2RRAe -15203.8113 37809.1587 -0.402 0.687636
Condition2RRAn 711.0949 28502.2368 0.025 0.980098
Condition2RRNn -2967.0777 22542.8465 -0.132 0.895298
BldgType2fmCon -8547.9570 4550.9686 -1.878 0.060480 .
BldgTypeDuplex -8755.8864 5468.0373 -1.601 0.109461
BldgTypeTwnhs -33035.0180 4409.9702 -7.491 0.0000000000000994 ***
BldgTypeTwnhsE -24890.3681 2913.8486 -8.542 < 0.0000000000000002 ***
HouseStyle1.5Unf 11889.0049 7209.0554 1.649 0.099258 .
HouseStyle1Story 12194.6960 2964.4228 4.114 0.0000404115535015 ***
HouseStyle2.5Fin -29459.2362 13721.9210 -2.147 0.031915 *
HouseStyle2.5Unf 2051.1682 6548.8036 0.313 0.754150
HouseStyle2Story -5257.2460 2585.2720 -2.034 0.042121 *
HouseStyleSFoyer 4686.7154 4677.1713 1.002 0.316437
HouseStyleSLvl 2645.9104 3828.8869 0.691 0.489616
OverallQual 6810.0108 802.1970 8.489 < 0.0000000000000002 ***
OverallCond 5897.9892 706.0844 8.353 < 0.0000000000000002 ***
YearBuilt 222.6986 59.0092 3.774 0.000165 ***
YearRemodAdd 66.7363 44.2448 1.508 0.131615
RoofStyleGable 15187.2265 13251.3750 1.146 0.251887
RoofStyleGambrel 20295.8921 14646.0087 1.386 0.165964
RoofStyleHip 16784.8075 13341.9707 1.258 0.208512
RoofStyleMansard 1021.3063 15907.5396 0.064 0.948815
RoofStyleShed 16392.6544 26663.8659 0.615 0.538759
RoofMatlCompShg 551847.8645 30490.4352 18.099 < 0.0000000000000002 ***
RoofMatlMetal 606736.5639 42883.4945 14.148 < 0.0000000000000002 ***
RoofMatlRoll 561775.1857 39666.7182 14.162 < 0.0000000000000002 ***
RoofMatlTar&Grv 561213.2952 32529.6657 17.252 < 0.0000000000000002 ***
RoofMatlWdShake 550848.6415 32194.8551 17.110 < 0.0000000000000002 ***
RoofMatlWdShngl 674279.2205 36009.3200 18.725 < 0.0000000000000002 ***
MasVnrTypeBrkFace 3993.9865 6063.9130 0.659 0.510192
MasVnrTypeCBlock -59517.4286 35715.7384 -1.666 0.095776 .
MasVnrTypenone 5251.7401 8386.0737 0.626 0.531221
MasVnrTypeNone 7143.2877 6073.4108 1.176 0.239663
MasVnrTypeStone 6188.0979 6322.8162 0.979 0.327842
MasVnrArea 10.8310 5.0305 2.153 0.031425 *
ExterQualFa -23516.2234 7694.8111 -3.056 0.002270 **
ExterQualGd -25890.4792 3959.4028 -6.539 0.0000000000772073 ***
ExterQualTA -29047.2925 4435.8143 -6.548 0.0000000000726051 ***
ExterCondFa -2872.1944 10018.2995 -0.287 0.774374
ExterCondGd 2971.7605 9151.6533 0.325 0.745422
ExterCondPo -7081.4842 20613.4749 -0.344 0.731228
ExterCondTA 3888.4895 9065.0701 0.429 0.668001
FoundationCBlock 269.0556 2485.1273 0.108 0.913795
FoundationPConc 4521.8863 2741.0044 1.650 0.099147 .
FoundationSlab 3433.3545 7200.5323 0.477 0.633539
FoundationStone 8865.5328 9724.5331 0.912 0.362047
FoundationWood -4568.5388 15150.2535 -0.302 0.763025
BsmtQualFa -16660.3977 4828.9512 -3.450 0.000571 ***
BsmtQualGd -20156.7648 2729.6226 -7.384 0.0000000000002180 ***
BsmtQualnone -2134.3874 24987.6290 -0.085 0.931937
BsmtQualPo -12960.4448 21303.6750 -0.608 0.543009
BsmtQualTA -18874.8026 3409.2095 -5.536 0.0000000346568150 ***
BsmtCondFa 3292.2658 18142.3220 0.181 0.856017
BsmtCondGd 3377.1412 18064.3948 0.187 0.851717
BsmtCondnone NA NA NA NA
BsmtCondPo 24991.7197 26369.2947 0.948 0.343359
BsmtCondTA 2364.6003 17861.9327 0.132 0.894694
BsmtExposureGd 10646.4239 2440.3778 4.363 0.0000134617837265 ***
BsmtExposureMn -9390.1293 2475.6480 -3.793 0.000153 ***
BsmtExposureNo -8802.0734 1857.7668 -4.738 0.0000023001189487 ***
BsmtExposurenone -15132.7717 14391.0465 -1.052 0.293129
BsmtFinType1BLQ 23.3527 2302.8564 0.010 0.991910
BsmtFinType1GLQ 1963.5544 2071.6028 0.948 0.343316
BsmtFinType1LwQ -3916.5990 2894.5383 -1.353 0.176167
BsmtFinType1none NA NA NA NA
BsmtFinType1Rec -476.2670 2378.8807 -0.200 0.841338
BsmtFinType1Unf 113.6525 2361.6674 0.048 0.961622
BsmtFinSF1 28.0607 3.8994 7.196 0.0000000000008526 ***
BsmtFinType2BLQ -2285.8789 5490.6914 -0.416 0.677218
BsmtFinType2GLQ 5099.4234 6887.1331 0.740 0.459122
BsmtFinType2LwQ -4243.8420 5305.6623 -0.800 0.423875
BsmtFinType2none NA NA NA NA
BsmtFinType2Rec -1543.1538 5207.5363 -0.296 0.767006
BsmtFinType2Unf 2931.1412 5261.1981 0.557 0.577501
BsmtFinSF2 31.5440 6.7718 4.658 0.0000033861408377 ***
BsmtUnfSF 14.2267 3.5784 3.976 0.0000724945122263 ***
TotalBsmtSF NA NA NA NA
HeatingGasA -6717.9275 25975.3311 -0.259 0.795948
HeatingGasW -1361.2956 26497.4473 -0.051 0.959032
HeatingGrav -14307.8783 27869.7289 -0.513 0.607736
HeatingOthW -43210.6266 31869.7380 -1.356 0.175289
HeatingWall 7397.1416 29323.7992 0.252 0.800866
HeatingQCFa -591.4595 3675.0768 -0.161 0.872157
HeatingQCGd -2057.1082 1676.3982 -1.227 0.219920
HeatingQCPo -51472.7460 44881.7962 -1.147 0.251571
HeatingQCTA -3533.9748 1651.5354 -2.140 0.032483 *
CentralAirY -311.8612 2906.1391 -0.107 0.914552
ElectricalFuseF 1887.8064 4708.7534 0.401 0.688523
ElectricalFuseP -277.9348 9880.4680 -0.028 0.977561
ElectricalMix 794.8354 37147.4445 0.021 0.982931
ElectricalSBrkr -408.7491 2361.6989 -0.173 0.862609
FirstFlrSF 42.9278 4.0350 10.639 < 0.0000000000000002 ***
SecondFlrSF 62.8987 4.4237 14.218 < 0.0000000000000002 ***
LowQualFinSF 44.0808 13.3068 3.313 0.000939 ***
GrLivArea NA NA NA NA
BsmtFullBath 3367.4572 1559.0900 2.160 0.030892 *
BsmtHalfBath -5191.9846 2399.1910 -2.164 0.030570 *
FullBath 6883.5741 1773.6297 3.881 0.000107 ***
HalfBath 3943.0530 1679.9289 2.347 0.019008 *
BedroomAbvGr -2268.1559 1080.6592 -2.099 0.035946 *
KitchenAbvGr -15268.9538 4960.0325 -3.078 0.002108 **
KitchenQualFa -17906.5841 4982.9569 -3.594 0.000334 ***
KitchenQualGd -18639.9689 2858.5923 -6.521 0.0000000000870358 ***
KitchenQualTA -20054.5711 3198.6458 -6.270 0.0000000004361825 ***
TotRmsAbvGrd -455.9306 744.4026 -0.612 0.540286
FunctionalMaj2 -17853.0922 12080.8024 -1.478 0.139607
FunctionalMin1 690.3888 7823.2064 0.088 0.929687
FunctionalMin2 1834.0859 7798.2481 0.235 0.814082
FunctionalMod 792.2803 8428.6469 0.094 0.925119
FunctionalSal -5456.4197 27160.8608 -0.201 0.840802
FunctionalSev -24940.2358 20232.5183 -1.233 0.217830
FunctionalTyp 9667.7856 7029.2693 1.375 0.169163
Fireplaces 7786.0593 2056.6131 3.786 0.000157 ***
FireplaceQuFa -14339.0443 5917.7659 -2.423 0.015473 *
FireplaceQuGd -7969.7391 4800.0345 -1.660 0.096990 .
FireplaceQunone -6146.1535 5378.2358 -1.143 0.253257
FireplaceQuPo -13134.7656 6493.3906 -2.023 0.043219 *
FireplaceQuTA -10801.4757 4929.1155 -2.191 0.028533 *
GarageTypeAttchd 11893.0112 6831.1709 1.741 0.081829 .
GarageTypeBasment 12514.0352 8579.8356 1.459 0.144838
GarageTypeBuiltIn 9505.1865 7321.4981 1.298 0.194339
GarageTypeCarPort 6327.3163 10176.9946 0.622 0.534187
GarageTypeDetchd 11543.0317 6812.9192 1.694 0.090356 .
GarageTypenone 23715.7690 19446.3665 1.220 0.222771
GarageFinishnone -1929.3846 24439.4893 -0.079 0.937083
GarageFinishRFn -3739.4538 1598.3264 -2.340 0.019396 *
GarageFinishUnf -1549.9336 1928.8635 -0.804 0.421747
GarageCars 3689.9879 1841.8736 2.003 0.045261 *
GarageArea 22.2360 6.2100 3.581 0.000350 ***
GarageQualFa -36930.9691 23900.1038 -1.545 0.122440
GarageQualGd -20919.8202 23460.4248 -0.892 0.372650
GarageQualnone NA NA NA NA
GarageQualPo -42456.5374 30125.1635 -1.409 0.158880
GarageQualTA -37901.2894 23681.2020 -1.600 0.109640
GarageCondFa 35808.4419 22174.7204 1.615 0.106494
GarageCondGd 34813.8686 22949.3341 1.517 0.129417
GarageCondnone NA NA NA NA
GarageCondPo 38257.9514 24277.1162 1.576 0.115200
GarageCondTA 39464.5159 21798.4667 1.810 0.070370 .
PavedDriveP -2524.1790 4312.2529 -0.585 0.558374
PavedDriveY 1469.3221 2691.9176 0.546 0.585241
[ reached getOption("max.print") -- omitted 11 rows ]
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 24410 on 2142 degrees of freedom
Multiple R-squared: 0.9152, Adjusted R-squared: 0.9072
F-statistic: 113.9 on 203 and 2142 DF, p-value: < 0.00000000000000022
Make predictions on test set and calculate RMSE.
# note: predict.lm expects `newdata`; passing `data =` would silently return the fitted values
lm_preds <- predict(lm_fit, newdata = data_test)
rmse(data_test$SalePrice, lm_preds)
[1] 109610.3
Fit a random forest model on all features using the ranger package.
ranger_fit <- ranger(SalePrice ~ ., data = data_training)
Make predictions on test set and calculate RMSE.
ranger_preds <- predict(ranger_fit, data = data_test)
ranger_preds <- ranger_preds$predictions
rmse(data_test$SalePrice, ranger_preds)
[1] 21455.41
Source: https://christophm.github.io/interpretable-ml-book
The following content is based on Chapter 16 of “Hands-On Machine Learning with R” by Boehmke and Greenwell (https://bradleyboehmke.github.io/HOML/iml.html).
The iml package implements various model-agnostic post-hoc explainability methods. However, it needs a bit of setup.
# define a wrapper around the predict function of the ranger package
ranger_predict <- function(model, newdata) {
  # ranger's predict() returns a ranger.prediction object;
  # iml expects a data frame of predictions
  results <- predict(model, data = newdata)
  results <- as.data.frame(results$predictions)
  return(results)
}
# store the features and targets in separate variables
X <- data %>% select(-SalePrice)
y <- data %>% select(SalePrice)
# create IML predictor object for ranger
ranger_predictor <- Predictor$new(
model = ranger_fit,
data = X,
y = y,
predict.fun = ranger_predict,
class = "regression"
)
Permutation-based feature importance measures a feature's importance as the increase in the model's prediction error after permuting that feature's values. The idea is that randomly permuting the values of an important feature destroys any relationship between that feature and the target variable, so the model's performance should degrade. The permutation approach uses the difference (or ratio) between some baseline performance measure (e.g., RMSE) and the same performance measure obtained after permuting the values of a particular feature in the training data.
Algorithm 1: A simple algorithm for computing permutation-based variable importance for the feature set X.
1. Compute accuracy of the original model
2. For feature i in {1,...,p} do
| Permute values of feature i
| Make predictions for all observations in the training set
| Compute accuracy of permuted predictions
| Calculate feature importance, i.e., difference or ratio
between permuted and original accuracy
End
3. Sort variables by descending feature importance
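To make Algorithm 1 concrete, here is a minimal manual sketch for our ranger model. The helper permutation_importance is hypothetical (a single permutation per feature, evaluated on the test set); the iml call below is the convenient way to get the same information.
# hypothetical helper: single-permutation importance on held-out data
permutation_importance <- function(model, data, target, features) {
  # baseline error of the unpermuted model
  baseline <- rmse(data[[target]], predict(model, data = data)$predictions)
  imp <- sapply(features, function(f) {
    permuted <- data
    permuted[[f]] <- sample(permuted[[f]])  # destroy the feature-target relationship
    rmse(data[[target]], predict(model, data = permuted)$predictions) - baseline
  })
  sort(imp, decreasing = TRUE)  # larger RMSE increase = more important
}
permutation_importance(ranger_fit, data_test, "SalePrice",
                       c("GrLivArea", "OverallQual", "YearBuilt"))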
Calculate and plot permutation-based feature importance.
imp <- FeatureImp$new(ranger_predictor, loss = "rmse", compare = "difference",
n.repetitions = 1)
plot(imp)
To create a PDP, we split the feature of interest into j equally spaced values. For example, the GrLivArea feature ranges from 334 to 5095 square feet. Say the user selects j=20. We first create an evenly spaced grid of 20 values across the distribution of GrLivArea (i.e., 334.00, 584.58, ..., 5095.00). Then we make 20 copies of the original training data (one copy for each value in the grid). The algorithm sets GrLivArea for all observations to 334 in the first copy, 585 in the second copy, 835 in the third copy, ..., and finally to 5095 in the 20th copy (all other features remain unchanged). It then predicts the outcome for each observation in each of the 20 copies and averages the predicted values within each copy. These averaged predicted values are known as partial dependence values and are plotted against the 20 evenly spaced values of GrLivArea.
Algorithm 2: A simple algorithm for constructing the partial dependence plot of the response on a single predictor x.
For a selected predictor (x)
1. Construct a grid of j evenly spaced values across the distribution
of x: {x1, x2, ..., xj}
2. For i in {1,...,j} do
| Copy the training data and replace all original values of x
with the constant xi
| Compute predictions for all observations
| Average predictions
End
3. Plot the averaged predictions against x1, x2, ..., xj
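To illustrate, a hypothetical manual_pdp helper implementing Algorithm 2 by hand might look as follows (it assumes the ranger_fit model and data_training from above; FeatureEffect$new below is the convenient way).
# hypothetical helper implementing Algorithm 2 by hand
manual_pdp <- function(model, data, feature, grid.size = 20) {
  # 1. construct an evenly spaced grid across the observed range of the feature
  grid <- seq(min(data[[feature]]), max(data[[feature]]), length.out = grid.size)
  # 2. for each grid value: overwrite the feature everywhere, predict, average
  pd <- sapply(grid, function(v) {
    copy <- data
    copy[[feature]] <- v
    mean(predict(model, data = copy)$predictions)
  })
  tibble(grid_value = grid, partial_dependence = pd)
}
# 3. plot the averaged predictions against the grid
manual_pdp(ranger_fit, data_training, "GrLivArea") %>%
  ggplot(aes(grid_value, partial_dependence)) + geom_line()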
Calculate and plot a partial dependence plot (PDP) with the iml package.
pdp <- FeatureEffect$new(ranger_predictor, method = "pdp", feature = "GrLivArea", grid.size = 20)
plot(pdp)
An ICE plot visualizes the dependence of the predicted response on a feature for each instance separately, resulting in one line per observation, compared to the single line of a partial dependence plot. A PDP is simply the average of the lines of an ICE plot. Note that the following algorithm is the same as the PDP algorithm except for the last step, where the PDP averages the predicted values.
Algorithm 3: A simple algorithm for constructing the individual conditional expectation of the response on a single predictor x.
For a selected predictor (x)
1. Construct a grid of j evenly spaced values across the distribution
of x: {x1, x2, ..., xj}
2. For i in {1,...,j} do
| Copy the training data and replace the original values of x
with the constant xi
| Compute predictions for all observations
End
3. Plot the predictions against x1, x2, ..., xj with lines connecting
observations that correspond to the same row number in the original
training data
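The only change relative to the manual PDP sketch above is dropping the averaging step; a hypothetical manual_ice helper (same assumptions as manual_pdp) could look like this.
# hypothetical helper implementing Algorithm 3: one prediction per observation
# and grid value, with no averaging
manual_ice <- function(model, data, feature, grid.size = 20) {
  grid <- seq(min(data[[feature]]), max(data[[feature]]), length.out = grid.size)
  map_dfr(grid, function(v) {
    copy <- data
    copy[[feature]] <- v
    tibble(row = seq_len(nrow(data)), grid_value = v,
           prediction = predict(model, data = copy)$predictions)
  })
}
# one line per original observation (grouped by row number)
manual_ice(ranger_fit, data_training, "GrLivArea") %>%
  ggplot(aes(grid_value, prediction, group = row)) + geom_line(alpha = 0.1)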
Calculate and plot an ICE plot for all observations (the predictor object was built on the full dataset).
ice <- FeatureEffect$new(ranger_predictor, method = "pdp+ice",
feature = "GrLivArea")
plot(ice)
Another method for explaining individual predictions borrows ideas from game theory to produce what are called Shapley values.
The concept of Shapley values is based on the idea that the feature values of an individual observation work together to cause a change in the model’s prediction with respect to the model’s expected output, and it divides this total change in prediction among the feature values in a way that is “fair” to their contributions across all possible subsets of features.
Source: https://christophm.github.io/interpretable-ml-book
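Formally (following the notation of the interpretable-ml-book; the formula itself is not shown in the original notebook), the Shapley value of feature j is its marginal contribution averaged over all subsets S of the remaining features:

$$\phi_j = \sum_{S \subseteq \{1,\ldots,p\} \setminus \{j\}} \frac{|S|! \, (p - |S| - 1)!}{p!} \left( \mathrm{val}(S \cup \{j\}) - \mathrm{val}(S) \right)$$

where p is the number of features and val(S) is the model's prediction using only the feature values in S (the remaining features are marginalized out).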
To calculate Shapley values, we assess every combination of predictor values to determine each predictor value's impact. Focusing on feature xj and its value for the observation of interest, the approach tests how adding the value of xj to every possible combination of values of the other features changes the prediction on average.
For complex models, computing exact Shapley values is computationally expensive, so most implementations use sampling-based heuristics to calculate approximate Shapley values. The example below shows a single sampling repetition for estimating the impact of adding “cat banned” to the coalition of “park nearby”, “50sqm”, and “1st floor”. For this sample, the impact of “cat banned” is -€10,000. Repeating this process many times, holding the feature value “cat banned” fixed while combining it with other randomly drawn coalitions, yields an approximation of the Shapley value of “cat banned”.
Source: https://christophm.github.io/interpretable-ml-book
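A hypothetical sketch of this sampling approximation for a single feature, loosely following the permutation-based estimator of Štrumbelj and Kononenko (the helper name and m = 50 repetitions are assumptions; Shapley$new below is the proper implementation):
# in each repetition: draw a random observation z and a random feature order,
# then compare predictions with and without taking feature j's value from the
# instance of interest
shapley_one_feature <- function(model, data, x.interest, feature, m = 50) {
  phi <- replicate(m, {
    z <- data[sample(nrow(data), 1), ]   # random observation to fill the coalition
    perm <- sample(names(data))          # random feature order
    from_x <- perm[seq_len(match(feature, perm))]  # feature j and those "before" it
    x_plus <- z
    x_plus[from_x] <- x.interest[from_x]           # coalition with feature j from x
    x_minus <- x_plus
    x_minus[feature] <- z[feature]                 # same coalition, feature j from z
    predict(model, data = x_plus)$predictions -
      predict(model, data = x_minus)$predictions
  })
  mean(phi)  # average marginal contribution = approximate Shapley value
}
# e.g., the estimated contribution of GrLivArea for the first training observation
shapley_one_feature(ranger_fit, X, data_training[1, ], "GrLivArea")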
Calculate and plot Shapley values for the first observation in the training set.
shapley <- Shapley$new(ranger_predictor, x.interest = data_training[1,], sample.size = 10)
plot(shapley)