Load required packages.
library(tidyverse)
library(ggthemr)
library(stargazer)
library(Metrics)
library(splines)
library(mgcv)
Set up the workspace, i.e., remove all existing data from working memory, initialize the random number generator, turn off scientific notation for large numbers, and set a standard theme for plotting.
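A minimal sketch of such a setup chunk is shown below; the seed value and the ggthemr palette name are placeholders, not taken from the original notebook.
rm(list = ls())        # remove all existing objects from working memory
set.seed(42)           # initialize the random number generator (placeholder seed)
options(scipen = 999)  # turn off scientific notation for large numbers
ggthemr("fresh")       # set a standard plotting theme (placeholder palette)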
Ask a home buyer to describe their dream house, and they probably won’t begin with the height of the basement ceiling or the proximity to an east-west railroad. But this dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence. With 76 predictor variables describing (almost) every aspect of residential homes in Ames, Iowa, this dataset challenges you to explain the final price of each home. Read more about this dataset: https://www.kaggle.com/c/house-prices-advanced-regression-techniques
From a theoretical perspective, real estate valuation models are based on hedonic pricing models introduced by Lancaster (1966) and Rosen (1974). The main idea of hedonic pricing models is that the overall price of a good can be estimated by decomposing it into its constituent characteristics, determining the value contribution of each individual characteristic, and summing up these contributions to obtain the final price. As we will see in this case study, hedonic pricing models can easily be implemented using multiple linear regression.
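In regression notation, a hedonic price equation has the form Price = β0 + β1·x1 + … + βk·xk + ε, where each xj is a characteristic of the house (living area, lot size, quality rating, …) and βj is its implicit, or hedonic, price.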
Load the data. For better reproducibility, we load fixed train and test sets from CSV files.
training <- read_csv("ameshousing_training.csv")
Rows: 2346 Columns: 77
── Column specification ────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (42): MSSubClass, MSZoning, Street, Alley, LotShape, LandContour, Utilities, LotConfig, LandSlope, Neighborh...
dbl (35): LotFrontage, LotArea, OverallQual, OverallCond, YearBuilt, YearRemodAdd, MasVnrArea, BsmtFinSF1, BsmtF...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
test <- read_csv("ameshousing_test.csv")
Rows: 584 Columns: 77
── Column specification ────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (42): MSSubClass, MSZoning, Street, Alley, LotShape, LandContour, Utilities, LotConfig, LandSlope, Neighborh...
dbl (35): LotFrontage, LotArea, OverallQual, OverallCond, YearBuilt, YearRemodAdd, MasVnrArea, BsmtFinSF1, BsmtF...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
We have already used linear regression with polynomial terms in Week 4. However, for completeness, we quickly revisit the idea and its implementation in R.
fit_poly <- lm(SalePrice ~ poly(GrLivArea, 3, raw = TRUE), data = training)
stargazer(fit_poly, intercept.bottom = FALSE, single.row = TRUE, type = "text")
===========================================================
Dependent variable:
---------------------------
SalePrice
-----------------------------------------------------------
Constant 32,559.340** (16,240.740)
poly(GrLivArea, 3, raw = TRUE)1 73.527*** (26.862)
poly(GrLivArea, 3, raw = TRUE)2 0.021 (0.014)
poly(GrLivArea, 3, raw = TRUE)3 -0.00000 (0.00000)
-----------------------------------------------------------
Observations 2,346
R2 0.523
Adjusted R2 0.522
Residual Std. Error 54,306.370 (df = 2342)
F Statistic 855.075*** (df = 3; 2342)
===========================================================
Note: *p<0.1; **p<0.05; ***p<0.01
As it is usually very difficult to interpret regression tables with polynomial terms, it is a good idea to plot the marginal effect of the predictor variable on the target.
# Create a grid of GrLivArea values
GrLivArea_range <- range(training$GrLivArea)
GrLivArea_grid <- data.frame(GrLivArea = seq(from = GrLivArea_range[1], to = GrLivArea_range[2]))
# Feed these values into the predict function; here we not only calculate predictions, but also their standard errors
preds <- predict(fit_poly, newdata = GrLivArea_grid, se.fit = TRUE)
# Gather the results in a data frame
sim_data <- data.frame(GrLivArea = GrLivArea_grid,
lci = preds$fit - 2 * preds$se.fit,
fit = preds$fit,
uci = preds$fit + 2 * preds$se.fit)
# Plot the data
ggplot(data = sim_data) +
geom_point(data = training, mapping = aes(x = GrLivArea, y = SalePrice), alpha = 0.3) +
# in geom_smooth we use 'identity' as we have already calculated the smooth function
geom_smooth(mapping = aes(x = GrLivArea, y = fit, ymin = lci, ymax = uci), stat = "identity")
[Figure: scatterplot of SalePrice vs. GrLivArea with the fitted cubic polynomial and its confidence band]
Instead of fitting one global polynomial model, we can also divide the predictor space into regions and fit more or less complex individual functions to these regions. The simplest approach is to use piecewise constant functions.
fit_stepfun <- lm(SalePrice ~ cut(GrLivArea, 4), data = training)
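To see which four intervals cut() creates and how many training observations fall into each, a quick check (not part of the original output) is:
table(cut(training$GrLivArea, 4))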
Print regression table to see coefficient estimates, standard errors, p-values, and R2. The predictors in a step function regression are binary indicator variables (dummy variables) and, hence, are easy to interpret.
stargazer(fit_stepfun, intercept.bottom = FALSE, single.row = TRUE, type = "text")
================================================================
Dependent variable:
---------------------------
SalePrice
----------------------------------------------------------------
Constant 140,204.100*** (1,661.766)
cut(GrLivArea, 4)(1.5e+03,2.56e+03] 80,353.260*** (2,544.750)
cut(GrLivArea, 4)(2.56e+03,3.62e+03] 205,820.100*** (7,307.622)
cut(GrLivArea, 4)(3.62e+03,4.68e+03] 336,685.900*** (26,867.220)
----------------------------------------------------------------
Observations 2,346
R2 0.418
Adjusted R2 0.417
Residual Std. Error 59,961.910 (df = 2342)
F Statistic 561.064*** (df = 3; 2342)
================================================================
Note: *p<0.1; **p<0.05; ***p<0.01
We can also manually set the regions for the step functions using the cut function.
break_points <- c(0, 1000, 2000, 6000)
fit_stepfun <- lm(SalePrice ~ cut(GrLivArea, breaks = break_points),
data = training)
stargazer(fit_stepfun, intercept.bottom = FALSE, single.row = TRUE, type = "text")
==============================================================================
Dependent variable:
---------------------------
SalePrice
------------------------------------------------------------------------------
Constant 112,599.800*** (3,389.495)
cut(GrLivArea, breaks = break_points)(1e+03,2e+03] 62,448.470*** (3,723.251)
cut(GrLivArea, breaks = break_points)(2e+03,6e+03] 173,256.300*** (4,950.424)
------------------------------------------------------------------------------
Observations 2,346
R2 0.351
Adjusted R2 0.350
Residual Std. Error 63,321.000 (df = 2343)
F Statistic 633.229*** (df = 2; 2343)
==============================================================================
Note: *p<0.1; **p<0.05; ***p<0.01
Although the regression coefficients in step function regression are easy to interpret, it is often enlightening to plot the predictions for a set of predictor variable values.
GrLivArea_range <- range(training$GrLivArea)
GrLivArea_grid <- data.frame(GrLivArea = seq(from = GrLivArea_range[1], to = GrLivArea_range[2]))
preds <- predict(fit_stepfun, newdata = GrLivArea_grid, se.fit = TRUE)
sim_data <- data.frame(GrLivArea = GrLivArea_grid,
lci = preds$fit - 2 * preds$se.fit,
fit = preds$fit,
uci = preds$fit + 2 * preds$se.fit)
ggplot(data = sim_data) +
geom_point(data = training, mapping = aes(x = GrLivArea, y = SalePrice), alpha = 0.3) +
geom_smooth(mapping = aes(x = GrLivArea, y = fit, ymin = lci, ymax = uci), stat = "identity") +
geom_vline(data = as.data.frame(break_points), mapping = aes(xintercept = break_points), color = "red", linetype="dashed")
[Figure: scatterplot of SalePrice vs. GrLivArea with the fitted step function and the manually chosen break points marked as dashed red lines]
Instead of using piecewise constant functions for the different regions, we can also use linear or polynomial (typically cubic) functions. The bs function can be used to define the knots (boundaries between regions) and the degree of the polynomial (e.g., 1 for linear, 2 for quadratic, 3 for cubic, …) for each predictor variable.
k <- c(1000, 2000)
fit_splines <- lm(SalePrice ~ bs(GrLivArea, knots = k, degree = 1), data = training)
How well does the model fit the training data?
summary(fit_splines)$r.squared
[1] 0.5224987
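For comparison, a cubic spline with the same knots could be fit analogously; this is a sketch and not part of the original analysis.
fit_splines_cubic <- lm(SalePrice ~ bs(GrLivArea, knots = k, degree = 3), data = training)
summary(fit_splines_cubic)$r.squared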
Note that it is very difficult, if not impossible, to interpret the individual regression coefficients and their combined effect on the target variable from the regression table. Hence, one typically plots the fitted values for a set of predefined predictor values.
GrLivArea_range <- range(training$GrLivArea)
GrLivArea_grid <- data.frame(GrLivArea = seq(from = GrLivArea_range[1], to = GrLivArea_range[2]))
preds <- predict(fit_splines, newdata = GrLivArea_grid, se.fit = TRUE)
sim_data <- data.frame(GrLivArea = GrLivArea_grid,
lci = preds$fit - 2 * preds$se.fit,
fit = preds$fit,
uci = preds$fit + 2 * preds$se.fit)
ggplot(data = sim_data) +
geom_point(data = training, mapping = aes(x = GrLivArea, y = SalePrice), alpha = 0.3) +
geom_smooth(mapping = aes(x = GrLivArea, y = fit, ymin = lci, ymax = uci), stat = "identity") +
geom_vline(data = as.data.frame(k), mapping = aes(xintercept = k), color = "red", linetype="dashed")
Smoothing splines take the idea of splines to the extreme. In broad terms, the idea is to fit a natural cubic spline with knots at all unique values of the predictor variable. This basically means connecting all the dots on a scatterplot with a cubic function, which of course leads to overfitting. Hence, besides a loss function, smoothing splines also include a penalty term (lambda), which allows us to control how jumpy or smooth the function should be. If lambda = 0, the function will be very jumpy and exactly interpolate the training observations; as lambda approaches infinity, the function becomes perfectly smooth, i.e., a straight line just like in traditional OLS regression. The optimal value for lambda is typically determined empirically via cross-validation.
fit_smoothspline <- smooth.spline(y = training$SalePrice, x = training$GrLivArea)
# Note that we do not have to set lambda manually; it is determined automatically through cross-validation.
# But if you wish, you can also define the smoothing parameter manually, for example:
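# A hypothetical manual setting (not used below): spar lies roughly in [0, 1], larger values give a smoother curve
fit_smoothspline_manual <- smooth.spline(x = training$GrLivArea, y = training$SalePrice, spar = 0.8)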
Look at the results of fitting the smoothing spline to the training data.
fit_smoothspline
Call:
smooth.spline(x = training$GrLivArea, y = training$SalePrice)
Smoothing Parameter spar= 0.6269862 lambda= 0.000007598069 (11 iterations)
Equivalent Degrees of Freedom (Df): 32.99474
Penalized Criterion (RSS): 4180566557911
GCV: 2891025719
Plot the smoothing spline and training data.
GrLivArea_range <- range(training$GrLivArea)
GrLivArea_grid <- data.frame(GrLivArea = seq(from = GrLivArea_range[1], to = GrLivArea_range[2]))
# predict.smooth.spline expects the new x values via the 'x' argument (not 'newdata')
preds <- as.data.frame(predict(fit_smoothspline, x = GrLivArea_grid$GrLivArea))
ggplot(data = preds) +
geom_point(data = training, mapping = aes(x = GrLivArea, y = SalePrice), alpha = 0.3) +
geom_line(mapping = aes(x = x, y = y), color = "red")
GAMs allow us to flexibly estimate a response variable by modeling it via different non-linear basis functions (e.g., polynomial regression, linear splines, smoothing splines) of each of the predictor variables, while maintaining the additivity assumption. Hence, they can be seen as a tool that unifies the methods introduced above in one framework. In the example below, we estimate SalePrice by a linear spline of OverallQual with two automatically placed knots (as indicated by the function bs with df = 3; see the documentation of bs() for details) and a smoothing spline of LotArea (indicated by the function s). The optimal lambda value for the smoothing spline is automatically determined via cross-validation.
b <- c(2, 5, 8)
fit_gam <- gam(SalePrice ~ bs(OverallQual, df = 3, degree = 1) +
s(LotArea),
data = training)
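Note that the vector b defined above is not actually used in this fit. If you prefer to place the spline knots manually rather than letting bs() choose them via df, you could pass them explicitly, e.g. (a sketch, not part of the original analysis):
fit_gam_knots <- gam(SalePrice ~ bs(OverallQual, knots = b, degree = 1) + s(LotArea), data = training)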
Let’s take a look at the regression table - but don’t expect to be able to intuitively interpret the estimated coefficients.
summary(fit_gam)
Family: gaussian
Link function: identity
Formula:
SalePrice ~ bs(OverallQual, df = 3, degree = 1) + s(LotArea)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 45556 7873 5.787 0.00000000815 ***
bs(OverallQual, df = 3, degree = 1)1 89102 8357 10.662 < 0.0000000000000002 ***
bs(OverallQual, df = 3, degree = 1)2 154014 7933 19.415 < 0.0000000000000002 ***
bs(OverallQual, df = 3, degree = 1)3 368507 9154 40.257 < 0.0000000000000002 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(LotArea) 8.094 8.749 75.68 <0.0000000000000002 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.757 Deviance explained = 75.9%
GCV = 1.5049e+09 Scale est. = 1.4971e+09 n = 2346
Instead, we plot the model again. The plot.gam function does not produce the nicest plots, but it saves us a lot of time compared to manually simulating and plotting data.
plot.gam(fit_gam, se=TRUE, all.terms = TRUE)
[Figure: plot.gam output showing the fitted linear spline of OverallQual and the smoothing spline of LotArea with standard error bands]
Use the GAM to make predictions on the test set and calculate RMSE.
test_preds_gam <- predict(fit_gam, newdata = test)
rmse(test_preds_gam, test$SalePrice)
[1] 41609.51
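For reference, rmse() from the Metrics package computes the square root of the mean squared difference between actual and predicted values, so the same number can be obtained manually:
sqrt(mean((test$SalePrice - test_preds_gam)^2))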
Compare the GAM model with a standard OLS regression using the same predictors.
fit_OLS <- lm(SalePrice ~ OverallQual + LotArea, data = training)
test_preds_OLS <- predict(fit_OLS, newdata = test)
rmse(test_preds_OLS, test$SalePrice)
[1] 47278.11