Chapter 6 Interactions and qualitative predictors

6.1 Transformations

Formulae in R linear models are much more powerful than what we have seen so far. In some cases we might be interested in transforming a variable before fitting the linear model.

6.1.1 A simple example

For instance, let’s take this synthetic dataset:

library(tibble)
set.seed(1) # fix the random seed so the example is reproducible
n <- 100
synth <- tibble(
  x = seq(from = 1, to = 30, length.out = n),
  y = log(x) + rnorm(n, 0, 0.2)
)

which looks like this:

library(ggplot2)
ggplot(synth) +
  geom_point(aes(x, y))

Of course, we can try to fit a linear model without any extra effort:

naive_lm <- lm(y ~ x, data = synth)
ggplot(naive_lm, mapping = aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", color = "red")
## `geom_smooth()` using formula = 'y ~ x'

but if we plot the residuals against the response, we can detect some issues for low values of \(y\): the residuals there are clearly not uncorrelated with the response.

library(dplyr)
synth %>%
  mutate(res = naive_lm$residuals) %>% # adds the residual column
  ggplot() +
  geom_point(aes(y, res))

By looking at how \(y\) varies with \(x\), we can fix this issue and fit the model on a transformation of \(x\), namely \(\log(x)\). We do this simply by adding the desired transformation to the formula, meaning that we don’t have to transform the dataset beforehand.

log_lm <- lm(y ~ log(x), data = synth)
ggplot(log_lm,
  mapping = aes(`log(x)`, y)
) + # notice the backticks!
  geom_point() +
  geom_smooth(method = "lm", color = "red")
## `geom_smooth()` using formula = 'y ~ x'

and here is the corresponding residual plot.

synth %>%
  mutate(res = log_lm$residuals) %>% # adds the residual column
  ggplot() +
  geom_point(aes(y, res))
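
As a quick numerical comparison of the two fits (a minimal check using the base sigma() extractor), the residual standard error of the transformed model should be noticeably smaller than that of the naive one.

sigma(naive_lm) # residual standard error of the naive fit
sigma(log_lm)   # should be roughly the true noise level used above (0.2)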

6.1.2 On advertising

Back to our real dataset.

library(readr)
advertising <- read_csv("./datasets/advertising.csv")
## Rows: 200 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (4): TV, Radio, Newspaper, Sales
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Let’s try applying a transformation to the most promising predictor; in particular, let us use \(\sqrt{TV}\). The choice of the square root transformation comes from a first look at the scatter plot shown previously (TV against Sales), where we can detect a slightly curved trend which resembles the curve \(y = \sqrt{x}\).

Let’s plot the data after the transformation and observe that its trend is better fitted by a straight line.

advertising %>%
  dplyr::select(Sales, TV) %>%
  dplyr::mutate(sqrtTV = sqrt(TV)) %>%
  ggplot(aes(sqrtTV, Sales)) +
  geom_point() +
  geom_smooth(method = "lm", color = "red")
## `geom_smooth()` using formula = 'y ~ x'

We can now fit a linear model on the transformed predictor; its summary table will show a better \(R^2\) score.

trans_lm <- lm(Sales ~ sqrt(TV), data = advertising)

Exercise: plot the regression line and compare it with the simple model fitted on the raw data

The following two commands might help in inspecting a linear model with transformed data.

head(model.matrix(trans_lm)) # prints the design matrix
##   (Intercept)  sqrt(TV)
## 1           1 15.169047
## 2           1  6.670832
## 3           1  4.147288
## 4           1 12.308534
## 5           1 13.446189
## 6           1  2.949576
trans_lm$terms # inspect the linear model specifics
## Sales ~ sqrt(TV)
## attr(,"variables")
## list(Sales, sqrt(TV))
## attr(,"factors")
##          sqrt(TV)
## Sales           0
## sqrt(TV)        1
## attr(,"term.labels")
## [1] "sqrt(TV)"
## attr(,"order")
## [1] 1
## attr(,"intercept")
## [1] 1
## attr(,"response")
## [1] 1
## attr(,".Environment")
## <environment: R_GlobalEnv>
## attr(,"predvars")
## list(Sales, sqrt(TV))
## attr(,"dataClasses")
##     Sales  sqrt(TV) 
## "numeric" "numeric"

6.2 Model selection

6.2.1 R-squared

One of the indicators computed with summary on a fitted linear model is the \(R^2\) index (R-squared, or coefficient of determination). It is defined as

\[ R^2 = 1 - \frac{\sum_i (y_i - \hat y_i)^2}{\sum_i (y_i - \bar y)^2} \]

and measures the proportion of the variance of the response that is explained by the model: the numerator is the residual sum of squares and the denominator is the total sum of squares. It can be used to compare different models, although it should not be considered an absolute score of the model. The closer it is to 1, the better the model fits the data.

This output, for instance, shows how the transformation used above seems to improve the accuracy of the regression task.

simple_lm <- lm(Sales ~ TV, data = advertising)
summary(simple_lm)$r.squared
## [1] 0.8121757
summary(trans_lm)$r.squared
## [1] 0.8216696
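
As a sanity check, we can also compute \(R^2\) directly from its definition (a minimal sketch; it should reproduce the value reported by summary()).

y <- advertising$Sales
rss <- sum(residuals(trans_lm)^2) # residual sum of squares
tss <- sum((y - mean(y))^2)       # total sum of squares
1 - rss / tss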

6.2.2 ANOVA

Another way of comparing two models, in particular a model with a smaller model nested inside it, is the ANOVA test, which is an instance of the F-test, with statistic

\[ F = \frac{\left(\frac{RSS_1 - RSS_2}{p_2 - p_1}\right)}{\left(\frac{RSS_2}{n - p_2}\right)} \]

where \(RSS_i\) is the residual sum of squares of model \(i\), \(p_i\) is its number of parameters, and \(p_2 > p_1\) (model 1 is the smaller, nested model). We know from theory that, under the null hypothesis that the smaller model is adequate, \(F \sim F(p_2 - p_1, n - p_2)\).

A low p-value for the F statistic means that we can reject the hypothesis that the smaller model explains the data well enough.
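
In R the ingredients of this statistic are easy to extract from fitted models: deviance() returns the RSS of an lm fit, df.residual() its residual degrees of freedom, and pf() gives the tail probability. Here is a runnable sketch using the synthetic models from the first example (the intercept-only model is nested in log_lm).

m_null <- lm(y ~ 1, data = synth)   # intercept-only model, nested in log_lm
rss1 <- deviance(m_null); df1 <- df.residual(m_null)
rss2 <- deviance(log_lm); df2 <- df.residual(log_lm)
f_stat <- ((rss1 - rss2) / (df1 - df2)) / (rss2 / df2)
p_value <- pf(f_stat, df1 - df2, df2, lower.tail = FALSE)
c(F = f_stat, p = p_value)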

Here we compare the model with a single predictor (the square root of TV), which is the smaller model, with a larger model that also includes Radio as an additional predictor.

double_lm <- lm(Sales ~ sqrt(TV) + Radio, data = advertising)
anova(trans_lm, double_lm)
## Analysis of Variance Table
## 
## Model 1: Sales ~ sqrt(TV)
## Model 2: Sales ~ sqrt(TV) + Radio
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    198 990.80                                  
## 2    197 410.88  1    579.92 278.04 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Exercise: compute both the F-statistic and the associated p-value “by hand” and verify your solution against the anova summary table.

6.3 Qualitative predictors

We recycle a textbook example taken from McClave JT., Benson PG. and Sincich T. (2014). Statistics for Business and Economics. Pearson Education Limited.
We want to study the effect on the response (variable distance) of four different brands of golf ball (A, B, C, D; variable brand) and of the club type (DRIVER/IRON; variable club). These two features are qualitative predictors, also called factors. A robot player is used.

wide_golf <- read_tsv("./datasets/golfer.tsv")
## Rows: 8 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (1): club
## dbl (4): A, B, C, D
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
spec(wide_golf) # prints information about the detected data types
## cols(
##   club = col_character(),
##   A = col_double(),
##   B = col_double(),
##   C = col_double(),
##   D = col_double()
## )

Although it’s not always necessary, it is best to explicitly tell R to interpret qualitative predictors as factors. With read_tsv we can do this by first reading the data as is, calling spec() on the resulting tibble to inspect the detected column types, and then re-reading the file with a corrected column specification.

In this case, the output says that the club column has been detected as a character column (col_character()), while we want R to treat it as a factor. Therefore we correct this by copying the output, manually changing the type of the club column from col_character() to col_factor(), and passing the result to the col_types parameter of read_tsv.

wide_golf <- read_tsv("./datasets/golfer.tsv",
  col_types = cols(
    club = readr::col_factor(),
    A = col_double(),
    B = col_double(),
    C = col_double(),
    D = col_double()
  )
)
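
A quick sanity check that the column is now stored as a factor:

class(wide_golf$club)  # should now be "factor"
levels(wide_golf$club)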

library(dplyr)
library(tidyr) # gather() comes from tidyr, not dplyr
# need to switch from wide to long format
golf <- wide_golf %>%
  gather(brand, distance, A:D)

## if using data.frame, you can use `melt()` from
## reshape2
# golf <- melt(wide_golf, id.vars = 1,
#             variable.name = "brand",
#             value.name = "distance")
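
Note that gather() is superseded in recent versions of tidyr; the equivalent call with the newer API would be pivot_longer() (a sketch, left commented out like the melt() alternative above).

# golf <- wide_golf %>%
#   pivot_longer(A:D, names_to = "brand", values_to = "distance")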

Let us study first the effect of brand on the distance.

lm_brand <- lm(distance ~ brand, data = golf)
summary(lm_brand)
## 
## Call:
## lm(formula = distance ~ brand, data = golf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.638 -31.703  -0.481  32.947  42.475 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  199.862     12.262  16.299 8.04e-16 ***
## brandB         8.338     17.341   0.481    0.634    
## brandC         5.275     17.341   0.304    0.763    
## brandD        -4.737     17.341  -0.273    0.787    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 34.68 on 28 degrees of freedom
## Multiple R-squared:  0.02322,    Adjusted R-squared:  -0.08143 
## F-statistic: 0.2219 on 3 and 28 DF,  p-value: 0.8804
model.matrix(lm_brand) # this is the design matrix
##    (Intercept) brandB brandC brandD
## 1            1      0      0      0
## 2            1      0      0      0
## 3            1      0      0      0
## 4            1      0      0      0
## 5            1      0      0      0
## 6            1      0      0      0
## 7            1      0      0      0
## 8            1      0      0      0
## 9            1      1      0      0
## 10           1      1      0      0
## 11           1      1      0      0
## 12           1      1      0      0
## 13           1      1      0      0
## 14           1      1      0      0
## 15           1      1      0      0
## 16           1      1      0      0
## 17           1      0      1      0
## 18           1      0      1      0
## 19           1      0      1      0
## 20           1      0      1      0
## 21           1      0      1      0
## 22           1      0      1      0
## 23           1      0      1      0
## 24           1      0      1      0
## 25           1      0      0      1
## 26           1      0      0      1
## 27           1      0      0      1
## 28           1      0      0      1
## 29           1      0      0      1
## 30           1      0      0      1
## 31           1      0      0      1
## 32           1      0      0      1
## attr(,"assign")
## [1] 0 1 1 1
## attr(,"contrasts")
## attr(,"contrasts")$brand
## [1] "contr.treatment"

6.3.1 Data visualization

To visualize the data, we can use boxplots or more specialized graphics for qualitative predictors (for the latter, we have to define the predictors explicitly as factor objects in R). Recall that prettier graphics can always be produced using ggplot2.

boxplot(distance ~ brand, data = golf)

golf$brand <- as.factor(golf$brand) # plot.design() expects explicit factors
plot.design(distance ~ brand, data = golf)
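
A ggplot2 version of the same boxplot (a minimal sketch):

ggplot(golf, aes(brand, distance)) +
  geom_boxplot()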

6.4 Interactions

Interactions are added in the lm formula. More specifically:

  • a:b (colon operator) includes only the interaction term between the two predictors
  • a*b (asterisk operator) includes the two predictors individually plus their interaction, i.e. writing y ~ a + b + a:b is equivalent to writing y ~ a*b (a quick check follows below)
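
As a quick check of this equivalence on our data, the two formulas produce identical coefficients.

lm_star <- lm(distance ~ brand * club, data = golf)
lm_plus <- lm(distance ~ brand + club + brand:club, data = golf)
all.equal(coef(lm_star), coef(lm_plus)) # should be TRUE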

Back to our dataset, let us now add the second factor club and build a ‘small’ additive model…

small_lm <- lm(distance ~ brand + club, data = golf)
summary(small_lm)
## 
## Call:
## lm(formula = distance ~ brand + club, data = golf)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9688  -5.2156   0.7375   5.2875  11.2063 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  231.531      3.032  76.371   <2e-16 ***
## brandB         8.338      3.835   2.174   0.0386 *  
## brandC         5.275      3.835   1.376   0.1803    
## brandD        -4.737      3.835  -1.235   0.2273    
## clubIRON     -63.337      2.712 -23.358   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.67 on 27 degrees of freedom
## Multiple R-squared:  0.9539, Adjusted R-squared:  0.9471 
## F-statistic: 139.8 on 4 and 27 DF,  p-value: < 2.2e-16

… and a larger one with interaction.

large_lm <- lm(distance ~ brand * club, data = golf)
summary(large_lm)
## 
## Call:
## lm(formula = distance ~ brand * club, data = golf)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.6750  -2.7000   0.3125   3.4875  11.8250 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      228.425      2.927  78.051  < 2e-16 ***
## brandB             5.300      4.139   1.281  0.21259    
## brandC            14.675      4.139   3.546  0.00165 ** 
## brandD             1.325      4.139   0.320  0.75163    
## clubIRON         -57.125      4.139 -13.802 6.55e-13 ***
## brandB:clubIRON    6.075      5.853   1.038  0.30966    
## brandC:clubIRON  -18.800      5.853  -3.212  0.00373 ** 
## brandD:clubIRON  -12.125      5.853  -2.072  0.04923 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.853 on 24 degrees of freedom
## Multiple R-squared:  0.9762, Adjusted R-squared:  0.9692 
## F-statistic: 140.4 on 7 and 24 DF,  p-value: < 2.2e-16

An interaction plot suggests interactions may be significant (the lines are not parallel), but we have to test for it since interaction plots are subject to sampling variation.

plot.design(distance ~ brand * club, data = golf)

interaction.plot(golf$brand, golf$club, golf$distance)
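
The same interaction plot can be drawn with ggplot2 by plotting the group means (a sketch).

golf %>%
  group_by(brand, club) %>%
  summarise(mean_distance = mean(distance), .groups = "drop") %>%
  ggplot(aes(brand, mean_distance, colour = club, group = club)) +
  geom_point() +
  geom_line()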

6.5 Model comparison

We can compare the complete model to a smaller one with an ANOVA test.

anova(large_lm)
## Analysis of Variance Table
## 
## Response: distance
##            Df Sum Sq Mean Sq  F value    Pr(>F)    
## brand       3    801     267   7.7908 0.0008401 ***
## club        1  32093   32093 936.7516 < 2.2e-16 ***
## brand:club  3    766     255   7.4524 0.0010789 ** 
## Residuals  24    822      34                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Calling anova() on a single fitted model, as we just did, builds the more traditional sequential ANOVA table. We can also compare the two models directly by passing both of them to anova().

anova(small_lm, large_lm)
## Analysis of Variance Table
## 
## Model 1: distance ~ brand + club
## Model 2: distance ~ brand * club
##   Res.Df     RSS Df Sum of Sq      F   Pr(>F)   
## 1     27 1588.20                                
## 2     24  822.24  3    765.96 7.4524 0.001079 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interactions are significant, meaning that the effect of the ball brand on distance depends on the club used: each combination of the two factors has its own expected distance, as the fitted values below show.

fitted(large_lm)
##       1       2       3       4       5       6       7       8       9      10 
## 228.425 228.425 228.425 228.425 171.300 171.300 171.300 171.300 233.725 233.725 
##      11      12      13      14      15      16      17      18      19      20 
## 233.725 233.725 182.675 182.675 182.675 182.675 243.100 243.100 243.100 243.100 
##      21      22      23      24      25      26      27      28      29      30 
## 167.175 167.175 167.175 167.175 229.750 229.750 229.750 229.750 160.500 160.500 
##      31      32 
## 160.500 160.500
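
Since the interaction model has one parameter per brand/club combination, its fitted values are exactly the per-cell means, which we can verify directly (a quick sketch).

golf %>%
  group_by(club, brand) %>%
  summarise(mean_distance = mean(distance), .groups = "drop")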

6.6 Predict

Especially when dealing with qualitative data, predictions for new unseen data can be easily computed with the predict function, which simply applies the regression coefficients to the provided data (arbitrarily generated below inside a data frame).

predict.lm(small_lm,
  newdata = data.frame(
    club = c("DRIVER", "IRON", "IRON"),
    brand = c("A", "B", "B")
  ),
  interval = "confidence"
)
##        fit      lwr      upr
## 1 231.5312 225.3108 237.7517
## 2 176.5312 170.3108 182.7517
## 3 176.5312 170.3108 182.7517
# for prediction intervals...
predict.lm(large_lm,
  newdata = data.frame(club = "DRIVER", brand = "A"),
  interval = "prediction",
  level = .99
)
##       fit      lwr      upr
## 1 228.425 210.1216 246.7284
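
To get a prediction for every brand/club combination at once, we can build the new data with expand.grid() (a quick sketch; point predictions only).

new_data <- expand.grid(
  brand = levels(golf$brand),
  club = levels(golf$club)
)
cbind(new_data, fit = predict(large_lm, newdata = new_data))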