Module 2: Linear models

Author

Karl Ove Hufthammer

Introduction

In this module, we will look at a dataset on lung capacity and smoking in young people. We will try to estimate the mean lung capacity as a function of age, smoking and a few other variables.

Note

A proper analysis of these data requires domain expertise and detailed information on how the data were collected. We will instead perform some very simple analyses, just to get a feel for how linear models work – and to learn how to fit linear regression models using the glmmTMB package.

The data

In our data, lung capacity is measured by forced expiratory volume (FEV). For this exercise, we only need to know that higher values are better.

The data are stored in the lungcap dataset in the GLMsData package:

library(GLMsData) # Dataset package
data(lungcap)

This is a cross-sectional dataset on 654 subjects. Here are the first and last six rows:

head(lungcap)
  Age   FEV Ht Gender Smoke
1   3 1.072 46      F     0
2   4 0.839 48      F     0
3   4 1.102 48      F     0
4   4 1.389 48      F     0
5   4 1.577 49      F     0
6   4 1.418 49      F     0
tail(lungcap)
    Age   FEV   Ht Gender Smoke
649  16 4.070 69.5      M     1
650  16 4.872 72.0      M     1
651  17 3.082 67.0      M     1
652  17 3.406 69.0      M     1
653  18 4.086 67.0      M     1
654  18 4.404 70.5      M     1

Type ?lungcap to see some information on the dataset and the variables it contains.

Plotting the data

We will fit a regression model where we use a person’s age to predict their (expected) FEV. But before diving into the regression model, we should always plot our data:

library(ggplot2) # Visualisation package
ggplot(lungcap, aes(x = Age, y = FEV)) +
  geom_point()

It looks like there’s a strong association between age and lung capacity. Perhaps a simple linear regression might fit?

We also see that the age values are recorded as integers, which leads to overplotting (many points on top of each other). It’s easier to see the amount of data we have and notice any patterns if we add a small amount of noise to the Age variable:

# We’ll store the figure (for later use)
# as a variable named fig_fev
fig_fev = ggplot(lungcap, aes(x = Age, y = FEV)) +
  geom_jitter(position = position_jitter(width = .2))
plot(fig_fev)

A simple regression model

OK, let’s fit a linear regression model:

library(glmmTMB) # Regression model package
mod_age = glmmTMB(FEV ~ Age, data = lungcap)
# The function has default arguments,
# so the above is equivalent to
# glmmTMB(FEV ~ Age, dispformula = ~1, ziformula = ~0,
#         family = gaussian(), data = lungcap)

summary(mod_age)
 Family: gaussian  ( identity )
Formula:          FEV ~ Age
Data: lungcap

     AIC      BIC   logLik deviance df.resid 
  1119.0   1132.5   -556.5   1113.0      651 


Dispersion estimate for gaussian family (sigma^2): 0.321 

Conditional model:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) 0.431648   0.077776    5.55 2.86e-08 ***
Age         0.222041   0.007507   29.58  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The coefficient of interest is Age, and 0.22 is the slope of the regression line relating age to FEV. In other words, subjects of a certain age have an estimated mean FEV which is 0.22 higher than that of subjects who are one year younger. The small P-value (< 2 × 10⁻¹⁶) shows that the slope is statistically significant.
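
To make this concrete, we can compare the model’s predicted mean FEV at two adjacent ages (a quick check, using predict() on the model we just fitted):

predict(mod_age, newdata = data.frame(Age = c(9, 10)))
# The difference between the two predictions
# equals the Age coefficient, 0.22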

Note

The data are from a cross-sectional study. That is, subjects of different ages were measured at a single point in time. There is only one observation for each subject; the subjects were not followed over time (that would be a longitudinal study, which we would have to analyse with a different type of regression model).

So based on just these data, we can’t really say that FEV increases by 0.22 for each year. (But if the distribution of age and height and the relationship between these variables were stable over time, that would be the implication.)

Let’s extract the intercept and slope and use them to add the regression line to the plot.

coef_fev = fixef(mod_age)$cond
coef_fev
(Intercept)         Age 
  0.4316481   0.2220410 
fig_fev +
  geom_abline(intercept = coef_fev[1],
              slope = coef_fev[2],
              col = "red", linewidth = 1)

For more complicated models, it’s easier to let the computer add the regression line(s) automatically:

library(marginaleffects)
plot_predictions(mod_age, condition = "Age",
                 vcov = FALSE, # See note below
                 points = .25) # Semi-transparent points

Note

Due to a possible bug when calculating standard errors, plot_predictions() currently refuses to plot confidence intervals for glmmTMB models by default, and we need to use the vcov = FALSE argument to avoid an error message.

However, for linear models, it’s actually safe to use the standard errors, and we can change the argument to vcov = vcov(mod_age) (or whatever the name of the model is) to display confidence intervals. We will do this in the following plot_predictions() figures.

Exercise 1

The first coefficient, 0.43, corresponds to the estimated FEV value for a subject of a certain age. Which age is this?

But why should we be wary of actually interpreting the coefficient this way?

Model assumptions and residuals

Linear regression models rely on several assumptions.

  • The observations are independent.
  • The formula relating the explanatory variables (here: just age) to the mean outcome (here: FEV) is correct. That is, there’s a ‘straight line’ relationship between age and mean FEV.
  • The outcomes are normally distributed with identical standard deviations (homoscedasticity) conditional on the explanatory variables (or really on the linear predictor).

The first assumption typically follows from the design of the study and the way the data are collected. For this exercise, we’ll just assume it’s true.

To check the second assumption, we can plot the residuals (observed FEV minus estimated mean FEV for each subject) against the predicted FEV values. We also add a nonparametric smoother:

scatter.smooth(predict(mod_age), residuals(mod_age))

If the functional form of the relationship between the explanatory variables and the outcome is correct, the residuals should be centred around 0. The smooth line should then look (approximately) like a horizontal line (‘y = 0’).

Exercise 2

Does the functional form of the relationship between age and FEV look correct?

Exercise 3

Does the standard deviation around zero seem to be constant (homoscedasticity), i.e., does it seem to be the same for each predicted value?

There are also plots we can use to check if the residuals look normally distributed (QQ plots), but we’ll skip this for now and come back to such plots in a later module.

Residual plots for each explanatory variable

In general, we should also plot the residuals against each explanatory variable.

Since there’s only a single explanatory variable, Age, plotting the residuals against age is equivalent to plotting the residuals against the predicted values (only the labelling on the axis changes):

scatter.smooth(lungcap$Age, residuals(mod_age))

Warning

This second code example works fine here, but is in general not safe. If there are missing data (NA values) in any of the variables included in the model, the output of the residuals(…) call will include fewer observations than the number of rows in the dataset (NA values are skipped). This means that the two vectors (lungcap$Age and residuals(mod_age)) won’t match up. There are ways to avoid this problem, which we will look at in a later module.
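
One possible workaround (just a sketch – we’ll look at more robust approaches later) is to take the Age values from the data that were actually used in the fit:

# The Age values from the model frame always match up with
# the residuals, even if rows with NA values were dropped
scatter.smooth(model.frame(mod_age)$Age, residuals(mod_age))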

Note

In regression, different people and textbooks use different terminology. The following terms mean (more or less) the same thing:

  • outcome = response (variable) = dependent variable
  • explanatory variable = predictor = covariate = independent variable = feature

We will use the terms interchangeably.

Smoking as a predictor

We will now look at smoking. We might expect that smokers have lower lung capacity than non-smokers. Let’s check that, using a simple linear model.

The Smoke variable is coded as 0 for non-smokers and 1 for smokers. We can fit a regression model using this variable directly. But for categorical variables with more than two possible values (levels), we need to code the variable as a factor. So we’ll do this here too. It will also result in nicer labels in tables and plots.

lungcap$smoke_fac = factor(lungcap$Smoke,
                           levels = c(0, 1), 
                           labels = c("Non-smoker", "Smoker"))
head(lungcap)
  Age   FEV Ht Gender Smoke  smoke_fac
1   3 1.072 46      F     0 Non-smoker
2   4 0.839 48      F     0 Non-smoker
3   4 1.102 48      F     0 Non-smoker
4   4 1.389 48      F     0 Non-smoker
5   4 1.577 49      F     0 Non-smoker
6   4 1.418 49      F     0 Non-smoker

The first level (here: "Non-smoker") is automatically treated as the reference/baseline level in regression models.
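
You can check the level order (and thus the reference level) with levels():

levels(lungcap$smoke_fac)
[1] "Non-smoker" "Smoker"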

We fit a similar model as before:

mod_smoke = glmmTMB(FEV ~ smoke_fac, data = lungcap)
summary(mod_smoke)
 Family: gaussian  ( identity )
Formula:          FEV ~ smoke_fac
Data: lungcap

     AIC      BIC   logLik deviance df.resid 
  1633.8   1647.2   -813.9   1627.8      651 


Dispersion estimate for gaussian family (sigma^2): 0.705 

Conditional model:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)      2.56614    0.03461   74.15  < 2e-16 ***
smoke_facSmoker  0.71072    0.10977    6.47 9.52e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Try also fitting a model with Smoke (instead of smoke_fac) as a predictor. Verify that you get identical coefficient estimates.
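
For example (the model name mod_smoke_raw is just a suggestion):

mod_smoke_raw = glmmTMB(FEV ~ Smoke, data = lungcap)
fixef(mod_smoke_raw)$cond # Same estimates as for mod_smoke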

Exercise 4

Interpret the two coefficients ((Intercept) and Smoke) in your model.

Does your interpretation agree with our initial hypothesis (that smokers have lower lung capacity than non-smokers)? If not, do you have any idea why this could be?

You can also plot the predicted mean FEV values to check your answer:

plot_predictions(mod_smoke, condition = "smoke_fac",
                 vcov = vcov(mod_smoke))

Stratifying by smoking status

Let’s look at the data again, but now separately for smokers and non-smokers:

fig_fev %+% 
  lungcap + # Updated version containing the ‘smoke_fac’ variable
  aes(colour = smoke_fac)

Do you now understand why you got the (misleading) coefficients in your previous smoker model?

Smoking and age as predictors

Age affects a subject’s probability of being a smoker, and age also affects the subject’s FEV. This means that age is a confounder for the effect of smoking on FEV.

This in turn means that we have to adjust for age when trying to figure out the effect of smoking on FEV. We can do this in various ways, for example using regression.

Fit a linear regression model with both smoking and age as explanatory variables. Name the model mod_smoke_age. The right-hand side of the regression formula should look like this: smoke_fac + Age
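
In code, that is:

mod_smoke_age = glmmTMB(FEV ~ smoke_fac + Age, data = lungcap)
summary(mod_smoke_age)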

Make sure you can interpret the coefficients. Plot the two regression lines estimated by the model using a command like this:

plot_predictions(mod_smoke_age, condition = c("Age", "smoke_fac"),
                 vcov = vcov(mod_smoke_age), points = .25)

Also try running the same plot_predictions() command, but now with condition = c("smoke_fac", "Age").

(For more information on plot_predictions() and its various options, see the official documentation.)

Interactions

In the previous model, we assumed that the effect of age was the same for both smokers and non-smokers. Is it?

We can check this by fitting a model with an interaction for age and smoking. Since there are few smokers in the younger age groups (and to make the plots look less confusing) we’ll fit the models only on subjects older than 10 years:

lungcap_older = subset(lungcap, Age > 10)
mod_interac = glmmTMB(FEV ~ Age * smoke_fac, data = lungcap_older)
# This is the same as
#   glmmTMB(FEV ~ Age + smoke_fac + Age:smoke_fac,
#           data = lungcap_older)

summary(mod_interac)
 Family: gaussian  ( identity )
Formula:          FEV ~ Age * smoke_fac
Data: lungcap_older

     AIC      BIC   logLik deviance df.resid 
   564.8    582.7   -277.4    554.8      259 


Dispersion estimate for gaussian family (sigma^2): 0.479 

Conditional model:
                    Estimate Std. Error z value Pr(>|z|)    
(Intercept)          1.09369    0.34016   3.215   0.0013 ** 
Age                  0.17930    0.02699   6.642 3.09e-11 ***
smoke_facSmoker      1.29696    0.69312   1.871   0.0613 .  
Age:smoke_facSmoker -0.11261    0.05074  -2.219   0.0265 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The P-value for the Age:smoke_facSmoker interaction indicates that there’s an interaction. The effect of age on FEV depends on whether you’re a smoker or a non-smoker. The coefficient is negative, which means that the slope of the age line is less steep for smokers. In other words, if you’re a smoker, being older doesn’t help much when it comes to lung function!

Or, equivalently, we can say that the effect of smoking differs between subjects of different ages. For a smoker, the deficit in expected lung function compared to a non-smoker of the same age is greater the older you are.

All this is easier to see on a plot:

plot_predictions(mod_interac, condition = c("Age", "smoke_fac"),
                 vcov = vcov(mod_interac), points = .25)

Exercise 5

Use the coefficient estimates to answer the following questions:

  1. What is the slope of the regression line for a non-smoker?
  2. What is the slope of the regression line for a smoker?
  3. What is the estimated mean FEV for an 18-year-old non-smoker?
  4. What is the estimated mean FEV for an 18-year-old smoker?
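
If you want to check your answers to questions 3 and 4, one option is to use predict() (a sketch, based on the mod_interac model from above):

nd = data.frame(Age = 18,
                smoke_fac = factor(c("Non-smoker", "Smoker"),
                                   levels = c("Non-smoker", "Smoker")))
predict(mod_interac, newdata = nd) # Estimated mean FEV at age 18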

Main effects in interaction models

Warning

It’s very tempting to interpret the P-values for the main effects (Age and smoke_facSmoker) in the interaction model as evidence for or against there being an ‘age’ or ‘smoking’ effect. Resist the temptation! These P-values seldom make sense. They are technically correct, but they usually don’t test reasonable/useful hypotheses.

When we have an interaction between two variables, x and z, the coefficient for the ‘main effect’ of x measures the effect of x on the outcome when z is equal to its reference level (for categorical variables) or equal to 0 (for continuous variables).

So here, the Age coefficient (0.18) is the slope of the regression line for a non-smoker. And the smoke_facSmoker coefficient (1.30) is the effect of smoking for a person aged 0 years. (We’re now extrapolating far outside the area where we have data!)

If there’s a statistically significant interaction term, this in itself indicates that there’s an effect of both variables (here age and smoking). But the effect of the first one depends on the value of the second one, and vice versa.

To illustrate how reference levels affect the coefficients, let’s create a new smoking variable, smoke_fac_rev, where we use smoker instead of non-smoker as the reference level:

lungcap_older$smoke_fac_rev = relevel(lungcap_older$smoke_fac, "Smoker")

Fit a similar interaction model as before but with smoke_fac_rev instead of smoke_fac.

mod_interac2 = glmmTMB(FEV ~ Age * smoke_fac_rev,
                       data = lungcap_older)
summary(mod_interac2)
 Family: gaussian  ( identity )
Formula:          FEV ~ Age * smoke_fac_rev
Data: lungcap_older

     AIC      BIC   logLik deviance df.resid 
   564.8    582.7   -277.4    554.8      259 


Dispersion estimate for gaussian family (sigma^2): 0.479 

Conditional model:
                            Estimate Std. Error z value Pr(>|z|)    
(Intercept)                  2.39065    0.60391   3.959 7.54e-05 ***
Age                          0.06669    0.04297   1.552   0.1206    
smoke_fac_revNon-smoker     -1.29696    0.69312  -1.871   0.0613 .  
Age:smoke_fac_revNon-smoker  0.11261    0.05074   2.219   0.0265 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Compare the coefficients and P-values for the main effects and interaction with the corresponding coefficients and P-values in the previous interaction model.

Exercise 6

How would you interpret the coefficient and P-value for the Age main effect in this model? Is it consistent with the estimated regression lines? Is there an effect of age on FEV?

Also plot the model to confirm that you get the exact same regression lines as before. The model is the same; it’s only the parameterisation that’s different.
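
For example, using the same plot_predictions() pattern as before:

plot_predictions(mod_interac2, condition = c("Age", "smoke_fac_rev"),
                 vcov = vcov(mod_interac2), points = .25)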

Non-linear effects

Let’s ignore smoking for a while. Recall the residual plot for our simple ‘age predicting FEV’ model:

scatter.smooth(predict(mod_age), residuals(mod_age))

The residuals were mostly centred around 0, but for high fitted values (in this simple model corresponding to high Age values), there was some evidence of a ‘drop’ (negative residuals).

Since the residuals are the observed FEV values minus the predicted FEV values, it looks like the model overestimates the mean FEV value for older subjects.

Some naïve thoughts:

It makes sense that lung capacity increases with age. As you get older, your entire body grows, including your lungs and chest cavity, so we might expect better lung function with increasing age. But in the later teenage years, you are almost fully grown, and we might expect a plateau.

Our dataset also includes a variable for body height (Ht, measured in inches), so we can check when the young people in the study have reached their maximum height (and therefore perhaps their maximum FEV?). We’ll also add a simple non-parametric smoother to more clearly see the pattern:

ggplot(lungcap, aes(x = Age, y = Ht)) +
  geom_jitter(position = position_jitter(width = .2)) +
  geom_smooth()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

The height seems to have plateaued at around 14–15 years of age. So perhaps the FEV will have a similar plateau, meaning that a linear model is not adequate?

We will check this by fitting a model with a quadratic effect of age:

mod = glmmTMB(FEV ~ Age + I(Age^2), data = lungcap)
summary(mod)
 Family: gaussian  ( identity )
Formula:          FEV ~ Age + I(Age^2)
Data: lungcap

     AIC      BIC   logLik deviance df.resid 
  1102.8   1120.7   -547.4   1094.8      650 


Dispersion estimate for gaussian family (sigma^2): 0.312 

Conditional model:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.360359   0.199329  -1.808   0.0706 .  
Age          0.385707   0.038733   9.958  < 2e-16 ***
I(Age^2)    -0.007764   0.001804  -4.305 1.67e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Note

The ^ symbol has a special meaning in formulas, so we have to wrap it in an I() function for it to retain its mathematical meaning.
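
To see the difference, you can compare the terms the two formulas generate (without I(), the quadratic term silently disappears, since ^ then means formula crossing):

attr(terms(FEV ~ Age + Age^2), "term.labels")    # Just "Age"
attr(terms(FEV ~ Age + I(Age^2)), "term.labels") # "Age" and "I(Age^2)"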

Exercise 7

Look at the P-value for the second-degree term. Is this evidence for or against (just) a linear effect of age on FEV?

Note

Even with a second-degree term, this model is – perhaps confusingly – still called a linear regression model. A linear regression model is linear in the parameters (i.e., the coefficients). But the predictors can have arbitrary transformations, like in the above model.

(You don’t have to hand in your answers for the following.)

Plot the estimated regression line using plot_predictions(). Does it look reasonable / does it fit the data?
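
For example, following the same pattern as earlier:

plot_predictions(mod, condition = "Age",
                 vcov = vcov(mod), points = .25)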

Also make a residual plot using scatter.smooth(). Does it look better than in the previous model? Are you happy with it? (If not, you might try adding a third-degree term.)
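
A sketch, reusing the earlier residual-plot pattern:

scatter.smooth(predict(mod), residuals(mod))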

Putting it all together

If you have time (and nothing better to do), try exploring the dataset further.

There are other, perhaps important, variables that we haven’t used in our regression models (Gender and Ht). Just for fun, try fitting a final model including all the variables, as sketched below. (You can skip non-linear and interaction terms. Or not – it’s up to you!)
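
One possibility (just a sketch – the model name mod_full and the choice of terms are up to you):

mod_full = glmmTMB(FEV ~ Age + Ht + Gender + smoke_fac,
                   data = lungcap)
summary(mod_full)
scatter.smooth(predict(mod_full), residuals(mod_full))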

How does the residual plot look now?