library(GLMsData) # Dataset package
data(lungcap)
Module 2: Linear models
Introduction
In this module, we will look at a dataset on lung capacity and smoking in young people. We will try to estimate the mean lung capacity as a function of age, smoking and a few other variables.
A proper analysis of these data requires domain expertise and detailed information on how the data were collected. We will instead perform some very simple analyses, just to get a feel for how linear models work – and to learn how to fit linear regression models using the glmmTMB
package.
The data
In our data, lung capacity is measured by forced expiratory volume (FEV). For this exercise, we only need to know that higher values are better.
The data are stored in the lungcap
dataset in the GLMsData
package (loaded at the start of this module).
This is a cross-sectional dataset on 654 subjects (the output of nrow(lungcap)). Here are the first and last six rows:
head(lungcap)
Age FEV Ht Gender Smoke
1 3 1.072 46 F 0
2 4 0.839 48 F 0
3 4 1.102 48 F 0
4 4 1.389 48 F 0
5 4 1.577 49 F 0
6 4 1.418 49 F 0
tail(lungcap)
Age FEV Ht Gender Smoke
649 16 4.070 69.5 M 1
650 16 4.872 72.0 M 1
651 17 3.082 67.0 M 1
652 17 3.406 69.0 M 1
653 18 4.086 67.0 M 1
654 18 4.404 70.5 M 1
Type ?lungcap
to see some information on the dataset and the variables it contains.
Plotting the data
We will fit a regression model where we use a person’s age to predict their (expected) FEV. But before diving into the regression model, we should always plot our data:
library(ggplot2) # Visualisation package
ggplot(lungcap, aes(x = Age, y = FEV)) +
geom_point()
It looks like there’s a strong association between age and lung capacity. Perhaps a simple linear regression might fit?
We also see that the age values are recorded as integers, which leads to overplotting (many points on top of each other). It’s easier to see the amount of data we have and notice any patterns if we add a small amount of noise to the Age
variable:
# We’ll store the figure (for later use)
# as a variable named fig_fev
fig_fev = ggplot(lungcap, aes(x = Age, y = FEV)) +
  geom_jitter(position = position_jitter(width = .2))
plot(fig_fev)
A simple regression model
OK, let’s fit a linear regression model:
library(glmmTMB) # Regression model package
mod_age = glmmTMB(FEV ~ Age, data = lungcap)
# The function has default values,
# so the above is equivalent to
# glmmTMB(FEV ~ Age, disp = ~1, zi = ~0,
#         family = gaussian(), data = lungcap)
summary(mod_age)
Family: gaussian ( identity )
Formula: FEV ~ Age
Data: lungcap
AIC BIC logLik deviance df.resid
1119.0 1132.5 -556.5 1113.0 651
Dispersion estimate for gaussian family (sigma^2): 0.321
Conditional model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.431648 0.077776 5.55 2.86e-08 ***
Age 0.222041 0.007507 29.58 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The coefficient of interest is Age
, and 0.22 is the slope of the regression line relating age to FEV. In other words, subjects of a given age have an estimated mean FEV that is 0.22 higher than that of subjects one year younger. The small P-value (< 2 × 10⁻¹⁶) shows that the slope is statistically significant.
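To make the interpretation concrete, here is a small arithmetic sketch using the coefficient values copied from the summary() output above; moving from age 10 to age 11 changes the predicted mean FEV by exactly the slope:

```r
# Coefficients copied from the summary() output above
b0 = 0.431648 # (Intercept)
b1 = 0.222041 # Age

b0 + b1 * 10 # predicted mean FEV at age 10
b0 + b1 * 11 # predicted mean FEV at age 11: exactly b1 = 0.22 higher
```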
The data are from a cross-sectional study. That is, subjects of different ages were measured at a single point in time. There is only one observation for each subject; the subjects were not followed over time (that would be a longitudinal study, which we would have to analyse with a different type of regression model).
So based on just these data, we can’t really say that FEV increases by 0.22 for each year. (But if the distribution of age and height and the relationship between these variables were stable over time, that would be the implication.)
Let’s extract the intercept and slope and use them to add the regression line to the plot.
coef_fev = fixef(mod_age)$cond
coef_fev
(Intercept) Age
0.4316481 0.2220410
fig_fev +
  geom_abline(intercept = coef_fev[1],
              slope = coef_fev[2],
              col = "red", linewidth = 1)
For more complicated models, it’s easier to let the computer add the regression line(s) automatically:
library(marginaleffects)
plot_predictions(mod_age, condition = "Age",
vcov = FALSE, # See note below
points = .25) # Semi-transparent points
Due to a possible bug when calculating standard errors, plot_predictions()
currently refuses to plot confidence intervals for glmmTMB
models by default, and we need to use the vcov = FALSE
argument to avoid an error message.
However, for linear models, it’s actually safe to use the standard errors, and we can change the argument to vcov = vcov(mod_age)
(or whatever the name of the model is) to display confidence intervals. We will do this in the following plot_predictions()
figures.
Model assumptions and residuals
Linear regression models rely on several assumptions:
- The observations are independent.
- The formula relating the explanatory variables (here: just age) to the mean outcome (here: FEV) is correct. That is, there’s a ‘straight line’ relationship between age and mean FEV.
- The outcomes are normally distributed with identical standard deviations (homoscedasticity) conditional on the explanatory variables (or really on the linear predictor).
The first assumption typically follows from the design of the study and the way the data are collected. For this exercise, we’ll just assume it’s true.
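As a sketch of what the assumptions mean, here is simulated data that satisfies all three by construction (the coefficient and variance values are only illustrative, loosely matching the fitted model above):

```r
set.seed(1)
n = 200
age = runif(n, 3, 18)  # explanatory variable
mu = 0.43 + 0.22 * age # assumption 2: straight-line mean
fev_sim = mu + rnorm(n, sd = sqrt(0.32)) # assumptions 1 and 3: independent,
                                         # normal, constant variance
```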
To check the second assumption, we can plot the residuals (real FEV minus estimated mean FEV for each subject) against the estimated FEV. We also add a nonparametric smoother:
scatter.smooth(predict(mod_age), residuals(mod_age))
If the functional form of the relationship between the explanatory variables and the outcome is correct, the residuals should be centred around 0. The smooth line should then look (approximately) like a horizontal line (‘y = 0’).
There are also plots we can use to check if the residuals look normally distributed (QQ plots), but we’ll skip this for now and come back to such plots in a later module.
Residual plots for each explanatory variable
In general, we should also plot the residuals against each explanatory variable.
Since there’s only a single explanatory variable, Age
, plotting the residuals against age is equivalent to plotting the residuals against the predicted values (only the labelling on the axis changes):
scatter.smooth(lungcap$Age, residuals(mod_age))
This second code example works fine here, but is in general not safe. If there are missing data (NA
values) in any of the variables included in the model, the output of the residuals(…)
call will include fewer observations than the number of rows in the dataset (NA
values are skipped). This means that the two vectors (lungcap$Age
and residuals(mod)
) won’t match up. There are ways to avoid this problem, which we will look at in a later module.
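A minimal sketch of the mismatch, using a toy dataset and lm() (glmmTMB() drops NA rows in the same way):

```r
# One NA in the predictor: the model silently drops that row
d = data.frame(x = c(1, 2, NA, 4, 5),
               y = c(1.1, 2.0, 2.9, 4.2, 5.1))
m = lm(y ~ x, data = d)
length(residuals(m)) # 4 residuals ...
nrow(d)              # ... but 5 rows: d$x and residuals(m) don't align
```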
In regression, different people and textbooks use different terminology. The following terms mean (more or less) the same thing:
- outcome = response (variable) = dependent variable
- explanatory variable = predictor = covariate = independent variable = feature
We will use the terms interchangeably.
Smoking as a predictor
We will now look at smoking. We might expect that smokers have lower lung capacity than non-smokers. Let’s check that, using a simple linear model.
The Smoke
variable is coded as 0
for non-smokers and 1
for smokers. We can fit a regression model using this variable directly. But for categorical variables with more than two possible values (levels), we need to code the variable as a factor. So we’ll do this here too. It will also result in nicer labels in tables and plots.
lungcap$smoke_fac = factor(lungcap$Smoke,
                           levels = c(0, 1),
                           labels = c("Non-smoker", "Smoker"))
head(lungcap)
Age FEV Ht Gender Smoke smoke_fac
1 3 1.072 46 F 0 Non-smoker
2 4 0.839 48 F 0 Non-smoker
3 4 1.102 48 F 0 Non-smoker
4 4 1.389 48 F 0 Non-smoker
5 4 1.577 49 F 0 Non-smoker
6 4 1.418 49 F 0 Non-smoker
The first level (here: "Non-smoker"
) is automatically treated as the reference/baseline level in regression models.
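To see what this means in practice, we can inspect the design matrix a two-level factor generates (a self-contained sketch using base R): the reference level is absorbed into the intercept, and the other level gets a 0/1 indicator column.

```r
# A small factor like smoke_fac, with "Non-smoker" as the first level
smoke = factor(c(0, 1, 1, 0), levels = c(0, 1),
               labels = c("Non-smoker", "Smoker"))
model.matrix(~ smoke) # columns: (Intercept) and smokeSmoker (0/1 indicator)
```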
We fit a similar model as before:
mod_smoke = glmmTMB(FEV ~ smoke_fac, data = lungcap)
summary(mod_smoke)
Family: gaussian ( identity )
Formula: FEV ~ smoke_fac
Data: lungcap
AIC BIC logLik deviance df.resid
1633.8 1647.2 -813.9 1627.8 651
Dispersion estimate for gaussian family (sigma^2): 0.705
Conditional model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.56614 0.03461 74.15 < 2e-16 ***
smoke_facSmoker 0.71072 0.10977 6.47 9.52e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Try also fitting a model with Smoke
(instead of smoke_fac
) as a predictor. Verify that you get identical coefficient estimates.
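To see why the estimates must be identical, here is a sketch with simulated data and lm(): the 0/1 dummy column generated by the factor is numerically the same as the original 0/1 variable.

```r
set.seed(42)
x = rbinom(100, 1, 0.3) # 0/1 numeric predictor, like Smoke
y = 2 + 0.7 * x + rnorm(100)

coef(lm(y ~ x))         # numeric 0/1 predictor
coef(lm(y ~ factor(x))) # same estimates; only the coefficient name changes
```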
Stratifying by smoking status
Let’s look at the data again, but now separately for smokers and non-smokers:
fig_fev %+%
  lungcap + # Updated version containing the ‘smoke_fac’ variable
  aes(colour = smoke_fac)
Do you now understand why you got the (misleading) coefficients in the earlier smoking-only model?
Smoking and age as predictors
Age affects a subject’s probability of being a smoker, and age also affects the subject’s FEV. This means that age is a confounder for the effect of smoking on FEV.
Which again means that we have to adjust for age when trying to figure out the effect of smoking on FEV. We can do this in various ways, for example using regression.
Fit a linear regression model with both smoking and age as explanatory factors. Name the model mod_smoke_age
. The right-hand side of the regression formula should look like this: smoke_fac + Age
Make sure you can interpret the coefficients. Plot the two regression lines estimated by the model using a command like this:
plot_predictions(mod_smoke_age, condition = c("Age", "smoke_fac"),
vcov = vcov(mod_smoke_age), points = .25)
Also try running the same plot_predictions()
command, but now with condition = c("smoke_fac", "Age")
.
(For more information on plot_predictions()
and its various options, see the official documentation.)
Interactions
In the previous model, we assumed that the effect of age was the same for both smokers and non-smokers. Is it?
We can check this by fitting a model with an interaction for age and smoking. Since there are few smokers in the younger age groups (and to make the plots look less confusing) we’ll fit the models only on subjects older than 10 years:
lungcap_older = subset(lungcap, Age > 10)
mod_interac = glmmTMB(FEV ~ Age * smoke_fac, data = lungcap_older)
# This is the same as
# glmmTMB(FEV ~ Age + smoke_fac + Age:smoke_fac,
# data = lungcap_older)
summary(mod_interac)
Family: gaussian ( identity )
Formula: FEV ~ Age * smoke_fac
Data: lungcap_older
AIC BIC logLik deviance df.resid
564.8 582.7 -277.4 554.8 259
Dispersion estimate for gaussian family (sigma^2): 0.479
Conditional model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.09369 0.34016 3.215 0.0013 **
Age 0.17930 0.02699 6.642 3.09e-11 ***
smoke_facSmoker 1.29696 0.69312 1.871 0.0613 .
Age:smoke_facSmoker -0.11261 0.05074 -2.219 0.0265 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The P-value for the Age:smoke_facSmoker
interaction indicates that there’s an interaction. The effect of age on FEV depends on whether you’re a smoker or a non-smoker. The coefficient is negative, which means that the slope of the age line is less steep for smokers. In other words, if you’re a smoker, being older doesn’t help much when it comes to lung function!
Or, equivalently, we can say that the effect of smoking differs between subjects of different ages. For a smoker, the deficit in expected lung function compared to a similarly-aged non-smoker grows the older you are.
All this is easier to see on a plot:
plot_predictions(mod_interac, condition = c("Age", "smoke_fac"),
vcov = vcov(mod_interac), points = .25)
Main effects in interaction models
It’s very tempting to interpret the P-values for the main effects (Age
and smoke_facSmoker
) in the interaction model as evidence for or against there being an ‘age’ or ‘smoking’ effect. Resist the temptation! These P-values seldom make sense. They are technically correct, but they usually don’t test reasonable/useful hypotheses.
When we have an interaction between two variables, x
and z
, the coefficient for the ‘main effect’ of x
measures the effect of x
on the outcome when z
is equal to its reference level (for categorical variables) or equal to 0 (for continuous variables).
So here, the Age
coefficient (0.18) is the slope of the regression line for a non-smoker. And the smoke_facSmoker
coefficient (1.30) is the effect of smoking for a person aged 0 years. (We’re now extrapolating far outside the area where we have data!)
In general, if there’s a statistically significant interaction term, this always indicates that there’s an effect of both variables (here age and smoking). But the effect of the first one depends on the value of the second one, and vice versa.
To illustrate how reference levels affect the coefficients, let’s create a new smoking variable, smoke_fac_rev
, where we use smoker instead of non-smoker as the reference level:
lungcap_older$smoke_fac_rev = relevel(lungcap_older$smoke_fac, "Smoker")
Fit a similar interaction model as before but with smoke_fac_rev
instead of smoke_fac
.
mod_interac2 = glmmTMB(FEV ~ Age * smoke_fac_rev,
                       data = lungcap_older)
summary(mod_interac2)
Family: gaussian ( identity )
Formula: FEV ~ Age * smoke_fac_rev
Data: lungcap_older
AIC BIC logLik deviance df.resid
564.8 582.7 -277.4 554.8 259
Dispersion estimate for gaussian family (sigma^2): 0.479
Conditional model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.39065 0.60391 3.959 7.54e-05 ***
Age 0.06669 0.04297 1.552 0.1206
smoke_fac_revNon-smoker -1.29696 0.69312 -1.871 0.0613 .
Age:smoke_fac_revNon-smoker 0.11261 0.05074 2.219 0.0265 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Compare the coefficients and P-values for the main effects and interaction with the corresponding coefficients and P-values in the previous interaction model.
Also plot the model to confirm that you get the exact same regression lines as before. The model is the same; it’s only the parameterisation that’s different.
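As an arithmetic check on the reparameterisation, the Age coefficient in mod_interac2 (the slope for smokers) should equal the Age coefficient plus the interaction coefficient from mod_interac (values copied from the two summary() outputs above):

```r
# Slope of the age line for smokers, computed two equivalent ways
0.17930 + (-0.11261) # Age + Age:smoke_facSmoker from mod_interac
0.06669              # Age coefficient from mod_interac2: the same value
```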
Non-linear effects
Let’s ignore smoking for a while. Recall the residual plot for our simple ‘age predicting FEV’ model:
scatter.smooth(predict(mod_age), residuals(mod_age))
The residuals were mostly centred around 0, but for high fitted values (in this simple model corresponding to high Age
values), there was some evidence of a ‘drop’ (negative residuals).
Since the residuals are the real FEV values minus the predicted FEV values, it looks like the model overestimates the mean FEV value for older subjects.
Some naïve thoughts:
It makes sense that the lung capacity increases with age. As you get older, your entire body grows, including your lungs and chest cavity. So we might expect better lung function as you age. But in the later teenage years, you are almost fully grown, and we might expect a plateau.
Our dataset also includes a variable for body height (Ht
, measured in inches), so we can check when the young people in the study have reached their maximum height (and therefore perhaps their maximum FEV?). We’ll also add a simple non-parametric smoother to more clearly see the pattern:
ggplot(lungcap, aes(x = Age, y = Ht)) +
geom_jitter(position = position_jitter(width = .2)) +
geom_smooth()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
The height seems to have plateaued at around 14–15 years of age. So perhaps the FEV will have a similar plateau, meaning that a linear model is not adequate?
We will check this by fitting a model with a quadratic effect of age:
mod = glmmTMB(FEV ~ Age + I(Age^2), data = lungcap)
summary(mod)
Family: gaussian ( identity )
Formula: FEV ~ Age + I(Age^2)
Data: lungcap
AIC BIC logLik deviance df.resid
1102.8 1120.7 -547.4 1094.8 650
Dispersion estimate for gaussian family (sigma^2): 0.312
Conditional model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.360359 0.199329 -1.808 0.0706 .
Age 0.385707 0.038733 9.958 < 2e-16 ***
I(Age^2) -0.007764 0.001804 -4.305 1.67e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note. The ^
symbol has a special meaning in formulas, so we have to wrap it in an I()
function for it to retain its mathematical meaning.
Even with a second-degree term, this model is – perhaps confusingly – still called a linear regression model. A linear regression model is linear in the parameters (i.e., the coefficients). But the predictors can have arbitrary transformations, like in the above model.
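One way to read the quadratic fit: a parabola y = b0 + b1*x + b2*x^2 with b2 < 0 has its maximum at x = -b1 / (2 * b2). With the coefficients copied from the summary() output above, the fitted curve peaks around age 25, well outside the observed ages (so this is extrapolation), but it is consistent with the flattening we saw in the late teens:

```r
# Coefficients copied from the summary() output above
b1 = 0.385707  # Age
b2 = -0.007764 # I(Age^2)
-b1 / (2 * b2) # age at which the fitted parabola peaks (about 24.8)
```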
(You don’t have to hand in your answers for the following.)
Plot the estimated regression line using plot_predictions(). Does it look reasonable / does it fit the data?
Also make a residual plot using scatter.smooth()
. Does it look better than in the previous model? Are you happy with it? (If not, you might try adding a third-degree term.)
Putting it all together
If you have time (and nothing better to do), try exploring the dataset further.
There are other and perhaps important variables that we haven’t used in our regression model (Gender
and Ht
). Just for fun, try fitting a final model including all the variables. (You can skip non-linear and interaction terms. Or not – it’s up to you!)
How does the residual plot look now?