Module 3: GLMs in general

Author

Karl Ove Hufthammer

Introduction

Generalised linear models (GLMs) generalise the linear (Gaussian) regression models that we looked at in the last module. But now

  • the conditional probability distribution of the outcome doesn’t have to be Gaussian (it can even be discrete), and
  • the association between the predictors and the mean outcome doesn’t have to be linear.

As the name implies, there’s still a linear term involved, but we now have a non-linear function linking this linear term to the mean outcome. This function is called the link function.

In later modules, we’ll look at regression models for count and binomial models. Here, we’ll just look at simple intercept-only models for various probability distributions and using various link functions.

A simple Gaussian model

First, we fit a Gaussian model on simulated data. We simulate 30 observations from a normal distribution with mean 15 and standard deviation 3.

set.seed(17771855) # Reproducible random numbers
dat = data.frame(y = rnorm(30, mean = 15, sd = 3))

The estimated mean, standard deviation and variance (= squared standard deviation) are:

mean(dat$y)
[1] 14.62195
sd(dat$y)
[1] 3.15433
var(dat$y)
[1] 9.949798

The glmmTMB() function can be used as a general parameter estimator function for the supported models if we fit a model without any explanatory variables, i.e., with only an intercept parameter:

library(glmmTMB)
mod = glmmTMB(y ~ 1, data = dat, family = gaussian())
summary(mod)
 Family: gaussian  ( identity )
Formula:          y ~ 1
Data: dat

     AIC      BIC   logLik deviance df.resid 
   157.0    159.8    -76.5    153.0       28 


Dispersion estimate for gaussian family (sigma^2): 9.62 

Conditional model:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  14.6220     0.5662   25.82   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We recognize the Intercept estimate (14.6220) as the sample mean. And the dispersion estimate (9.62, which we can extract using sigma(mod)^2) is very close to the sample variance (9.95). The reason for the difference is that glmmTMB() by default uses maximum likelihood (ML) estimation. The ML estimator of the variance is

\[\frac{1}{n}\sum_{i=1}^n (Y_i-\bar{Y})^2,\]

while the sample variance is defined as

\[\frac{1}{n-1}\sum_{i=1}^n (Y_i-\bar{Y})^2.\]

For moderate to large sample sizes, the difference is negligible. And in general, estimates and tests from glmmTMB models are based on asymptotics, i.e., they are large-sample methods.

Exercise 1

Estimate the variance using the first formula (where you divide by n instead of by n − 1) and confirm that you get the same estimate as in the model (sigma(mod)^2).

Estimating the Poisson mean

Let’s simulate some data from a Poisson distribution with mean 2:

set.seed(17811840)
dat = data.frame(y = rpois(200, lambda = 2))

The Poisson distribution is a discrete distribution with possible outcomes 0, 1, 2, …:

table(dat$y)

 0  1  2  3  4  5  6  7  9 
31 48 49 35 25  6  4  1  1 
library(ggplot2)
ggplot(dat, aes(x = y)) + stat_count()

Each family of distributions that the glmmTMB package supports have a default link function.

Exercise 2

Use ?family_glmmTMB and ?family to see a list of supported distributions and their default link functions. What is the default link function for the Poisson family (the family function is named poisson()).

We fit a simple Poisson model with only an intercept.

mod = glmmTMB(y ~ 1, data = dat, family = poisson())
summary(mod)
 Family: poisson  ( log )
Formula:          y ~ 1
Data: dat

     AIC      BIC   logLik deviance df.resid 
   722.3    725.6   -360.1    720.3      199 


Conditional model:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.74432    0.04874   15.27   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Exercise 3

Use your knowledge of the link function and the above model summary to estimate the mean parameter in the Poisson distribution. (You have to back-transform the coefficient estimate.)

Your answer should be close to the real value, 2.

You can also use the confint() function to extract confidence intervals for parameter estimates.

Exercise 4

Extract a confidence interval for the mean parameter.

Transforming the outcome vs. transforming the mean parameter

Recall the ‘FEV predicted by age’ data and model in module 2:

library(GLMsData)
data(lungcap)
fig_fev = ggplot(lungcap, aes(x = Age, y = FEV)) +
  geom_jitter(position = position_jitter(width = .2)) +
  geom_smooth(method = "lm", se = FALSE)
plot(fig_fev)
`geom_smooth()` using formula = 'y ~ x'

For low values of Age, there is little scatter (low variance) around the blue regression line, while for high values, there is much scatter (high variance). In other words, we have heteroscedasticity.

It’s perhaps easier to see this pattern from a residual plot from the simple linear regression model:

library(GLMsData) # Dataset package
data(lungcap)
mod_age = glmmTMB(FEV ~ Age, data = lungcap)
scatter.smooth(predict(mod_age), residuals(mod_age))

(There is also some evidence of a non-linear association between age and mean FEV.)

We’ll look at two different transformations. We can either transform the actual outcome values or we can transform the mean outcome (using a link function). In the first case, we still have a linear model (but for a different / modified outcome variable). In the second case, we have generalised linear model (but with the original outcome variable)

Transforming the outcome variable

When we have heteroscedasticity where the variance increases with mean, transforming the outcome values can sometimes help. Let’s try log-transforming the FEV values (we can do this since all FEV values are positive):

lungcap$log_fev = log(lungcap$FEV)

And look at the data:

fig_log_fev = ggplot(lungcap, aes(x = Age, y = log_fev)) +
  geom_jitter(position = position_jitter(width = .2)) +
  geom_smooth(se = FALSE)
plot(fig_log_fev)
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

The variance around the blue line now seems to have stabilised; it’s about the same for all Age values. This is confirmed by a residual plot:

mod_log_y = glmmTMB(log_fev ~ Age, data = lungcap)
scatter.smooth(predict(mod_log_y), residuals(mod_log_y))

(The plot shows strong signs of a non-linear effect of age on log-FEV, but we’ll ignore this for now.)

In the original model, we had

\[E(Y) = \mu_Y = \beta_0 + \beta_1\times Age\]

(with \(Y\) being normally distributed).

Now we have

\[E(\log Y) = \mu_{\log Y} = \beta_0 + \beta_1\times Age\]

(with \(\log Y\) being normally distributed, i.e., \(Y\) being log-normally distributed).

Transforming the mean/expected outcome

We could also try log-transforming the mean instead (this is a GLM model):

\[\log(\mu_Y) = \beta_0 + \beta_1\times Age\]

(with \(Y\) being normally distributed).

mod_log_mu = glmmTMB(FEV ~ Age, data = lungcap,
                     family = gaussian(link = "log"))

Transforming \(Y\) and transforming its mean \(\mu\) seems similar. And the predictions are quite similar:

d_age = data.frame(Age = seq(3, 18, length = 50))
d_age$pred_log_y = exp(predict(mod_log_y, newdata = d_age))
d_age$pred_log_mu = predict(mod_log_mu, newdata = d_age, type = "response")
ggplot(lungcap, aes(x = Age, y = FEV)) +
  geom_point() +
  geom_line(data = d_age, aes(y = pred_log_y), colour = "maroon") +
  geom_line(data = d_age, aes(y = pred_log_mu), colour = "darkgreen")

But the residual plots show that only the model where we transform the actual outcome has fixed the heteroscedasticity problem (though not the non-linear association):

scatter.smooth(predict(mod_log_y), residuals(mod_log_y),
               main = "Model based on log Y")

scatter.smooth(predict(mod_log_mu), residuals(mod_log_mu),
               main = "Model based on log E(Y)")

The other model, with family = gaussian(link= "log") assumes that the residuals are normally distributed with constant variance around the mean function. Which we see is not true.

Note

In this example, transforming the outcome variable and then fitting a linear model gave the best fit. For other data and models, using a link function and not transforming the outcome variable gives a better fit.

Exercise 5

Try fitting a Gaussian linear model with log-FEV as the outcome and a third-degree polynomial of age as the linear predictor. Draw a residual plot. Does the model fit the data well?

A GLM model for the gamma distribution

The gamma distribution is a two-parameter continuous distribution for positive numbers. It’s commonly used to model waiting times, which often have a right-skewed distribution:

curve(dgamma(x, shape = 1.7, rate = 0.5), 0, 15)

The theoretical mean is the ratio of the two parameters, \(\text{rate}/\text{shape} = 1.7/ 0.5 = 3.4\). We’ll try to estimate the mean by fitting an intercept-only model.

Let’s simulate a few observations from this distribution:

set.seed(271828)
dat = data.frame(y = rgamma(30, shape = 1.7, rate = 0.5))
mean(dat$y)
[1] 3.477209

Fit an intercept-only GLM with the gamma family and the default link function. Note that the family function for the gamma distribution is called Gamma() (not gamma()).

Note

The ‘dispersion estimate’ listed in the model output is an estimate of the reciprocal of the shape parameter (1 / shape).

Exercise 6

Use the fitted model to estimate the mean and calculate a 95% confidence interval for the mean. Remember to take the transformation done by the link function into account.

(Look at the help page for ?Gamma for information on the default link function. If you have saved the model as mod, you can also use mod$modelInfo$family$linkfun and mod$modelInfo$family$linkinv to extract the link and inverse link function.)

The confidence intervals calculated by confint() for glmmTMB models by default are asymptotic intervals. They are Wald confidence intervals, basically equal to the estimate (on the link scale) ± 1.96 standard errors.

We can get better confidence intervals by using the argument method = "profile" in confint(). This will calculate profile likelihood confidence intervals, which are more accurate (especially if we use non-default link functions, where the sample distribution of the coefficient estimates on the link scale can be very asymmetric).

Exercise 7

Calculate the profile likelihood confidence interval for the mean.