set.seed(17771855) # Reproducible random numbers
dat = data.frame(y = rnorm(30, mean = 15, sd = 3))
Module 3: GLMs in general
Introduction
Generalised linear models (GLMs) generalise the linear (Gaussian) regression models that we looked at in the last module. But now
- the conditional probability distribution of the outcome doesn’t have to be Gaussian (it can even be discrete), and
- the association between the predictors and the mean outcome doesn’t have to be linear.
As the name implies, there’s still a linear term involved, but we now have a non-linear function linking this linear term to the mean outcome. This function is called the link function.
In later modules, we’ll look at regression models for count and binomial data. Here, we’ll just look at simple intercept-only models for various probability distributions, using various link functions.
A simple Gaussian model
First, we fit a Gaussian model on simulated data. We simulate 30 observations from a normal distribution with mean 15 and standard deviation 3.
The estimated mean, standard deviation and variance (= squared standard deviation) are:
mean(dat$y)
[1] 14.62195
sd(dat$y)
[1] 3.15433
var(dat$y)
[1] 9.949798
The glmmTMB() function can be used as a general parameter estimator for the supported models if we fit a model without any explanatory variables, i.e., with only an intercept parameter:
library(glmmTMB)
mod = glmmTMB(y ~ 1, data = dat, family = gaussian())
summary(mod)
Family: gaussian ( identity )
Formula: y ~ 1
Data: dat
AIC BIC logLik deviance df.resid
157.0 159.8 -76.5 153.0 28
Dispersion estimate for gaussian family (sigma^2): 9.62
Conditional model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 14.6220 0.5662 25.82 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We recognise the Intercept estimate (14.6220) as the sample mean. And the dispersion estimate (9.62, which we can extract using sigma(mod)^2) is very close to the sample variance (9.95). The reason for the difference is that glmmTMB() by default uses maximum likelihood (ML) estimation. The ML estimator of the variance is
\[\frac{1}{n}\sum_{i=1}^n (Y_i-\bar{Y})^2,\]
while the sample variance is defined as
\[\frac{1}{n-1}\sum_{i=1}^n (Y_i-\bar{Y})^2.\]
For moderate to large sample sizes, the difference is negligible. And in general, estimates and tests from glmmTMB
models are based on asymptotics, i.e., they are large-sample methods.
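As a quick check of this relationship (a sketch; n and the rescaled variance are computed here only for illustration):
n = nrow(dat)
sigma(mod)^2            # ML estimate of the variance reported by glmmTMB
var(dat$y) * (n - 1)/n  # sample variance rescaled by (n - 1)/n: should match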
A Gaussian model with a different link function
In generalised linear models, the mean doesn’t have to be a linear function of the parameters, \(E(Y) = \mu = \pmb{Xb}\). Instead, we relate the mean to the linear term through a link function \(g\): \(g(\mu) = \pmb{Xb}\).
For Gaussian models, it’s natural to use the identity function, \(g(\mu) = \mu\), as the link function. But we can use other functions, like \(g(\mu) = 1/\mu\):
mod_inv = glmmTMB(y ~ 1, data = dat,
                  family = gaussian(link = "inverse"))
summary(mod_inv)
Family: gaussian ( inverse )
Formula: y ~ 1
Data: dat
AIC BIC logLik deviance df.resid
157.0 159.8 -76.5 153.0 28
Dispersion estimate for gaussian family (sigma^2): 9.62
Conditional model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.068390 0.002647 25.84 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We then have to back-transform our estimate to recover the mean. With the link function \(g\) and linear predictor \(\eta = g(\mu) = 1/\mu\), it’s easy to show that here (but not in general) the inverse of the link function is identical to the link function itself, \(g^{-1}(\eta) = 1/\eta\). We can use this to calculate the mean:
mu_inv = fixef(mod_inv)$cond
1/mu_inv
(Intercept)
14.62195
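Alternatively, we can let predict() do the back-transformation for us by asking for predictions on the response scale (a sketch; the first fitted value is shown since every observation has the same mean in an intercept-only model):
predict(mod_inv, type = "response")[1]  # back-transformed mean, same as 1/mu_inv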
Estimating the Poisson mean
Let’s simulate some data from a Poisson distribution with mean 2:
set.seed(17811840)
dat = data.frame(y = rpois(200, lambda = 2))
The Poisson distribution is a discrete distribution with possible outcomes 0, 1, 2, …:
table(dat$y)
0 1 2 3 4 5 6 7 9
31 48 49 35 25 6 4 1 1
library(ggplot2)
ggplot(dat, aes(x = y)) + stat_count()
Each family of distributions that the glmmTMB package supports has a default link function.
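One way to check a family’s default link is to inspect the family object itself (these are the standard family constructors that we pass to glmmTMB):
gaussian()$link  # "identity"
poisson()$link   # "log"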
We fit a simple Poisson model with only an intercept.
mod = glmmTMB(y ~ 1, data = dat, family = poisson())
summary(mod)
Family: poisson ( log )
Formula: y ~ 1
Data: dat
AIC BIC logLik deviance df.resid
722.3 725.6 -360.1 720.3 199
Conditional model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.74432 0.04874 15.27 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
You can also use the confint() function to extract confidence intervals for parameter estimates.
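For example, since the Poisson family here uses a log link, we can back-transform the intercept and its interval to the mean scale (a sketch; the result should be close to mean(dat$y)):
exp(fixef(mod)$cond)  # estimated mean on the original scale
exp(confint(mod))     # confidence interval, back-transformed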
Transforming the outcome vs. transforming the mean parameter
Recall the ‘FEV predicted by age’ data and model in module 2:
library(GLMsData)
data(lungcap)
fig_fev = ggplot(lungcap, aes(x = Age, y = FEV)) +
  geom_jitter(position = position_jitter(width = .2)) +
  geom_smooth(method = "lm", se = FALSE)
plot(fig_fev)
`geom_smooth()` using formula = 'y ~ x'
For low values of Age, there is little scatter (low variance) around the blue regression line, while for high values, there is much scatter (high variance). In other words, we have heteroscedasticity.
It’s perhaps easier to see this pattern in a residual plot from the simple linear regression model:
library(GLMsData) # Dataset package
data(lungcap)
mod_age = glmmTMB(FEV ~ Age, data = lungcap)
scatter.smooth(predict(mod_age), residuals(mod_age))
(There is also some evidence of a non-linear association between age and mean FEV.)
We’ll look at two different transformations. We can either transform the actual outcome values or we can transform the mean outcome (using a link function). In the first case, we still have a linear model (but for a different / modified outcome variable). In the second case, we have a generalised linear model (but with the original outcome variable).
Transforming the outcome variable
When we have heteroscedasticity where the variance increases with the mean, transforming the outcome values can sometimes help. Let’s try log-transforming the FEV values (we can do this since all FEV values are positive):
lungcap$log_fev = log(lungcap$FEV)
And look at the data:
fig_log_fev = ggplot(lungcap, aes(x = Age, y = log_fev)) +
  geom_jitter(position = position_jitter(width = .2)) +
  geom_smooth(se = FALSE)
plot(fig_log_fev)
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
The variance around the blue line now seems to have stabilised; it’s about the same for all Age values. This is confirmed by a residual plot:
mod_log_y = glmmTMB(log_fev ~ Age, data = lungcap)
scatter.smooth(predict(mod_log_y), residuals(mod_log_y))
(The plot shows strong signs of a non-linear effect of age on log-FEV, but we’ll ignore this for now.)
In the original model, we had
\[E(Y) = \mu_Y = \beta_0 + \beta_1\times Age\]
(with \(Y\) being normally distributed).
Now we have
\[E(\log Y) = \mu_{\log Y} = \beta_0 + \beta_1\times Age\]
(with \(\log Y\) being normally distributed, i.e., \(Y\) being log-normally distributed).
Transforming the mean/expected outcome
We could also try log-transforming the mean instead (this is a GLM model):
\[\log(\mu_Y) = \beta_0 + \beta_1\times Age\]
(with \(Y\) being normally distributed).
mod_log_mu = glmmTMB(FEV ~ Age, data = lungcap,
                     family = gaussian(link = "log"))
Transforming \(Y\) and transforming its mean \(\mu\) may seem similar. And the predictions are indeed quite similar:
d_age = data.frame(Age = seq(3, 18, length = 50))
d_age$pred_log_y = exp(predict(mod_log_y, newdata = d_age))
d_age$pred_log_mu = predict(mod_log_mu, newdata = d_age, type = "response")
ggplot(lungcap, aes(x = Age, y = FEV)) +
  geom_point() +
  geom_line(data = d_age, aes(y = pred_log_y), colour = "maroon") +
  geom_line(data = d_age, aes(y = pred_log_mu), colour = "darkgreen")
But the residual plots show that only the model where we transform the actual outcome has fixed the heteroscedasticity problem (though not the non-linear association):
scatter.smooth(predict(mod_log_y), residuals(mod_log_y),
main = "Model based on log Y")
scatter.smooth(predict(mod_log_mu), residuals(mod_log_mu),
main = "Model based on log E(Y)")
The other model, with family = gaussian(link = "log"), assumes that the residuals are normally distributed with constant variance around the mean function, which we can see is not the case.
In this example, transforming the outcome variable and then fitting a linear model gave the best fit. For other data and models, using a link function and not transforming the outcome variable can give a better fit.
A GLM model for the gamma distribution
The gamma distribution is a two-parameter continuous distribution for positive numbers. It’s commonly used to model waiting times, which often have a right-skewed distribution:
curve(dgamma(x, shape = 1.7, rate = 0.5), 0, 15)
The theoretical mean is the ratio of the two parameters, \(\text{shape}/\text{rate} = 1.7/0.5 = 3.4\). We’ll try to estimate the mean by fitting an intercept-only model.
Let’s simulate a few observations from this distribution:
set.seed(271828)
dat = data.frame(y = rgamma(30, shape = 1.7, rate = 0.5))
mean(dat$y)
[1] 3.477209
Fit an intercept-only GLM with the gamma family and the default link function. Note that the family function for the gamma distribution is called Gamma() (not gamma()).
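One possible sketch of such a fit (mod_gamma is a hypothetical name, and we assume the default link of base R’s Gamma() family, the inverse link, so the intercept is back-transformed with 1/x):
mod_gamma = glmmTMB(y ~ 1, data = dat, family = Gamma())
summary(mod_gamma)
1 / fixef(mod_gamma)$cond  # back-transformed intercept: the estimated mean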
The ‘dispersion estimate’ listed in the model output is an estimate of the reciprocal of the shape parameter (1 / shape).
The confidence intervals calculated by confint() for glmmTMB models are by default asymptotic intervals. They are Wald confidence intervals, basically equal to the estimate (on the link scale) ± 1.96 standard errors.
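To make ‘estimate ± 1.96 standard errors’ concrete, here is a sketch using the hypothetical mod_gamma fit above (est and se are introduced here just for illustration):
est = fixef(mod_gamma)$cond[1]
se = sqrt(diag(vcov(mod_gamma)$cond))[1]
est + c(-1.96, 1.96) * se  # should match the default confint() interval (link scale)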
We can get better confidence intervals by using the argument method = "profile" in confint(). This will calculate profile likelihood confidence intervals, which are more accurate (especially if we use non-default link functions, where the sampling distribution of the coefficient estimates on the link scale can be very asymmetric).
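Continuing the sketch above, the two types of intervals can be compared directly:
confint(mod_gamma)                      # Wald intervals (default)
confint(mod_gamma, method = "profile")  # profile likelihood intervals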