library(GLMsData)
library(glmmTMB)
library(marginaleffects)
data(lungcap)
= glmmTMB(FEV ~ Age, data = lungcap)
mod_lin plot_predictions(mod_lin, "Age",
vcov = vcov(mod_lin), points = .3)
Module 8: Dispersion modelling
Introduction
Let’s revisit our example from module 2 on lung capacity and smoking in young people. Recall that we seemed to have non-constant error variance (heteroscedasticity). We will fit a dispersion model to take this into account.
The original model
We start with our original, linear model.
We have earlier look at the residual plot, which indicated some problems:
- A non-linear effect
- The variance increasing with age
scatter.smooth(predict(mod_lin), residuals(mod_lin))
DHARMa standard residual plots
An alternative visualisation for model diagnostics is the DHARMa standard residual plots:
library(DHARMa)
= simulateResiduals(mod_lin)
sim_resids plot(sim_resids)
See the DHARMAr vignette for an introduction. But briefly, simulateResiduals()
simulates residuals from the fitted model (just like we did for the dispersion test). The plot()
function then draws a couple of figures and runs a few tests on these residuals. If the model fits, the data in the figures should show some specific patterns and the P-values from the tests shouldn’t be too small.
The plot on the left is a QQ plot. If the residuals have the correct distribution (here: a Gaussian distribution) the points should lie (approximately) on the diagonal line. Here they almost do, so that’s good.
The plot on the right shows DHARMa residuals against the (rank-transformed) predicted values. They are similar to the residuals from our original plot, but are transformed so that if the model fits, they should have a uniform distribution on 0–1 (conditional on the model predictions, i.e., a uniform distribution in the vertical direction on the plot for each predicted value on the ‘x’ axis).
The red lines show the estimated quartiles (first quartile, median, third quartile). If the model fits, they should be horizontal, black lines centred at 0.25, 0.50 and 0.75, respectively. Since they’re clearly non-horizontal (and red), this indicates a nonlinearity; the functional form of the regression is not correct.
The red stars correspond to observations that are deemed outliers; they have improbably large or small values given the model. They occur on the right side of the panel, which is also in line with our original residual plot (the variation around the fitted line seemed to be higher for higher predicted values). We also see that the variation in the residuals is too low on the left of the plot; the points don’t fill the entire 0–1 range.
In summary, the model doesn’t fit very well.
Why use DHARMa plots instead of normal residual plots?
An advantage of DHARMa standard residual plots compared to normal residual plots is that they work for all types of models, like GLMs and GLMMs, not just linear models (LMs).
We have heteroscedasticity, and we might suspect that a dispersion test would have detected this. However, it doesn’t:
testDispersion(sim_resids)
DHARMa nonparametric dispersion test via sd of residuals fitted vs.
simulated
data: simulationOutput
dispersion = 0.99925, p-value = 0.992
alternative hypothesis: two.sided
The reason is basically this: The SD for the error term is low for low values of age, and high for high values of age. The model estimates a constant SD, which will fall someplace in the middle. And the SD of the actual residuals will naturally fall around the middle of the SDs from the simulated residuals (from a model which assumes a constant variance).
This example shows that a dispersion test can’t detect all forms of heteroscedasticity. Plotting DHARMa standard residual plots will be useful.
A non-linear model
We saw a non-linear pattern in the residuals. We’ll just apply a quick fix – fit the model with age as a third-degree polynomial.
= update(mod_lin, . ~ poly(Age, 3))
mod_nonlin plot_predictions(mod_nonlin, "Age",
vcov = vcov(mod_nonlin), points = .3)
The median of the DHARMa residuals looks much better (and the black colour indicates that it is no longer statistically significantly different from a horizontal line centred at 0.50):
= simulateResiduals(mod_nonlin)
sim_resids plot(sim_resids)
But we still have too many outliers, too few points near 0 and 1 for low predicted values, and the 1st and 3rd quartiles are clearly not horizontal. All this indicates that the error SD is non-constant and increases with the predicted value.
A simple dispersion model
A dispersion model to the rescue! The glmmTMB()
function fits a model for the log of the variance (or, more generally, for the log of the dispersion parameter):
\[\log(\sigma^2) = 2\log(\sigma)\]
Since the error SD increases with age, let’s try a simple linear model:
\[\log{\sigma^2} = \beta_0 + \beta_1 \cdot \text{Age}\]
We use the dispformula
argument:
= update(mod_nonlin, dispformula = ~ Age)
mod_disp summary(mod_disp)
Family: gaussian ( identity )
Formula: FEV ~ poly(Age, 3)
Dispersion: ~Age
Data: lungcap
AIC BIC logLik deviance df.resid
966.7 993.6 -477.4 954.7 648
Conditional model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.63652 0.02207 119.47 < 2e-16 ***
poly(Age, 3)1 16.71131 0.75145 22.24 < 2e-16 ***
poly(Age, 3)2 -2.60048 0.80021 -3.25 0.00116 **
poly(Age, 3)3 -2.57368 0.57800 -4.45 8.48e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Dispersion model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.58516 0.20624 -17.38 <2e-16 ***
Age 0.22224 0.02001 11.11 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Except for a few remaining outliers, the model now looks perfect:
= simulateResiduals(mod_disp)
sim_resids plot(sim_resids)
This means that we can trust the estimates and confidence intervals for the parameters, and we can trust prediction intervals generated from the model.
The coefficient estimates are not too different in this new model compared to the mod_nonlin
model. And fitting a linear model based on an assumption of a constant error term SD is actually valid as long as the functional form is correct (in that we get consistent parameter estimates – estimates that converge to the real values as the number of observations tends to infinity).
But it’s not an efficient use of the data. We get unnecessarily unprecise estimates (and unreliable confidence intervals).
Since we assume that the SD of the error term is constant, all observations basically get equal weight when estimating the coefficient parameters. With the dispersion model, on the other hand, ‘imprecise’ observations (here corresponding to the older subjects) will get less weight than ’precise‘ observations, and this gives us more precise parameter estimates.
More dispersion parameters
Even though the model seemed perfect, there is some evidence of the dispersion differing by gender (but not by smoking status):
plotResiduals(sim_resids, lungcap$Gender, rank = FALSE)
plotResiduals(sim_resids, lungcap$Smoke, rank = FALSE)
(This is probably because with didn’t include gender as a predictor in the fixed effects in the model, something we should have done. But for this example, we’ll pretend that it’s a real problem.)
Dispersion models for other distribution families
It’s also possible to fit dispersion models for GLMs, e.g., for negative binomial models. Then the dispersion formula estimates the log of the dispersion parameter (for the Gaussian distribution, the dispersion parameter is the variance). See the ‘Details’ section under the ?family_glmmTMB
help page for more information.
However, for linear mixed-effects models, modelling the dispersion is problematic, as the dispersion formula will compete with the variances for the random effects.