Module 8: Dispersion modelling

Author

Karl Ove Hufthammer

Introduction

Let’s revisit our example from module 2 on lung capacity and smoking in young people. Recall that we seemed to have non-constant error variance (heteroscedasticity). We will fit a dispersion model to take this into account.

The original model

We start with our original, linear model.

library(GLMsData)
library(glmmTMB)
library(marginaleffects)
data(lungcap)
mod_lin = glmmTMB(FEV ~ Age, data = lungcap)
plot_predictions(mod_lin, "Age",
                 vcov = vcov(mod_lin), points = .3)

We have earlier look at the residual plot, which indicated some problems:

  1. A non-linear effect
  2. The variance increasing with age
scatter.smooth(predict(mod_lin), residuals(mod_lin))

DHARMa standard residual plots

An alternative visualisation for model diagnostics is the DHARMa standard residual plots:

library(DHARMa)
sim_resids = simulateResiduals(mod_lin)
plot(sim_resids)

See the DHARMAr vignette for an introduction. But briefly, simulateResiduals() simulates residuals from the fitted model (just like we did for the dispersion test). The plot() function then draws a couple of figures and runs a few tests on these residuals. If the model fits, the data in the figures should show some specific patterns and the P-values from the tests shouldn’t be too small.

The plot on the left is a QQ plot. If the residuals have the correct distribution (here: a Gaussian distribution) the points should lie (approximately) on the diagonal line. Here they almost do, so that’s good.

The plot on the right shows DHARMa residuals against the (rank-transformed) predicted values. They are similar to the residuals from our original plot, but are transformed so that if the model fits, they should have a uniform distribution on 0–1 (conditional on the model predictions, i.e., a uniform distribution in the vertical direction on the plot for each predicted value on the ‘x’ axis).

The red lines show the estimated quartiles (first quartile, median, third quartile). If the model fits, they should be horizontal, black lines centred at 0.25, 0.50 and 0.75, respectively. Since they’re clearly non-horizontal (and red), this indicates a nonlinearity; the functional form of the regression is not correct.

The red stars correspond to observations that are deemed outliers; they have improbably large or small values given the model. They occur on the right side of the panel, which is also in line with our original residual plot (the variation around the fitted line seemed to be higher for higher predicted values). We also see that the variation in the residuals is too low on the left of the plot; the points don’t fill the entire 0–1 range.

In summary, the model doesn’t fit very well.

Why use DHARMa plots instead of normal residual plots?

An advantage of DHARMa standard residual plots compared to normal residual plots is that they work for all types of models, like GLMs and GLMMs, not just linear models (LMs).

Note

We have heteroscedasticity, and we might suspect that a dispersion test would have detected this. However, it doesn’t:

testDispersion(sim_resids)


    DHARMa nonparametric dispersion test via sd of residuals fitted vs.
    simulated

data:  simulationOutput
dispersion = 0.99925, p-value = 0.992
alternative hypothesis: two.sided

The reason is basically this: The SD for the error term is low for low values of age, and high for high values of age. The model estimates a constant SD, which will fall someplace in the middle. And the SD of the actual residuals will naturally fall around the middle of the SDs from the simulated residuals (from a model which assumes a constant variance).

This example shows that a dispersion test can’t detect all forms of heteroscedasticity. Plotting DHARMa standard residual plots will be useful.

A non-linear model

We saw a non-linear pattern in the residuals. We’ll just apply a quick fix – fit the model with age as a third-degree polynomial.

mod_nonlin = update(mod_lin, . ~ poly(Age, 3))
plot_predictions(mod_nonlin, "Age",
  vcov = vcov(mod_nonlin), points = .3)

The median of the DHARMa residuals looks much better (and the black colour indicates that it is no longer statistically significantly different from a horizontal line centred at 0.50):

sim_resids = simulateResiduals(mod_nonlin)
plot(sim_resids)

But we still have too many outliers, too few points near 0 and 1 for low predicted values, and the 1st and 3rd quartiles are clearly not horizontal. All this indicates that the error SD is non-constant and increases with the predicted value.

A simple dispersion model

A dispersion model to the rescue! The glmmTMB() function fits a model for the log of the variance (or, more generally, for the log of the dispersion parameter):

\[\log(\sigma^2) = 2\log(\sigma)\]

Since the error SD increases with age, let’s try a simple linear model:

\[\log{\sigma^2} = \beta_0 + \beta_1 \cdot \text{Age}\]

We use the dispformula argument:

mod_disp = update(mod_nonlin, dispformula = ~ Age)
summary(mod_disp)
 Family: gaussian  ( identity )
Formula:          FEV ~ poly(Age, 3)
Dispersion:           ~Age
Data: lungcap

     AIC      BIC   logLik deviance df.resid 
   966.7    993.6   -477.4    954.7      648 


Conditional model:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)    2.63652    0.02207  119.47  < 2e-16 ***
poly(Age, 3)1 16.71131    0.75145   22.24  < 2e-16 ***
poly(Age, 3)2 -2.60048    0.80021   -3.25  0.00116 ** 
poly(Age, 3)3 -2.57368    0.57800   -4.45 8.48e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Dispersion model:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.58516    0.20624  -17.38   <2e-16 ***
Age          0.22224    0.02001   11.11   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Exercise 1

Based on the table for the Dispersion model in the model output, does there seem to be an effect of age on the SD (error term SD)?

Exercise 2
  1. What is the estimated error term SD based on the mod_nonlin model?

  2. What is the corresponding SD for observations from 10-year-old subjects based on the dispersion coefficients from the mod_disp model?

Note that the mean age in the dataset is approx. 10, so your answer to a) and b) should be similar. (Also remember to take the square root of any variances extracted.)

  1. What is the corresponding SD for observations from 18-year-old subjects based on the mod_disp model?

Except for a few remaining outliers, the model now looks perfect:

sim_resids = simulateResiduals(mod_disp)
plot(sim_resids)

This means that we can trust the estimates and confidence intervals for the parameters, and we can trust prediction intervals generated from the model.

Note

The coefficient estimates are not too different in this new model compared to the mod_nonlin model. And fitting a linear model based on an assumption of a constant error term SD is actually valid as long as the functional form is correct (in that we get consistent parameter estimates – estimates that converge to the real values as the number of observations tends to infinity).

But it’s not an efficient use of the data. We get unnecessarily unprecise estimates (and unreliable confidence intervals).

Since we assume that the SD of the error term is constant, all observations basically get equal weight when estimating the coefficient parameters. With the dispersion model, on the other hand, ‘imprecise’ observations (here corresponding to the older subjects) will get less weight than ’precise‘ observations, and this gives us more precise parameter estimates.

More dispersion parameters

Even though the model seemed perfect, there is some evidence of the dispersion differing by gender (but not by smoking status):

plotResiduals(sim_resids, lungcap$Gender, rank = FALSE)

plotResiduals(sim_resids, lungcap$Smoke, rank = FALSE)

(This is probably because with didn’t include gender as a predictor in the fixed effects in the model, something we should have done. But for this example, we’ll pretend that it’s a real problem.)

Exercise 3

Add gender and smoking as additional predictors in (only) the dispersion part of the model.

Based on model estimates

  1. Does gender seem to have an effect on the dispersion?
  2. Does smoking seem to have an effect on the dispersion?
Exercise 4

Refit the model, but now include only age and gender as predictors of the dispersion parameter. Call the model mod_disp_age_gender.

Run the following code to show how the variance varies with age and gender:

library(marginaleffects)
plot_predictions(mod_disp_age_gender,
  condition = c("Age", "Gender"),
  type = "disp", vcov = FALSE
)

Dispersion models for other distribution families

It’s also possible to fit dispersion models for GLMs, e.g., for negative binomial models. Then the dispersion formula estimates the log of the dispersion parameter (for the Gaussian distribution, the dispersion parameter is the variance). See the ‘Details’ section under the ?family_glmmTMB help page for more information.

However, for linear mixed-effects models, modelling the dispersion is problematic, as the dispersion formula will compete with the variances for the random effects.