Module 9: Beta regression

Author

Karl Ove Hufthammer

Introduction

The beta distribution is a flexible continuous distribution defined on (0,1). It is typically used for modelling proportions.

The version implemented in glmmTMB takes two parameters:

  • μ (mu): mean parameter
  • φ (or ϕ, phi): dispersion parameter
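
In terms of the two standard shape parameters of the beta distribution, this corresponds to shape1 = μφ and shape2 = (1 − μ)φ, so that the mean is exactly μ and the density is

\[f(y; \mu, \varphi) = \frac{\Gamma(\varphi)}{\Gamma(\mu\varphi)\,\Gamma((1-\mu)\varphi)}\, y^{\mu\varphi - 1} (1-y)^{(1-\mu)\varphi - 1}, \qquad 0 < y < 1\]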

Varying these parameters, we get a great variety of density functions:

library(tidyverse)
d_par_grid = expand_grid(
  mu = c(0.2, .3, .5, .6, .9),
  phi = c(.9, 2, 3, 8, 20)
)

calc_density_beta = function(mu, phi) {
  # dbeta() uses a different parameterisation,
  # with two shape parameters:
  # shape1 = mu * phi and shape2 = (1 - mu) * phi
  shape1 = mu * phi
  shape2 = phi - shape1
  x = seq(0.0001, .9999, length.out = 200)
  y = dbeta(x, shape1, shape2)
  tibble(x, y)
}

# Calculate densities for all parameter combinations
d_densities = d_par_grid %>%
  mutate(dens = map2(mu, phi, calc_density_beta)) %>%
  unnest(dens)

# Plot the densities
ggplot(d_densities, aes(x = x, y = y)) +
  geom_line() +
  facet_wrap(~ phi + mu,
    scales = "free",
    ncol = n_distinct(d_par_grid$mu),
    labeller = label_both
  ) +
  theme_light() +
  theme(
    panel.grid.minor = element_blank(),
    panel.grid.major = element_blank()
  )

Note that for a given μ parameter, increasing the dispersion parameter φ decreases the variance. At the same time, it makes the distribution more symmetric (except when μ = ½, where the distribution is already symmetric).

For a given φ parameter, the variance is greatest for μ = ½ and decreases as μ tends towards 0 or 1.
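
Both variance statements follow directly from the variance formula given in ?family_glmmTMB:

\[\operatorname{Var}(Y) = \frac{\mu(1-\mu)}{1+\varphi}\]

Note that despite its name, the ‘dispersion’ parameter φ acts as a precision parameter: larger φ means smaller variance.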

Exercise 1

Examine the density plots. One of them is equal to the density plot for a uniform distribution on (0,1) (where all values of x are ‘equally likely’). Which set of parameters for the beta distribution corresponds to the uniform distribution?

Exercise 2

Based on your answer to the previous question and the information in ?family_glmmTMB, what is the variance of the uniform distribution on (0,1)?

Example data

We’ll look at a dataset on reading accuracy for dyslexic and non-dyslexic Australian children. As always, we first plot the data:

data("ReadingSkills", package = "betareg")
library(ggplot2)
ggplot(
  ReadingSkills,
  aes(x = iq, y = accuracy, colour = dyslexia)
) +
  geom_point(size = 3)

As expected, dyslexia is clearly associated with reading accuracy. There also seems to be evidence that IQ is associated with accuracy: children with higher IQ have on average higher reading accuracy (this is true at least for the non-dyslexic children – perhaps we need an interaction term?).

Fitting a simple beta regression

Reading accuracy is recorded on a 0–1 scale, so it might be a good fit for a beta regression. We also note that when the mean is close to 1 (the red points in the upper right corner), the variance is small, which is consistent with the beta distribution.

We fit a simple beta regression with an interaction between dyslexia and IQ. Due to some special (contrast) coding of the dyslexia variable, we’ll first create a new dataset where this variable has a normal factor coding. (This doesn’t really affect the model, just the coefficient parameterisation.)

d_reading = ReadingSkills
d_reading$dyslexia = factor(d_reading$dyslexia)

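To see exactly what changed, you can inspect the two codings with the base-R contrasts() function. (The non-default coding of the original variable comes from the betareg package; after factor(), we’re back to the default treatment coding, with ‘no’ as the reference level.)

contrasts(ReadingSkills$dyslexia) # special coding from the betareg package
contrasts(d_reading$dyslexia) # default treatment coding ('no' = reference)
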
library(glmmTMB)
mod = glmmTMB(accuracy ~ dyslexia * iq,
  data = d_reading, family = beta_family()
)
summary(mod)
 Family: beta  ( logit )
Formula:          accuracy ~ dyslexia * iq
Data: d_reading

     AIC      BIC   logLik deviance df.resid 
   -92.7    -83.8     51.4   -102.7       39 


Dispersion parameter for beta family (): 11.1 

Conditional model:
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)      2.3074     0.2070  11.149  < 2e-16 ***
dyslexiayes     -1.9473     0.2659  -7.323 2.43e-13 ***
iq               0.3793     0.1973   1.923   0.0545 .  
dyslexiayes:iq  -0.4371     0.2565  -1.704   0.0884 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Exercise 3

Create a DHARMa standard residual plot for the model. Does the model seem to fit the data?

library(DHARMa)
plot(simulateResiduals(mod))

Modelling the dispersion

With only 44 datapoints, it’s difficult to interpret residual plots. The smoothed residual quartile lines may appear very wiggly (non-horizontal) just because they’re based on so little data.

But at least one of the lines was statistically significantly different from a horizontal line (since it’s displayed in red), and the points don’t really seem to fill the (0,1) × (0,1) square. This is evidence of the variance not being consistent with a beta distribution that has a fixed dispersion parameter φ.

Luckily, we can also fit a model for the dispersion parameter.

Exercise 4

Extend the above model by also modelling the dispersion parameter. Use the same predictors (main effects and interaction) as when modelling the mean. Name the new model mod_disp.
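
Hint: in glmmTMB, a model for the dispersion parameter is specified via the dispformula argument. A sketch of one possible call (consistent with the ‘Dispersion: ~dyslexia * iq’ line in the model output shown later in this module):

mod_disp = glmmTMB(accuracy ~ dyslexia * iq,
  dispformula = ~ dyslexia * iq,
  data = d_reading, family = beta_family()
)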

Exercise 5

Plot a DHARMa residual plot for the new model. Does the model seem to fit the data?

Exercise 6

Is there (strong) evidence of an interaction effect between IQ and dyslexia on a) the mean accuracy scores and/or b) the dispersion parameter (on their respective link scales)?

Plotting the predicted values

We can also easily plot the predicted values:

library(marginaleffects)
plot_predictions(mod_disp,
  condition = c("iq", "dyslexia"),
  points = 1, vcov = FALSE
)

The model seems to give reasonable estimates. (We ignore the extrapolation of the red line to IQ values where we have no data!)

Confidence intervals for the predicted means

Since plot_predictions() with vcov = TRUE doesn’t currently work for glmmTMB models, we have to calculate and plot confidence intervals for the predicted means manually.

First, we need the standard errors of the predictions. The help file ?predict.glmmTMB explains that we can use the se.fit argument for this:

preds = predict(mod_disp, se.fit = TRUE, type = "response")
preds
$fit
mu_predict mu_predict mu_predict mu_predict mu_predict mu_predict mu_predict 
 0.9462755  0.9256211  0.9126854  0.9655535  0.6605412  0.6204097  0.7763743 
mu_predict mu_predict mu_predict mu_predict mu_predict mu_predict mu_predict 
 0.5785609  0.8311079  0.9519341  0.9172459  0.9692039  0.9256211  0.9875914 
mu_predict mu_predict mu_predict mu_predict mu_predict mu_predict mu_predict 
 0.7449215  0.9256211  0.8311079  0.9852830  0.9126854  0.9825276  0.9655535 
mu_predict mu_predict mu_predict mu_predict mu_predict mu_predict mu_predict 
 0.7960712  0.7763743  0.9256211  0.9860893  0.5958226  0.5974620  0.6105164 
mu_predict mu_predict mu_predict mu_predict mu_predict mu_predict mu_predict 
 0.5999273  0.6121350  0.5875734  0.5792746  0.5684366  0.6193977  0.5974620 
mu_predict mu_predict mu_predict mu_predict mu_predict mu_predict mu_predict 
 0.6145684  0.6186046  0.6298096  0.6298096  0.6031928  0.6282200  0.6250123 
mu_predict mu_predict 
 0.6048221  0.6202105 

$se.fit
 mu_predict  mu_predict  mu_predict  mu_predict  mu_predict  mu_predict 
0.010597004 0.014602764 0.018018312 0.008388804 0.121579557 0.136968498 
 mu_predict  mu_predict  mu_predict  mu_predict  mu_predict  mu_predict 
0.072164982 0.151509239 0.048378790 0.009828339 0.016744236 0.008042230 
 mu_predict  mu_predict  mu_predict  mu_predict  mu_predict  mu_predict 
0.014602764 0.005502585 0.086000546 0.014602764 0.048378790 0.005976533 
 mu_predict  mu_predict  mu_predict  mu_predict  mu_predict  mu_predict 
0.018018312 0.006456222 0.008388804 0.063505620 0.072164982 0.014602764 
 mu_predict  mu_predict  mu_predict  mu_predict  mu_predict  mu_predict 
0.005819379 0.011729445 0.011833113 0.015558470 0.012169490 0.016271917 
 mu_predict  mu_predict  mu_predict  mu_predict  mu_predict  mu_predict 
0.012693070 0.015679479 0.021213501 0.019831594 0.011833113 0.017407235 
 mu_predict  mu_predict  mu_predict  mu_predict  mu_predict  mu_predict 
0.019420688 0.025523412 0.025523412 0.012918026 0.024627988 0.022844939 
 mu_predict  mu_predict 
0.013403973 0.020257052 

We can get pointwise approximate 95% confidence intervals by adding and subtracting 1.96 standard errors to/from the predictions (these are Wald confidence intervals):

conf_level = .95
q = qnorm((1 - conf_level) / 2, lower.tail = FALSE) # ~1.96

d_reading$pred = preds$fit
d_reading$se.fit = preds$se.fit
d_reading = mutate(d_reading,
  conf_lower = pred - q * se.fit,
  conf_higher = pred + q * se.fit
)

ggplot(
  d_reading,
  aes(
    x = iq, y = pred,
    ymin = conf_lower, ymax = conf_higher,
    colour = dyslexia, fill = dyslexia
  )
) +
  geom_ribbon(alpha = .2, colour = NA) +
  geom_line() +
  geom_point(aes(y = accuracy))

However, this isn’t a very good confidence interval. It’s a symmetric interval that assumes that the predictions are conditionally normally distributed. This is only true asymptotically. (For finite samples, we can even get lower/upper confidence limits outside the possible range of the outcome variable, i.e., here > 1 or < 0.)

We can get a better interval by doing the calculations on the link scale (where the predictions are typically closer to being normally distributed).
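
For the logit link used here, the backtransformation is the inverse-logit function, available in base R as plogis(). A minimal sketch of the idea, with a made-up prediction of 2.3 on the logit scale and a standard error of 0.2:

# Wald interval on the link scale, then backtransformed to (0,1):
plogis(2.3 + c(-1.96, 0, 1.96) * 0.2)

The resulting interval is guaranteed to stay inside (0,1), and it is in general asymmetric around the backtransformed point estimate.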

Tip: A shortcut for adding the predictions (on the link scale) and their standard errors is to use the augment() function:

library(broom.mixed)
augment(mod_disp)
# A tibble: 44 × 6
   accuracy dyslexia     iq .fitted .se.fit   .resid
      <dbl> <fct>     <dbl>   <dbl>   <dbl>    <dbl>
 1    0.884 no        0.827   2.87    0.208 -0.0624 
 2    0.765 no        0.59    2.52    0.212 -0.160  
 3    0.915 no        0.471   2.35    0.226  0.00239
 4    0.984 no        1.14    3.33    0.252  0.0182 
 5    0.884 no       -0.676   0.666   0.542  0.223  
 6    0.709 no       -0.795   0.491   0.582  0.0886 
 7    0.771 no       -0.281   1.24    0.416 -0.00489
 8    0.99  no       -0.914   0.317   0.621  0.411  
 9    0.99  no       -0.043   1.59    0.345  0.159  
10    0.99  no        0.907   2.99    0.215  0.0381 
# ℹ 34 more rows

(This also adds the residuals in the last column.)

Exercise 7

Use the augment() function on the model, calculate 95% confidence intervals on the link scale and backtransform them to the response scale (0–1). Create a plot similar to the one above, showing the original data, the predicted means as a function of IQ and dyslexia, and 95% confidence intervals around the means.

(For this dataset and this model, the confidence intervals will actually be quite similar to the earlier confidence intervals. In general, they can be very different, especially for small datasets and very non-normal distributions.)

Note

Note that the intervals we have calculated are confidence intervals; they’re not prediction intervals. That is, they are intervals that are supposed to cover the mean response (conditional on IQ and dyslexia), not the individual responses. (And we see that many more than 5% of the individual responses are outside the confidence limits.)

Prediction intervals would be much wider, and their widths would not shrink towards zero as the number of observations increases.
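
If you do need prediction intervals, one simple approach is to simulate new responses from the fitted model and take pointwise quantiles. A minimal sketch (this ignores the uncertainty in the parameter estimates, so the intervals will be somewhat too narrow):

set.seed(1)
sims = simulate(mod_disp, nsim = 1000) # one column per simulated dataset
pred_int = t(apply(sims, 1, quantile, probs = c(0.025, 0.975)))
head(pred_int)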

Also note that even though the intervals are shown as bands here, they are pointwise confidence intervals. That is, they are valid for a fixed value of IQ and dyslexia, but they are not guaranteed to cover the entire mean function (at a confidence level of 95%).

To cover the entire mean function, we would need (simultaneous) confidence bands, which would be somewhat wider than the pointwise confidence intervals.
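
A crude but simple conservative option is a Bonferroni correction: replace the multiplier q used above with one adjusted for the number of evaluation points. The resulting band is wider than an exact simultaneous band. A sketch:

# Bonferroni-adjusted multiplier for all 44 evaluation points
# (conservative, i.e., wider than strictly necessary):
n_points = nrow(d_reading)
qnorm(1 - (1 - conf_level) / (2 * n_points)) # about 3.25, vs ~1.96 pointwise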

Scaling and squeezing outcome data

The beta distribution as implemented in glmmTMB is only defined on the open interval (0,1). However, we can also fit beta models for data defined on other intervals.

Example: we have a questionnaire with 10 questions, each giving a numeric score on a 1–4 scale. The total score is defined as the sum of the scores from the 10 questions, and will thus have a 10–40 range. We can transform this into a 0–30 range by subtracting 10. And then we can transform it into the 0–1 range required by the beta distribution by dividing the 0–30 score by 30.

In general, if an outcome variable y is defined on the range a–b, we can transform it into the 0–1 range using the formula (y − a) / (b − a).
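
For instance, with the questionnaire example above (a 10–40 total score) and some made-up scores:

total_score = c(10, 17, 25, 40) # made-up total scores on the 10–40 scale
(total_score - 10) / (40 - 10)
[1] 0.0000000 0.2333333 0.5000000 1.0000000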

There is, however, one small problem remaining. The beta distribution is defined on the open interval (0,1), not the closed interval [0,1]. But this is easily fixed by slightly compressing (squeezing) the range to exclude 0 and 1. We can use the following formula:

\[y \cdot k + (1-k)/2\]

Here k is the compression factor, e.g., 0.99 or 0.999. Example:

y = c(0, .2, .5, .8, 1)

squeeze = function(y, k) {
  y * k + (1 - k) / 2
}

squeeze(y, .99)
[1] 0.005 0.203 0.500 0.797 0.995
squeeze(y, .999)
[1] 0.0005 0.2003 0.5000 0.7997 0.9995

In their paper ‘A better lemon squeezer? Maximum-likelihood regression with beta-distributed dependent variables’ (2006), Smithson & Verkuilen recommend using \(k = \frac{n-1}{n}\), where \(n\) is the number of observations. For our small example vector, we get:

n = length(y)
squeeze(y, k = (n - 1) / n)
[1] 0.10 0.26 0.50 0.74 0.90
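
This is the same transformation as the form usually quoted from their paper, (y · (n − 1) + 0.5) / n:

(y * (n - 1) + 0.5) / n
[1] 0.10 0.26 0.50 0.74 0.90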

And in the reading accuracy dataset, the original accuracy score was also transformed into the (0,1) range by such a procedure. (The formula quoted in the ?ReadingSkills help page is wrong; it should say + 0.5 instead of − 0.5.)

Exercise 8

Define y as

y = seq(-50, 50, 10)

Assume that the possible range of y is [−50,50]. Transform the y values into the (0,1) range using Smithson & Verkuilen’s recommended squeeze factor.

Missing data

All the way back in module 2, we mentioned that running the residuals() function would not be safe if there were missing data. As promised, we finally show how to avoid the problem.

Let’s make one outcome and one predictor value missing, and then refit the model:

d_miss = d_reading
d_miss$accuracy[2] = NA
d_miss$iq[5] = NA
head(d_miss)
  accuracy dyslexia     iq      pred      se.fit conf_lower conf_higher
1  0.88386       no  0.827 0.9462755 0.010597004  0.9255058   0.9670453
2       NA       no  0.590 0.9256211 0.014602764  0.8970002   0.9542420
3  0.91508       no  0.471 0.9126854 0.018018312  0.8773701   0.9480006
4  0.98376       no  1.144 0.9655535 0.008388804  0.9491118   0.9819953
5  0.88386       no     NA 0.6605412 0.121579557  0.4222496   0.8988327
6  0.70905       no -0.795 0.6204097 0.136968498  0.3519564   0.8888630
mod_miss = update(mod_disp, data = d_miss)
summary(mod_miss)
 Family: beta  ( logit )
Formula:          accuracy ~ dyslexia * iq
Dispersion:                ~dyslexia * iq
Data: d_miss

     AIC      BIC   logLik deviance df.resid 
  -118.5   -104.6     67.3   -134.5       34 


Conditional model:
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)      1.6796     0.3344   5.023 5.09e-07 ***
dyslexiayes     -1.2987     0.3379  -3.843 0.000121 ***
iq               1.5441     0.2998   5.150 2.60e-07 ***
dyslexiayes:iq  -1.6303     0.3052  -5.342 9.20e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Dispersion model:
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)      1.1994     0.4021   2.983 0.002857 ** 
dyslexiayes      3.6083     0.6451   5.593 2.23e-08 ***
iq               2.3621     0.6360   3.714 0.000204 ***
dyslexiayes:iq  -1.5361     0.8631  -1.780 0.075123 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The model works fine; it’s just based on fewer observations:

nobs(mod_disp)
[1] 44
nobs(mod_miss)
[1] 42

But when we try to extract residuals and predictions, we get a problem:

resids = residuals(mod_miss)
preds = predict(mod_miss, type = "response")
length(resids)
[1] 42
length(preds)
[1] 42

We only get values for the observations without missing values. This means we can’t plot observed vs. predicted values using code like this (the two vectors now have different lengths):

plot(preds, d_miss$accuracy)

The solution is to use the na.action argument with the na.exclude value:

mod_miss_na = update(mod_miss, na.action = na.exclude)

The model will be the same, but the extractor functions will automatically insert NA values when necessary:

resids_na = residuals(mod_miss_na)
preds_na = predict(mod_miss_na, type = "response")
length(resids_na)
[1] 44
length(preds_na)
[1] 44
head(resids_na)
          1           2           3           4           5           6 
-0.06671317          NA -0.00227113  0.01464728          NA  0.09790715 
head(preds_na)
[1] 0.9505732        NA 0.9173511 0.9691127        NA 0.6111428
plot(preds_na, d_miss$accuracy)

Note that an NA value was inserted as the second element of preds_na, even though it was possible to predict a value there (only the outcome was missing, not the predictor). An NA value is inserted for all observations that were not used when fitting the model.

If we want to get a prediction even for this value, we can use the newdata argument and set its value to the same dataset used to fit the model.

preds_na_ok = predict(mod_miss_na, newdata = d_miss, type = "response")
head(preds_na_ok)
[1] 0.9505732 0.9302562 0.9173511 0.9691127        NA 0.6111428

Now only the fifth value is missing. (We can’t predict something if we’re missing the predictor.)

Note

Fitting a model with data that have missing values is OK if the data are missing purely by chance (‘missing completely at random’ – MCAR). But typically, they are not. Then things get difficult! Handling this properly would require a whole second course, but a user-friendly, R-centred introductory book is Flexible Imputation of Missing Data by Stef van Buuren, which you can read online for free.