Module 9: Beta regression

Introduction

The beta distribution is a flexible continuous distribution defined on (0,1). It is typically used for modelling proportions.

The version implemented in glmmTMB takes two parameters:

- μ (mu): mean parameter
- φ (or ϕ, phi): dispersion parameter

Varying these parameters, we get a great variety of density functions:

library(tidyverse)

d_par_grid = expand_grid(
  mu = c(0.2, .3, .5, .6, .9),
  phi = c(.9, 2, 3, 8, 20)
)

calc_density_beta = function(mu, phi) {
  # dbeta() uses a different parameterisation,
  # with two shape parameters
  shape1 = mu * phi
  shape2 = phi - shape1
  x = seq(0.0001, .9999, length = 200)
  y = dbeta(x, shape1, shape2)
  tibble(x, y)
}

# Calculate densities for all parameter combinations
d_densities = d_par_grid %>%
  mutate(dens = map2(mu, phi, calc_density_beta)) %>%
  unnest(dens)

# Plot the densities
ggplot(d_densities, aes(x = x, y = y)) +
  geom_line() +
  facet_wrap(~ phi + mu,
    scales = "free",
    ncol = n_distinct(d_par_grid$mu),
    labeller = label_both
  ) +
  theme_light() +
  theme(
    panel.grid.minor = element_blank(),
    panel.grid.major = element_blank()
  )
Note that for a given μ parameter, increasing the dispersion parameter φ decreases the variance. At the same time, it makes the distribution more symmetric (except when μ = ½, where the distribution is already symmetric).
For a given φ parameter, the variance is greatest for μ = ½ and decreases as μ tends towards 0 or 1.
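Both observations follow from the variance of this parameterisation (with shape parameters μφ and (1 − μ)φ, as in calc_density_beta() above):

\[\operatorname{Var}(y) = \frac{\mu(1-\mu)}{1+\varphi}\]

For a fixed μ, the variance shrinks as φ grows; for a fixed φ, the numerator μ(1 − μ) is largest at μ = ½.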
Example data
We’ll look at a dataset on reading accuracy for nondyslexic and dyslexic Australian children. As always, we first plot the data:

data("ReadingSkills", package = "betareg")

library(ggplot2)
ggplot(
  ReadingSkills,
  aes(x = iq, y = accuracy, colour = dyslexia)
) +
  geom_point(size = 3)
As expected, dyslexia is clearly associated with reading accuracy. There also seems to be evidence of IQ being associated with accuracy: Children with higher IQ have on average higher reading accuracy (this is true at least for the non-dyslexic children – perhaps we need an interaction term?).
Fitting a simple beta regression
Reading accuracy is recorded on a 0–1 scale, so it might be a good fit for a beta regression. We also note that when the mean is close to 1 (the red points in the upper right corner), the variance is small, which is consistent with the beta distribution.
We fit a simple beta regression with an interaction between dyslexia and IQ. Due to some special (contrast) coding of the dyslexia variable, we’ll first create a new dataset where this variable has a normal factor coding. (This doesn’t really affect the model, just the coefficient parameterisation.)

d_reading = ReadingSkills
d_reading$dyslexia = factor(d_reading$dyslexia)

library(glmmTMB)
mod = glmmTMB(accuracy ~ dyslexia * iq,
  data = d_reading, family = beta_family()
)
summary(mod)
Family: beta ( logit )
Formula: accuracy ~ dyslexia * iq
Data: d_reading
AIC BIC logLik deviance df.resid
-92.7 -83.8 51.4 -102.7 39
Dispersion parameter for beta family (): 11.1
Conditional model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.3074 0.2070 11.149 < 2e-16 ***
dyslexiayes -1.9473 0.2659 -7.323 2.43e-13 ***
iq 0.3793 0.1973 1.923 0.0545 .
dyslexiayes:iq -0.4371 0.2565 -1.704 0.0884 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
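Since the model uses a logit link, the coefficients are on the log-odds scale. As a quick sanity check (a sketch; iq appears to be standardised, so iq = 0 corresponds to an average IQ), we can back-transform selected coefficients with plogis(), the inverse logit:

# Predicted mean accuracy at iq = 0, back-transformed from the logit scale
plogis(2.3074)           # non-dyslexic children: ~0.91
plogis(2.3074 - 1.9473)  # dyslexic children: ~0.59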
library(DHARMa)
plot(simulateResiduals(mod))
Modelling the dispersion
With only 44 datapoints, it’s difficult to interpret residual plots. The smoothed residual quartile lines may appear very wiggly (non-horizontal) just because they’re based on so little data.
But at least one of the lines was statistically significantly different from a horizontal line (since it’s displayed in red), and the points don’t really seem to fill the (0,1) × (0,1) square. This is evidence of the variance not being consistent with a beta distribution that has a fixed dispersion parameter φ.
Luckily, we can also fit a model for the dispersion parameter.
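The fitting code for this model is not shown above; here is a sketch of what it could look like, assuming glmmTMB’s dispformula argument and the dispersion formula that appears in the later model summary (Dispersion: ~dyslexia * iq):

# Sketch: model the dispersion parameter too, using the same
# dyslexia-by-IQ interaction as in the conditional (mean) model
mod_disp = glmmTMB(accuracy ~ dyslexia * iq,
  dispformula = ~ dyslexia * iq,
  data = d_reading, family = beta_family()
)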
Plotting the predicted values
We can also easily plot the predicted values:
library(marginaleffects)
plot_predictions(mod_disp,
condition = c("iq", "dyslexia"),
points = 1, vcov = FALSE
)
The model seems to give reasonable estimates. (We ignore the extrapolation of the red line to IQ values where we have no data!)
Confidence intervals for the predicted means
As long as plot_predictions() with vcov = TRUE doesn’t work for glmmTMB models, we have to manually calculate and plot confidence intervals for the predicted means.

First, we need the standard errors for the predictions. The help file ?predict.glmmTMB explains that we can use the se.fit argument for this:

preds = predict(mod_disp, se.fit = TRUE, type = "response")
preds
$fit
mu_predict mu_predict mu_predict mu_predict mu_predict mu_predict mu_predict
0.9462755 0.9256211 0.9126854 0.9655535 0.6605412 0.6204097 0.7763743
mu_predict mu_predict mu_predict mu_predict mu_predict mu_predict mu_predict
0.5785609 0.8311079 0.9519341 0.9172459 0.9692039 0.9256211 0.9875914
mu_predict mu_predict mu_predict mu_predict mu_predict mu_predict mu_predict
0.7449215 0.9256211 0.8311079 0.9852830 0.9126854 0.9825276 0.9655535
mu_predict mu_predict mu_predict mu_predict mu_predict mu_predict mu_predict
0.7960712 0.7763743 0.9256211 0.9860893 0.5958226 0.5974620 0.6105164
mu_predict mu_predict mu_predict mu_predict mu_predict mu_predict mu_predict
0.5999273 0.6121350 0.5875734 0.5792746 0.5684366 0.6193977 0.5974620
mu_predict mu_predict mu_predict mu_predict mu_predict mu_predict mu_predict
0.6145684 0.6186046 0.6298096 0.6298096 0.6031928 0.6282200 0.6250123
mu_predict mu_predict
0.6048221 0.6202105
$se.fit
mu_predict mu_predict mu_predict mu_predict mu_predict mu_predict
0.010597004 0.014602764 0.018018312 0.008388804 0.121579557 0.136968498
mu_predict mu_predict mu_predict mu_predict mu_predict mu_predict
0.072164982 0.151509239 0.048378790 0.009828339 0.016744236 0.008042230
mu_predict mu_predict mu_predict mu_predict mu_predict mu_predict
0.014602764 0.005502585 0.086000546 0.014602764 0.048378790 0.005976533
mu_predict mu_predict mu_predict mu_predict mu_predict mu_predict
0.018018312 0.006456222 0.008388804 0.063505620 0.072164982 0.014602764
mu_predict mu_predict mu_predict mu_predict mu_predict mu_predict
0.005819379 0.011729445 0.011833113 0.015558470 0.012169490 0.016271917
mu_predict mu_predict mu_predict mu_predict mu_predict mu_predict
0.012693070 0.015679479 0.021213501 0.019831594 0.011833113 0.017407235
mu_predict mu_predict mu_predict mu_predict mu_predict mu_predict
0.019420688 0.025523412 0.025523412 0.012918026 0.024627988 0.022844939
mu_predict mu_predict
0.013403973 0.020257052
We can get pointwise approximate 95% confidence intervals by just adding or subtracting 1.96 standard errors to the predictions (we are then calculating Wald confidence intervals):
conf_level = .95
q = qnorm((1 - conf_level) / 2, lower.tail = FALSE) # ~1.96

d_reading$pred = preds$fit
d_reading$se.fit = preds$se.fit
d_reading = mutate(d_reading,
  conf_lower = pred - q * se.fit,
  conf_higher = pred + q * se.fit
)
ggplot(
  d_reading,
  aes(
    x = iq, y = pred,
    ymin = conf_lower, ymax = conf_higher,
    colour = dyslexia, fill = dyslexia
  )
) +
  geom_ribbon(alpha = .2, colour = NA) +
  geom_line() +
  geom_point(aes(y = accuracy))
However, this isn’t a very good confidence interval. It’s a symmetric interval that assumes that the predictions are conditionally normally distributed. This is only true asymptotically. (For finite samples, we can even get lower/upper confidence limits outside the possible range of the outcome variable, i.e., here > 1 or < 0.)
We can get a better interval by doing the calculations on the link scale (where the predictions are typically closer to being normally distributed).
Tip: A shortcut for adding the predictions (on the link scale) and their standard errors is to use the augment() function:

library(broom.mixed)
augment(mod_disp)
# A tibble: 44 × 6
accuracy dyslexia iq .fitted .se.fit .resid
<dbl> <fct> <dbl> <dbl> <dbl> <dbl>
1 0.884 no 0.827 2.87 0.208 -0.0624
2 0.765 no 0.59 2.52 0.212 -0.160
3 0.915 no 0.471 2.35 0.226 0.00239
4 0.984 no 1.14 3.33 0.252 0.0182
5 0.884 no -0.676 0.666 0.542 0.223
6 0.709 no -0.795 0.491 0.582 0.0886
7 0.771 no -0.281 1.24 0.416 -0.00489
8 0.99 no -0.914 0.317 0.621 0.411
9 0.99 no -0.043 1.59 0.345 0.159
10 0.99 no 0.907 2.99 0.215 0.0381
# ℹ 34 more rows
(This also adds the residuals in the last column.)
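Using these link-scale fits, here is a sketch of the improved interval calculation: compute the Wald interval on the logit scale and back-transform with plogis(), the inverse of the logit link (q is the normal quantile defined above):

d_link = augment(mod_disp) %>%
  mutate(
    pred = plogis(.fitted),
    conf_lower = plogis(.fitted - q * .se.fit),
    conf_higher = plogis(.fitted + q * .se.fit)
  )

The back-transformed interval is no longer symmetric around the prediction, but its limits are guaranteed to stay inside (0,1).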
Note that the intervals we have calculated are confidence intervals; they’re not prediction intervals. That is, they are intervals that are supposed to cover the mean response (conditional on IQ and dyslexia), not the individual responses. (And we see that many more than 5% of the individual responses are outside the confidence limits.)
Prediction intervals would be much wider, and their widths would not shrink towards zero as the number of observations increases.
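If we did want prediction intervals, one rough approach (a sketch, not part of this module’s workflow) is to simulate new responses from the fitted model and take pointwise quantiles. Note that this treats the estimated parameters as fixed, so the resulting intervals are somewhat too narrow:

# Simulate 1000 response vectors from the fitted model
sims = simulate(mod_disp, nsim = 1000)
# Pointwise 95% prediction limits for each observation
pred_lims = apply(as.matrix(sims), 1, quantile, probs = c(.025, .975))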
Also note that even though the intervals are shown as bands here, they are pointwise confidence intervals. That is, they are valid for a fixed value of IQ and dyslexia, but they are not guaranteed to cover the entire mean function (at a confidence level of 95%).
To cover the entire mean function, we would need (simultaneous) confidence bands, and they would be (typically only slightly) wider than the pointwise confidence intervals.
Scaling and squeezing outcome data
The beta distribution as implemented in glmmTMB
is only defined on the open interval (0,1). However, we can also fit beta models for data defined on other intervals.
Example: we have a questionnaire with 10 questions, each giving a numeric score on a 1–4 scale. The total score is defined as the sum of the scores from the 10 questions, and will thus have a 10–40 range. We can transform this into a 0–30 range by subtracting 10. And then we can transform it into the 0–1 range required by the beta distribution by dividing the 0–30 score by 30.
In general, if an outcome variable y is defined on the range a–b, we can transform it into the 0–1 range using the formula (y − a) / (b − a).
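As a minimal sketch (the helper name rescale01 is ours, not from any package):

# Map y from the interval [a, b] onto [0, 1]
rescale01 = function(y, a, b) (y - a) / (b - a)
rescale01(c(10, 25, 40), a = 10, b = 40) # the questionnaire example: 0.0 0.5 1.0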
There is, however, one small problem remaining. The beta distribution is defined on the open interval (0,1), not the closed interval [0,1]. But this is easily fixed by slightly compressing (squeezing) the range to exclude 0 and 1. We can use the following formula:
\[y \cdot k + (1-k)/2\] Here k is the compression factor, e.g., 0.99 or 0.999. Example:
y = c(0, .2, .5, .8, 1)

squeeze = function(y, k) {
  y * k + (1 - k) / 2
}

squeeze(y, .99)
[1] 0.005 0.203 0.500 0.797 0.995
squeeze(y, .999)
[1] 0.0005 0.2003 0.5000 0.7997 0.9995
In their paper A better lemon squeezer? Maximum-likelihood regression with beta-distributed dependent variables (2006), Smithson & Verkuilen recommend using \(k = \frac{n-1}{n}\), where \(n\) is the number of observations. For our small example vector, we get:
n = length(y)
squeeze(y, k = (n - 1) / n)
[1] 0.10 0.26 0.50 0.74 0.90
And in the reading accuracy dataset, the original accuracy score was also transformed into the (0,1) range by such a procedure. (The formula quoted in the ?ReadingSkills help page is wrong; it should say + 0.5 instead of - 0.5.)
Missing data
All the way back in module 2, we mentioned that running the residuals() function would not be safe if there were missing data. As promised, we finally show how to avoid the problem.
Let’s make one outcome and one predictor value missing, and then refit the model:
d_miss = d_reading
d_miss$accuracy[2] = NA
d_miss$iq[5] = NA
head(d_miss)
accuracy dyslexia iq pred se.fit conf_lower conf_higher
1 0.88386 no 0.827 0.9462755 0.010597004 0.9255058 0.9670453
2 NA no 0.590 0.9256211 0.014602764 0.8970002 0.9542420
3 0.91508 no 0.471 0.9126854 0.018018312 0.8773701 0.9480006
4 0.98376 no 1.144 0.9655535 0.008388804 0.9491118 0.9819953
5 0.88386 no NA 0.6605412 0.121579557 0.4222496 0.8988327
6 0.70905 no -0.795 0.6204097 0.136968498 0.3519564 0.8888630
mod_miss = update(mod_disp, data = d_miss)
summary(mod_miss)
Family: beta ( logit )
Formula: accuracy ~ dyslexia * iq
Dispersion: ~dyslexia * iq
Data: d_miss
AIC BIC logLik deviance df.resid
-118.5 -104.6 67.3 -134.5 34
Conditional model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.6796 0.3344 5.023 5.09e-07 ***
dyslexiayes -1.2987 0.3379 -3.843 0.000121 ***
iq 1.5441 0.2998 5.150 2.60e-07 ***
dyslexiayes:iq -1.6303 0.3052 -5.342 9.20e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Dispersion model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.1994 0.4021 2.983 0.002857 **
dyslexiayes 3.6083 0.6451 5.593 2.23e-08 ***
iq 2.3621 0.6360 3.714 0.000204 ***
dyslexiayes:iq -1.5361 0.8631 -1.780 0.075123 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The model works fine; it’s just based on fewer observations:
nobs(mod_disp)
[1] 44
nobs(mod_miss)
[1] 42
But when we try to extract residuals and predictions, we get a problem:
resids = residuals(mod_miss)
preds = predict(mod_miss, type = "response")
length(resids)
[1] 42
length(preds)
[1] 42
We only get values for the observations that don’t have missing values. This means we can’t plot observed vs. predicted values using code like this (the call fails because the two vectors now have different lengths, 42 vs. 44):
plot(preds, d_miss$accuracy)
The solution is to use the na.action argument with the na.exclude value:

mod_miss_na = update(mod_miss, na.action = na.exclude)
The model will be the same, but the extractor functions will automatically insert NA values when necessary:

resids_na = residuals(mod_miss_na)
preds_na = predict(mod_miss_na, type = "response")
length(resids_na)
[1] 44
length(preds_na)
[1] 44
head(resids_na)
1 2 3 4 5 6
-0.06671317 NA -0.00227113 0.01464728 NA 0.09790715
head(preds_na)
[1] 0.9505732 NA 0.9173511 0.9691127 NA 0.6111428
plot(preds_na, d_miss$accuracy)
Note that an NA value was inserted as the second element of preds_na, even though it would have been possible to predict a value here (only the outcome was missing, not the predictor). An NA value is inserted for all observations that were not used when fitting the model.
If we want a prediction even for this observation, we can use the newdata argument and set it to the same dataset used to fit the model:

preds_na_ok = predict(mod_miss_na, newdata = d_miss, type = "response")
head(preds_na_ok)
[1] 0.9505732 0.9302562 0.9173511 0.9691127 NA 0.6111428
Now only the fifth value is missing. (We can’t predict something if we’re missing the predictor.)
Fitting a model with data that have missing values is OK if the data are missing purely by chance (‘missing completely at random’ – MCAR). But typically, they are not. Then things get difficult! We would need a complete second course on how to handle it, but a user-friendly R-centred introductory book is Flexible Imputation of Missing Data by Stef van Buuren, which you can read online for free.