Module 5: Binomial models

Introduction

In this module, we will look at data where the outcome variable can only take two values. The values are typically called ‘success’ and ‘failure’, though ‘success’ can indicate a bad outcome, e.g., death.

Let’s load the necessary packages:

library(glmmTMB)
library(GLMsData)
library(ggplot2)
Our first dataset is from a small experiment on the effect of oxygen deprivation, which we know can cause loss of consciousness. Type ?gforces for details on the dataset.
We are interested in whether a subject’s age is related to the risk of showing signs of losing consciousness. The outcome is coded as 1 (showing signs) and 0 (not showing signs). As always, we first plot our data:
data(gforces)
fig_gforces = ggplot(gforces, aes(x = Age, y = Signs)) +
  geom_point(size = 3)
fig_gforces
There seems to be some (very weak) evidence of age having an effect. For example, the three oldest subjects all showed signs of losing consciousness during the experiment. (But the figure would be more convincing if the top-left observation was missing …)
Since the outcome variable (Signs) is coded 0/1, an estimate of the overall probability (risk) of showing signs of losing consciousness (for a population with this age distribution) is the mean of the outcome variable (why?):
mean(gforces$Signs) # ~63%
[1] 0.625
If the risk varies with age, we might be tempted to fit a simple linear regression model. The regression line would look like this:
fig_gforces +
  geom_smooth(method = "lm", se = FALSE)
`geom_smooth()` using formula = 'y ~ x'
There are several problems with this approach:
- The regression would predict a risk > 1 (which is impossible) for large values of age (e.g., for age > 55).
- The observations are certainly not normally distributed.
- We know that we have heteroscedasticity, since for a binary/Bernoulli variable with probability of ‘success’ p, the variance is p × (1 − p). So if p varies with age, so does the variance.
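The last point is easy to visualise: the Bernoulli variance p × (1 − p) peaks at 0.25 when p = 0.5 and vanishes at the extremes. A small illustration (not part of the original analysis):

```r
# Variance of a Bernoulli variable as a function of p.
# Maximum of 0.25 at p = 0.5; zero at p = 0 and p = 1.
curve(x * (1 - x), from = 0, to = 1,
      xlab = "Probability p", ylab = "Variance p(1 - p)")
```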
We can fix this by using the actual conditional distribution, the Bernoulli distribution, which is a special case of the binomial distribution (the binomial() family in glmmTMB):
mod_gforces = glmmTMB(Signs ~ Age,
                      data = gforces, family = binomial())
summary(mod_gforces)
Family: binomial ( logit )
Formula: Signs ~ Age
Data: gforces
AIC BIC logLik deviance df.resid
12.4 12.6 -4.2 8.4 6
Conditional model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.92086 2.64639 -1.104 0.270
Age 0.10569 0.08037 1.315 0.188
The effect of age was not statistically significant in this tiny dataset, but we can still plot the estimated regression line, which is now non-linear on the probability scale:
# Create a small ‘grid’ with possible ‘Age’ values
d_grid = data.frame(Age = seq(0, 60, length = 50))

# Calculate estimated probabilities
d_grid$risk_est = predict(mod_gforces,
                          newdata = d_grid, type = "response")

# Show the probabilities as a function of age
fig_gforces +
  geom_line(data = d_grid, aes(y = risk_est),
            colour = "blue", linewidth = 1)
The default link function for the binomial() family is the logit: \(\mathrm{logit}(p) = \log \frac{p}{1-p}\). This fits a logistic regression model.
The inverse of the logit is sometimes called the expit. (In R, the logit function is called qlogis(), and the expit plogis().)
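A quick sketch of the round trip between the two scales (the value 0.75 is chosen purely for illustration):

```r
# The logit maps probabilities (0, 1) to the whole real line
qlogis(0.75)          # log(0.75 / 0.25) = log(3) ≈ 1.10

# The expit maps back to the probability scale
plogis(qlogis(0.75))  # 0.75
```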
After back-transformation to the probability scale using the expit, the relationship between the predictor and the estimated probability will always have an ‘S’ shape (or an upside-down ‘S’ shape), like in the above figure, with asymptotes at \(p=0\) and \(p=1\).
Probability and odds
If the odds of an outcome is 3:1 (or just 3), this means that it’s 3 times as likely that the outcome occurs as it is that it doesn’t. The probability of the outcome is then 0.75 (since \(\frac{0.75}{1-0.75}=\frac{0.75}{0.25} = 3\)). In general, we can convert odds to probabilities using the formula \(p = \frac{o}{o + 1}\), and probabilities to odds using the formula \(o = \frac{p}{1-p}\). Odds > 1 corresponds to a probability > 0.5.
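The two conversion formulas can be written as small helper functions (the function names are hypothetical, for illustration only):

```r
# Convert a probability to odds: o = p / (1 - p)
prob_to_odds = function(p) p / (1 - p)

# Convert odds to a probability: p = o / (o + 1)
odds_to_prob = function(o) o / (o + 1)

prob_to_odds(0.75)  # 3
odds_to_prob(3)     # 0.75
```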
Odds ratios in logistic regression
As explained earlier, the logistic regression model is a linear function of the logit of the probability, i.e., a linear function of the log odds,
\[g(\mu_i) = g(p_i) = \mathrm{logit}(p_i) = \log \frac{p_i}{1-p_i} = \log(o_i) = \pmb{X_i\beta} = \sum_{j=0}^p \beta_j X_{ij},\]
where p is the number of predictors and \(X_{ij}\) is the value of predictor \(j\) for subject \(i\) (\(X_{i0} = 1\), corresponding to the intercept coefficient). In our model for the effect of age on the risk of showing signs of losing consciousness, we have
\[\log(o_i) = \beta_0 + \beta_1\times \mathrm{age}_i.\]
In logistic regression models, we can interpret the coefficients \(\beta_j\) in terms of the log of the ratios of odds (the log odds ratio, log OR). For example, let \(o_2\) be the odds of showing signs of losing consciousness for a 43 year-old, and \(o_1\) be the corresponding odds for a 42-year-old. Then \(\log(o_2/o_1)\) is
\[\log(o_2/o_1) = \log(o_2) - \log(o_1) =\] \[(\beta_0 + \beta_1\times 43) - (\beta_0 + \beta_1\times 42) = 43\beta_1 - 42\beta_1 = \beta_1\]
So the odds ratio (OR) \(o_2/o_1\) is \(\exp(\beta_1)\). It’s easy to see that we would get the same OR for all pairs of subjects that differ by one year in age. The same interpretation of \(\exp(\beta_1)\) also holds if the model includes additional predictors (that have identical values within each subject pair).
This is the natural interpretation of the coefficients in logistic regression models:
\[\log(o_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dotsb\]
A coefficient of \(\beta\) for a predictor \(x\) means that the odds (of ‘success’) are multiplied by the OR-value \(\exp(\beta)\) for each unit increase in \(x\).
An OR > 1 (corresponding to a coefficient \(\beta>0\)) means that higher values of the predictor are associated with higher odds and a higher probability of ‘success’. An OR < 1 means the opposite.
Logistic regression vs. linear regression
Compare the above interpretation with the interpretation of coefficients in linear regression:
\[\mu_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dotsb\]
For a coefficient \(\beta\) for a predictor \(x\), the mean outcome increases by \(\beta\) for each unit increase in \(x\).
For linear regression, the predictor is associated with an absolute increase (in mean); for logistic regression, the predictor is associated with a relative increase (in odds).
For non-logistic binomial regression models, i.e., binomial regression with a different link function than the logit, the coefficient interpretation is different (and more complicated).
Interpreting coefficients in our model
Recall the estimates from our model:
fixef(mod_gforces)$cond
(Intercept) Age
-2.9208644 0.1056919
Looking at the effect of age on the odds of showing signs of losing consciousness, each additional year multiplies the odds by exp(0.106) ≈ 1.11. In other words, each additional year increases the odds by about 11%.
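To report the effect directly on the OR scale, we can exponentiate the coefficient, and likewise the limits of a confidence interval. A sketch, refitting the model from above so the block is self-contained:

```r
library(glmmTMB)
library(GLMsData)

data(gforces)
mod_gforces = glmmTMB(Signs ~ Age, data = gforces, family = binomial())

# OR per one-year increase in age
exp(fixef(mod_gforces)$cond["Age"])  # ≈ 1.11

# Confidence limits, exponentiated to the OR scale
exp(confint(mod_gforces))
```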
Odds versus probabilities
For low probabilities (e.g., \(p<0.10\)), the odds \(o = p/(1-p)\) are almost identical to the probability \(p\):
curve(x / (1 - x), 0, .3, xlab = "Probability", ylab = "Odds")
abline(0, 1, col = "maroon")
For such small probabilities, the odds ratio (OR) from a logistic regression therefore approximates the probability ratio (often called the relative risk, RR). This is sometimes useful.
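A small numeric check, with illustrative risks of 5% and 10%:

```r
p1 = 0.05  # risk in one group
p2 = 0.10  # risk in another group

p2 / p1                            # RR = 2
(p2 / (1 - p2)) / (p1 / (1 - p1))  # OR ≈ 2.11, close to the RR
```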
Logistic regression with more than one outcome per observation
We will now look at a dataset where each ‘observation’ represents multiple trials. We assume that conditional on the covariates, the outcome has a binomial distribution.
Recall from an introductory probability/statistics course that a variable has a binomial distribution if it represents the number of ‘successes’ in n independent trials, each having the same probability of success, p. Binomial regression is a model for this probability.
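For example, the probability of exactly 2 successes in n = 6 trials with p = 0.1 (numbers chosen purely for illustration) can be computed with dbinom():

```r
# Binomial pmf: choose(6, 2) * 0.1^2 * 0.9^4
dbinom(2, size = 6, prob = 0.1)  # ≈ 0.098
```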
We have data on 23 space shuttle launches. Rubber O-rings are used to seal joints in the shuttle, and each space shuttle had 6 primary O-rings. The dataset contains information on the number of damaged O-rings along with the ambient air temperature on the day of launch.
We will first convert the temperature from Fahrenheit to degrees Celsius (°C) and add a variable for the number of O-rings on each shuttle (6 for all shuttles).
data(shuttles)
shuttles$temp_c = (shuttles$Temp - 32) / 1.8
shuttles$n_rings = 6
head(shuttles)
Temp Damaged temp_c n_rings
1 53 2 11.66667 6
2 57 1 13.88889 6
3 58 1 14.44444 6
4 63 1 17.22222 6
5 66 0 18.88889 6
6 67 0 19.44444 6
There seems to be an association between the ambient temperature and the number of damaged O-rings.
ggplot(shuttles, aes(x = temp_c, y = Damaged)) +
geom_point(size = 3)
For binomial data, the outcome variable must be expressed as cbind(n_successes, n_failures) in the glmmTMB() call:
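A sketch of what that call could look like for the shuttle data, using the variables created above (the exact call from the original analysis is not shown in this excerpt):

```r
library(glmmTMB)
library(GLMsData)

data(shuttles)
shuttles$temp_c = (shuttles$Temp - 32) / 1.8
shuttles$n_rings = 6

# cbind(successes, failures): here a 'success' is a damaged O-ring
mod_shuttles = glmmTMB(cbind(Damaged, n_rings - Damaged) ~ temp_c,
                       data = shuttles, family = binomial())
summary(mod_shuttles)
```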
The shuttle – named Challenger – was launched. It broke apart after 73 seconds, due to O-ring damage. All seven crew members died.
(You can read more about the disaster on Wikipedia. And of course everything is more complicated than our simple analyses make it out to be.)
Overdispersion
The assumption of independence in the binomial trials is important. What effect does lack of independence have? Well, assume we have a positive dependence, so that one O-ring failing results in an increased risk of the other ones also failing. One extreme example would be if all six O-rings failed if one of them did. Then we would observe only either 0 or 6 failures, and we would have much greater variance than that predicted by the binomial distribution. We would have overdispersion.
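The all-or-nothing scenario can be simulated to show the variance inflation. A sketch with arbitrary values n = 6, p = 0.3 (not fitted to the shuttle data):

```r
set.seed(1)
n = 6
p = 0.3

# Independent trials: variance n*p*(1-p) = 1.26
x_indep = rbinom(1e5, size = n, prob = p)

# All-or-nothing dependence: only 0 or 6 failures occur,
# giving variance n^2*p*(1-p) = 7.56
x_dep = n * rbinom(1e5, size = 1, prob = p)

var(x_indep)
var(x_dep)
```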
Overdispersion results in confidence intervals that are too narrow and P-values that are too small (i.e., we would have too high type I error rates).