Module 5: Binomial models

Introduction

In this module, we will look at data where the outcome variable can only take two values. The values are typically called ‘success’ and ‘failure’, though ‘success’ can indicate a bad outcome, e.g., death.

Let’s load the necessary packages:

library(glmmTMB)
library(GLMsData)
library(ggplot2)
Our first dataset is from a small experiment on the effect of oxygen deprivation, which we know can cause loss of consciousness. Type ?gforces for details on the dataset.
We are interested in whether a subject’s age is related to the risk of showing signs of losing consciousness. The outcome is coded as 1 (showing signs) and 0 (not showing signs). As always, we first plot our data:
data(gforces)
fig_gforces = ggplot(gforces, aes(x = Age, y = Signs)) +
  geom_point(size = 3)
fig_gforces
There seems to be some (very weak) evidence of age having an effect. For example, the three oldest subjects all showed signs of losing consciousness during the experiment. (But the figure would be more convincing if the top-left observation was missing …)
Since the outcome variable (Signs) is coded 0/1, an estimate of the overall probability (risk) of showing signs of losing consciousness (for a population with this age distribution) is the mean of the outcome variable (why?):
mean(gforces$Signs) # ~63%
[1] 0.625
If the risk varies with age, we might be tempted to fit a simple linear regression model. The regression line would look like this:
fig_gforces +
  geom_smooth(method = "lm", se = FALSE)
`geom_smooth()` using formula = 'y ~ x'
There are several problems with this approach:
- The regression would predict a risk > 1 (which is impossible) for large values of age (e.g., for age > 55).
- The observations are certainly not normally distributed.
- We know that we have heteroscedasticity, since for a binary/Bernoulli variable with probability of ‘success’ p, the variance is p × (1 − p). So if p varies with age, so does the variance.
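The last point is easy to visualise: the Bernoulli variance p × (1 − p) peaks at 0.25 when p = 0.5 and vanishes at the extremes. A small illustration (not part of the original analysis):

```r
# Variance of a Bernoulli variable as a function of p.
# Maximum of 0.25 at p = 0.5; zero at p = 0 and p = 1.
curve(x * (1 - x), from = 0, to = 1,
      xlab = "Probability p", ylab = "Variance p(1 - p)")
```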
We can fix this by using the actual conditional distribution, the Bernoulli distribution, which is a special case of the binomial distribution (the binomial() family in glmmTMB):
mod_gforces = glmmTMB(Signs ~ Age,
                      data = gforces, family = binomial())
summary(mod_gforces)
Family: binomial ( logit )
Formula: Signs ~ Age
Data: gforces
AIC BIC logLik deviance df.resid
12.4 12.6 -4.2 8.4 6
Conditional model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.92086 2.64639 -1.104 0.270
Age 0.10569 0.08037 1.315 0.188
The effect of age was not statistically significant in this tiny dataset, but we can still plot the estimated regression line, which is now non-linear on the probability scale:
# Create a small ‘grid’ with possible ‘Age’ values
d_grid = data.frame(Age = seq(0, 60, length = 50))

# Calculate estimated probabilities
d_grid$risk_est = predict(mod_gforces,
                          newdata = d_grid, type = "response")

# Show the probabilities as a function of age
fig_gforces +
  geom_line(data = d_grid, aes(y = risk_est),
            colour = "blue", linewidth = 1)
The default link function for the binomial() family is the logit: \(\mathrm{logit}(p) = \log \frac{p}{1-p}\). This fits a logistic regression model.
The inverse of the logit is sometimes called the expit. (In R, the logit function is called qlogis(), and the expit plogis().)
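A quick sketch of the round trip between the two scales (the value 0.75 is chosen purely for illustration):

```r
# The logit maps probabilities (0, 1) to the whole real line
qlogis(0.75)          # log(0.75 / 0.25) = log(3) ≈ 1.10

# The expit maps back to the probability scale
plogis(qlogis(0.75))  # 0.75
```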
After back-transformation to the probability scale using the expit, the relationship between the predictor and the estimated probability will always have an ‘S’ shape (or an upside-down ‘S’ shape), like in the above figure, with asymptotes at \(p=0\) and \(p=1\).
Probability and odds
If the odds of an outcome is 3:1 (or just 3), this means that it’s 3 times as likely that the outcome occurs as it is that it doesn’t. The probability of the outcome is then 0.75 (since \(\frac{0.75}{1-0.75}=\frac{0.75}{0.25} = 3\)). In general, we can convert odds to probabilities using the formula \(p = \frac{o}{o + 1}\), and probabilities to odds using the formula \(o = \frac{p}{1-p}\). Odds > 1 corresponds to a probability > 0.5.
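The two conversion formulas can be written as small helper functions (the function names are hypothetical, for illustration only):

```r
# Convert a probability to odds: o = p / (1 - p)
prob_to_odds = function(p) p / (1 - p)

# Convert odds to a probability: p = o / (o + 1)
odds_to_prob = function(o) o / (o + 1)

prob_to_odds(0.75)  # 3
odds_to_prob(3)     # 0.75
```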
Odds ratios in logistic regression
As explained earlier, the logistic regression model is a linear function of the logit of the probability, i.e., a linear function of the log odds,
\[g(\mu_i) = g(p_i) = \mathrm{logit}(p_i) = \log \frac{p_i}{1-p_i} = \log(o_i) = \pmb{X_i\beta} = \sum_{j=0}^p \beta_j X_{ij},\]
where p is the number of predictors and \(X_{ij}\) is the value of predictor \(j\) for subject \(i\) (\(X_{i0} = 1\), corresponding to the intercept coefficient). In our model for the effect of age on the risk of showing signs of losing consciousness, we have
\[\log(o_i) = \beta_0 + \beta_1\times \mathrm{age}_i.\]
In logistic regression models, we can interpret the coefficients \(\beta_j\) in terms of the log of the ratios of odds (the log odds ratio, log OR). For example, let \(o_2\) be the odds of showing signs of losing consciousness for a 43 year-old, and \(o_1\) be the corresponding odds for a 42-year-old. Then \(\log(o_2/o_1)\) is
\[\log(o_2/o_1) = \log(o_2) - \log(o_1) =\] \[(\beta_0 + \beta_1\times 43) - (\beta_0 + \beta_1\times 42) = 43\beta_1 - 42\beta_1 = \beta_1\]
So the odds ratio (OR) \(o_2/o_1\) is \(\exp(\beta_1)\). It’s easy to see that we would get the same OR for all pairs of subjects that differ by one year in age. The same interpretation of \(\exp(\beta_1)\) also holds if the model includes additional predictors (that have identical values within each subject pair).
This is the natural interpretation of the coefficients in logistic regression models:
\[\log(o_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dotsb\]
A coefficient of \(\beta\) for a predictor \(x\) means that the odds (of ‘success’) are multiplied by the OR-value \(\exp(\beta)\) for each unit increase in \(x\).
An OR > 1 (corresponding to a coefficient \(\beta>0\)) means that higher values of the predictor are associated with higher odds and a higher probability of ‘success’. An OR < 1 means the opposite.
Logistic regression vs. linear regression
Compare the above interpretation with the interpretation of coefficients in linear regression:
\[\mu_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dotsb\]
For a coefficient \(\beta\) for a predictor \(x\), the mean outcome increases by \(\beta\) for each unit increase in \(x\).
For linear regression, the predictor is associated with an absolute increase (in mean); for logistic regression, the predictor is associated with a relative increase (in odds).
For non-logistic binomial regression models, i.e., binomial regression with a different link function than the logit, the coefficient interpretation is different (and more complicated).
Interpreting coefficients in our model
Recall the estimates from our model:
fixef(mod_gforces)$cond
(Intercept) Age
-2.9208644 0.1056919
Looking at the effect of age on the odds of showing signs of losing consciousness, each additional year multiplies the odds by exp(0.106) ≈ 1.11. In other words, each additional year increases the odds by about 11%.
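To report the effect directly on the OR scale, we can exponentiate the coefficient, and likewise the limits of a confidence interval. A sketch, refitting the model from above so the block is self-contained:

```r
library(glmmTMB)
library(GLMsData)

data(gforces)
mod_gforces = glmmTMB(Signs ~ Age, data = gforces, family = binomial())

# OR per one-year increase in age
exp(fixef(mod_gforces)$cond["Age"])  # ≈ 1.11

# Confidence limits, exponentiated to the OR scale
exp(confint(mod_gforces))
```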
Odds versus probabilities
For low probabilities (e.g., \(p<0.10\)), the odds \(o = p/(1-p)\) are almost identical to the probability \(p\):
curve(x / (1 - x), 0, .3, xlab = "Probability", ylab = "Odds")
abline(0, 1, col = "maroon")
For such small probabilities, the odds ratio (OR) from a logistic regression therefore approximates the probability ratio (often called the relative risk, RR). This is sometimes useful.
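A small numeric check, with illustrative risks of 5% and 10%:

```r
p1 = 0.05  # risk in one group
p2 = 0.10  # risk in another group

p2 / p1                            # RR = 2
(p2 / (1 - p2)) / (p1 / (1 - p1))  # OR ≈ 2.11, close to the RR
```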
Logistic regression with more than one outcome per observation
We will now look at a dataset where each ‘observation’ represents multiple trials. We assume that conditional on the covariates, the outcome has a binomial distribution.
Recall from an introductory probability/statistics course that a variable has a binomial distribution if it represents the number of ‘successes’ in n independent trials, each having the same probability of success, p. Binomial regression is a model for this probability.
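For example, the probability of exactly 2 successes in n = 6 trials with p = 0.1 (numbers chosen purely for illustration) can be computed with dbinom():

```r
# Binomial pmf: choose(6, 2) * 0.1^2 * 0.9^4
dbinom(2, size = 6, prob = 0.1)  # ≈ 0.098
```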
We have data on 23 space shuttle launches. Rubber O-rings are used to seal joints in the shuttle, and each space shuttle had 6 primary O-rings. The dataset contains information on the number of damaged O-rings along with the ambient air temperature on the day of launch.
We will first convert the temperature from Fahrenheit to degrees Celsius (°C) and add a variable for the number of O-rings on each shuttle (6 for all shuttles).
data(shuttles)
shuttles$temp_c = (shuttles$Temp - 32) / 1.8
shuttles$n_rings = 6
head(shuttles)
Temp Damaged temp_c n_rings
1 53 2 11.66667 6
2 57 1 13.88889 6
3 58 1 14.44444 6
4 63 1 17.22222 6
5 66 0 18.88889 6
6 67 0 19.44444 6
There seems to be an association between the ambient temperature and the number of damaged O-rings.
ggplot(shuttles, aes(x = temp_c, y = Damaged)) +
geom_point(size = 3)
For binomial data, the outcome variable must be expressed as cbind(n_successes, n_failures) in the glmmTMB() call:
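A sketch of what that call could look like for the shuttle data, using the variables created above (the exact call from the original analysis is not shown in this excerpt):

```r
library(glmmTMB)
library(GLMsData)

data(shuttles)
shuttles$temp_c = (shuttles$Temp - 32) / 1.8
shuttles$n_rings = 6

# cbind(successes, failures): here a 'success' is a damaged O-ring
mod_shuttles = glmmTMB(cbind(Damaged, n_rings - Damaged) ~ temp_c,
                       data = shuttles, family = binomial())
summary(mod_shuttles)
```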
The shuttle – named Challenger – was launched. It broke apart after 73 seconds, due to O-ring damage. All seven crew members died.
(You can read more about the disaster on Wikipedia. And of course everything is more complicated than our simple analyses make it out to be.)
Overdispersion
The assumption of independence in the binomial trials is important. What effect does lack of independence have? Well, assume we have a positive dependence, so that one O-ring failing results in an increased risk of the other ones also failing. One extreme example would be if all six O-rings failed if one of them did. Then we would observe only either 0 or 6 failures, and we would have much greater variance than that predicted by the binomial distribution. We would have overdispersion.
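The all-or-nothing scenario can be simulated to show the variance inflation. A sketch with arbitrary values n = 6, p = 0.3 (not fitted to the shuttle data):

```r
set.seed(1)
n = 6
p = 0.3

# Independent trials: variance n*p*(1-p) = 1.26
x_indep = rbinom(1e5, size = n, prob = p)

# All-or-nothing dependence: only 0 or 6 failures occur,
# giving variance n^2*p*(1-p) = 7.56
x_dep = n * rbinom(1e5, size = 1, prob = p)

var(x_indep)
var(x_dep)
```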
Overdispersion results in confidence intervals that are too narrow and P-values that are too small (i.e., we would have too high type I error rates).