Module 4: Count data

Author

Karl Ove Hufthammer

Introduction

In this module, we will look at count data, with emphasis on the Poisson model.

We’ll simulate data from a Poisson distribution with mean 2. And just to give the numbers some ‘colour’, let’s pretend that they represent the number of accidents that drivers of electric scooters (in the Bergen dialect: ‘el-løperhjul’) experience in a year.

So ‘on average’, a driver will be involved in 2 accidents each year. The actual number of accidents can of course be 0 or 1 or 2 or 3 or …, and the probability distribution looks like this:

y = 0:9
p = dpois(y, 2)
library(ggplot2)
ggplot(mapping = aes(x = y, y = p)) +
  geom_col() +
  scale_x_continuous(breaks = y) +
  theme(panel.grid = element_blank())

Simulated Poisson data

We’ll simulate the number of accidents in a year for 500 riders of electric scooters:

set.seed(1915)
n_accidents = rpois(500, 2)

The sample mean and variance are:

mean(n_accidents)

[1] 1.934

var(n_accidents)

[1] 2.037719

Note that the two numbers are very similar. This is a general feature of the Poisson distribution: the population mean equals the population variance.

Fitting an intercept-only Poisson model

We can also estimate the mean (and thereby the variance) by fitting a Poisson regression model with just an intercept (we now use the simulated data with mean 2):

library(glmmTMB)
dat_pois = data.frame(n_accidents)
mod_pois = glmmTMB(n_accidents ~ 1,
                   data = dat_pois, family = poisson())
summary(mod_pois)

 Family: poisson  ( log )
Formula:          n_accidents ~ 1
Data: dat_pois

     AIC      BIC   logLik deviance df.resid 
  1716.4   1720.6   -857.2   1714.4      499 


Conditional model:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.65959    0.03216   20.51   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

exp(fixef(mod_pois)$cond) # Mean and variance estimate

(Intercept) 
      1.934

Overdispersion and underdispersion

Since the variance is supposed to equal the mean, we don’t get a separate estimate of the variance (like we did with the linear models in module 2). One method of testing whether the Poisson distribution fits the data is to compare the sample variance with the sample mean. If the variance is much higher than the mean, this indicates that we have more variation than we would expect from the Poisson distribution; we have overdispersion. If it’s much lower than the mean, we have underdispersion (a much rarer phenomenon). In both cases, the Poisson model would not fit the data well.

However, if we have more than one predictor, testing for over- and underdispersion is more complicated, as we have to compare the conditional variances (conditional on the predictors) with the conditional means.

One general method, which also works for other distributions than the Poisson, is to compare the sample variance of the residuals with what we would expect the variance to be if the fitted model was correct. To do this, we simulate many (here Poisson-distributed) random variables from the fitted model, i.e., using the estimated mean structure (where the parameters are replaced by their estimates). Then we compare the actual variance of the residuals with the distribution of corresponding variances from the simulation.

The DHARMa package can do this for us automatically (note that it uses a scaled version of the variances):

library(DHARMa)

This is DHARMa 0.4.6. For overview type '?DHARMa'. For recent changes, type news(package = 'DHARMa')

testDispersion(mod_pois) # 250 simulations by default


    DHARMa nonparametric dispersion test via sd of residuals fitted vs.
    simulated

data:  simulationOutput
dispersion = 1.0523, p-value = 0.408
alternative hypothesis: two.sided

The red line is the (scaled) variance from the model, while the histogram shows the estimated distribution of this variance if the model was correct. The red line is well within the bulk of the distribution (also reflected by the P-value), so there’s little evidence of overdispersion.

Simulated data with overdispersion

Let’s now look at some data that have overdispersion. We let the expected number of accidents in a year differ between riders. This could be the case if some riders have more risky behaviour and others are more risk-averse (or simply are better at steering the e-scooter). However, the mean number of accidents across riders is still 2.

We use the gamma distribution to simulate the 500 mean/expected number of accidents (one for each rider):

set.seed(123)
accident_means = rgamma(500, 1.5, 1.5 / 2)
ggplot(mapping = aes(x = accident_means)) +
  geom_histogram()

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

And then we simulate the actual number of accidents experienced by each rider:

dat = data.frame(n_accidents = rpois(500, accident_means))

The average number of accidents is still around 2 (the expected number of accidents is exactly 2), but the variance is much larger; we have overdispersion:

mean(dat$n_accidents)

[1] 1.938

var(dat$n_accidents)

[1] 4.715587

We still get a sensible estimate of the mean using the Poisson distribution:

mod = glmmTMB(n_accidents ~ 1,
              family = poisson(), data = dat)
summary(mod)

 Family: poisson  ( log )
Formula:          n_accidents ~ 1
Data: dat

     AIC      BIC   logLik deviance df.resid 
  2102.4   2106.6  -1050.2   2100.4      499 


Conditional model:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.66166    0.03212    20.6   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

exp(fixef(mod)$cond) # Exactly equal to the sample mean

(Intercept) 
      1.938

But the dispersion clearly test shows us that the Poisson model does not fit the data well:

testDispersion(mod)


    DHARMa nonparametric dispersion test via sd of residuals fitted vs.
    simulated

data:  simulationOutput
dispersion = 2.4305, p-value < 2.2e-16
alternative hypothesis: two.sided

What this means in practice, is that we can’t trust P-values and confidence intervals from the model. The confidence intervals will be too narrow, and the P-values will be too small.

A negative binomial model

For data that are overdispersed (compared to the Poisson distribution), a distribution that often works well is the negative binomial distribution. It uses the same default link function as the Poisson.

mod_nbin = glmmTMB(n_accidents ~ 1,
                   family = nbinom2(), data = dat)
summary(mod_nbin)

 Family: nbinom2  ( log )
Formula:          n_accidents ~ 1
Data: dat

     AIC      BIC   logLik deviance df.resid 
  1879.9   1888.3   -937.9   1875.9      498 


Dispersion parameter for nbinom2 family ():  1.4 

Conditional model:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.66166    0.04962   13.34   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Note that a Dispersion parameter line has now appeared. Let’s do a dispersion test.

testDispersion(mod_nbin)


    DHARMa nonparametric dispersion test via sd of residuals fitted vs.
    simulated

data:  simulationOutput
dispersion = 1.0194, p-value = 0.84
alternative hypothesis: two.sided

Exercise 1

Look at the figure and P-value from the dispersion test for the negative binomial model. Are there any indications of overdispersion or underdispersion?

Exercise 2

Compare the standard errors (Std. Error) for the (Intercept) coefficient in the model summaries of the two models (Poisson model and negative binomial model on the same data). One is larger than the other.

Try to explain why this is consistent with what you know about the (over)dispersion in the two models. Which standard error estimate do you trust?

Note

When 1) a random variable λ has a gamma distribution, and 2) conditional on λ, another random variable Y has a Poisson distribution with mean λ, then the unconditional distribution of Y has a negative binomial distribution. The mean of Y will be equal to the mean of λ, but the variance will be greater than this mean.

This is exactly the situation our data was sampled from, so the negative binomial distribution should fit our data.

The negative binomial distribution can be parametrised in many ways. The glmmTMB package implements two different parametrisations, nbinom1() and nbinom2(). See the slides for this module for information on the parameters involved.

Overdispersion because of missing predictors

Overdispersion in a Poisson model can also indicate that there are important predictors that we have left out of the model. So adding additional predictors can reduce or even eliminate the overdispersion.

Assume we have a ‘risk score’ for each rider (e.g., based on historical data). Here is an example of a (very good!) risk score:

dat$risk_score = log(accident_means) / 5

Fit a Poisson model with this risk score as a predictor (and of course also include the default intercept). Try to understand why the two coefficient estimates and their corresponding P-values are what they are.

Exercise 3

Perform a dispersion test on your fitted model. Are there any indications of overdispersion?

Taking exposure / unequal sampling into account

Another reason that the mean number of accidents varies between riders is that the amount of time they spend riding an electric scooter (i.e., their risk ‘exposure’) varies. So instead of looking at the number of accidents, we should look at the number of accidents per hour (or per day, per week or per year) spent riding an e-scooter. That is, we should look at a rate instead of a count.

For simplicity, let’s assume that the rate is the same for each rider, 2 per hundred hours riding an e-scooter (2 per ‘ride-hours’). (So if you ride a scooter for a hundred hours each year, you will only average experience 2 accidents per year.)

The number of ride-hours varies between riders, from 5 to 500 (in our model, we’ll use the number of hundreds of hours, or ‘hectohours’):

cycle_hours = runif(500, 5, 500)
cycle_hhours = cycle_hours / 100 # ‘hectohours’

(We here assume that the number of hours is uniformly distributed. This isn’t very realistic, but the distribution used is not important at all.)

We simulate the actual number of accidents:

set.seed(100)
accident_rate = 2
accident_means = accident_rate * cycle_hhours
y = rpois(500, accident_means)
dat = data.frame(n_accidents = y, time = cycle_hhours)

Some people experience very many accidents:

plot(table(dat$n_accidents))

The mean number of accidents for person i is

\[\mu_i = \lambda × \text{number of hundred ride-hours for person $i$} = 2 × t_i.\]

The Poisson regression models the log-mean,

\[\log(\mu_i) = \log(\lambda × t_i) = \log(\lambda) + \log(t_i).\]

Note

There’s no i on the λ since we assume that the accident rate is the same for everyone in this model. But in general, we will have other predictors for the rate, and the $\log(\lambda)$ will be $\log(\lambda_i)$ and equal to a linear predictor ($\pmb{X_i b}$).

There’s no coefficient in front of $\log(t_i)$; we assume that the number of accidents is directly proportional to the time spent riding the e-scooter – the exposure time.

To take the exposure time into account, we use the offset argument in glmmTMB():

mod_exposure = glmmTMB(n_accidents ~ 1,
  offset = log(time),
  family = poisson(), data = dat
)
summary(mod_exposure)

 Family: poisson  ( log )
Formula:          n_accidents ~ 1
Data: dat
 Offset: log(time)

     AIC      BIC   logLik deviance df.resid 
  2045.0   2049.2  -1021.5   2043.0      499 


Conditional model:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.71125    0.01973   36.06   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

There is no evidence of overdispersion:

testDispersion(mod_exposure)


    DHARMa nonparametric dispersion test via sd of residuals fitted vs.
    simulated

data:  simulationOutput
dispersion = 0.95298, p-value = 0.488
alternative hypothesis: two.sided

Exercise 4

Use the fitted model to estimate the accident rate (mean number of accidents per 100 ride-hours). Also calculate a 95% confidence intervals for the rate.

Other models for over- and underdispersion

The negative binomial distribution only fits overdispersed data. For data that are either over- or underdispersed, we can use the Conway–Maxwell–Poisson model (compois() family). See the slides for information on this model.

Real data – lung cancer in Denmark

The dataset danishlc contains information on the number of new lung cancer cases in four cities in Denmark during the period 1968–1971, stratified by age group. It also includes information on the number of inhabitants in each city / age group (the Pop variable). To avoid small numbers, we’ll express all incidence rates per 1,000 inhabitants:

library(GLMsData)
data(danishlc)
danishlc$Pop_1000 = danishlc$Pop / 1000
danishlc

   Cases  Pop   Age       City Pop_1000
1     11 3059 40-54 Fredericia    3.059
2     11  800 55-59 Fredericia    0.800
3     11  710 60-64 Fredericia    0.710
4     10  581 65-69 Fredericia    0.581
5     11  509 70-74 Fredericia    0.509
6     10  605   >74 Fredericia    0.605
7     13 2879 40-54    Horsens    2.879
8      6 1083 55-59    Horsens    1.083
9     15  923 60-64    Horsens    0.923
10    10  834 65-69    Horsens    0.834
11    12  634 70-74    Horsens    0.634
12     2  782   >74    Horsens    0.782
13     4 3142 40-54    Kolding    3.142
14     8 1050 55-59    Kolding    1.050
15     7  895 60-64    Kolding    0.895
16    11  702 65-69    Kolding    0.702
17     9  535 70-74    Kolding    0.535
18    12  659   >74    Kolding    0.659
19     5 2520 40-54      Vejle    2.520
20     7  878 55-59      Vejle    0.878
21    10  839 60-64      Vejle    0.839
22    14  631 65-69      Vejle    0.631
23     8  539 70-74      Vejle    0.539
24     7  619   >74      Vejle    0.619

Fit a Poisson model for the number of cases. Only include an intercept term and an offset (use Pop_1000 instead of Pop as the population variable).

Exercise 5

Perform a dispersion test. Does the Poisson model seem to fit the data?

Now add age and city as additional predictors (don’t include any interactions).

Exercise 6

Perform a dispersion test for this new model. Does the model seem to fit the data?

Exercise 7

Calculate the exp() of the intercept coefficient estimate. What does this number represent?

(Recall that the reference level for a categorical/factor variable is the first level. You can use the level() function on a factor variable to get the list of levels.)

Exercise 8

Use the estimated model coefficients to calculate the model-based estimated lung cancer incidence rate (per 1,000) for 60–64 year-olds living in Vejle (for the time period used in the dataset).

Use the estimated model coefficients to calculate the model-based estimated expected number of new lung cancer cases for the same subgroup. (If the model is good, this number should be similar to the actual number of lung cancer cases.)

The summary() of the model contains coefficients for each city and age group (except for the reference city / age group). To test if there are differences between cities (age groups), we can use a likelihood ratio test to compare our ‘full’ model with a model that doesn’t include city (age group) as a predictor. It’s easiest to use the drop1() function for this:

drop1(mod_lung, test = "Chisq")

Exercise 9

Run likelihood ratio tests to test if the cancer incidence rate varies between cities and/or age groups. (Use the conventional significance level of 0.05 when interpreting the P-values.)

Does it look like the rate varies between cities?
Does it look like it varies between age groups?
Based on the coefficient estimates from the summary() output, which age group seems to have the highest cancer incidence rate?