= 0:9
y = dpois(y, 2)
p library(ggplot2)
ggplot(mapping = aes(x = y, y = p)) +
geom_col() +
scale_x_continuous(breaks = y) +
theme(panel.grid = element_blank())
Module 4: Count data
Introduction
In this module, we will look at count data, with emphasis on the Poisson model.
We’ll simulate data from a Poisson distribution with mean 2. And just to give the numbers some ‘colour’, let’s pretend that they represent the number of accidents that drivers of electric scooters (in the Bergen dialect: ‘el-løperhjul’) experience in a year.
So ‘on average’, a driver will be involved in 2 accidents each year. The actual number of accidents can of course be 0 or 1 or 2 or 3 or …, and the probability distribution looks like this:
Simulated Poisson data
We’ll simulate the number of accidents in a year for 500 riders of electric scooters:
set.seed(1915)
= rpois(500, 2) n_accidents
The sample mean and variance are:
mean(n_accidents)
[1] 1.934
var(n_accidents)
[1] 2.037719
Note that the two numbers are very similar. This is a general feature of the Poisson distribution: the population mean equals the population variance.
Fitting an intercept-only Poisson model
We can also estimate the mean (and thereby the variance) by fitting a Poisson regression model with just an intercept (we now use the simulated data with mean 2):
library(glmmTMB)
= data.frame(n_accidents)
dat_pois = glmmTMB(n_accidents ~ 1,
mod_pois data = dat_pois, family = poisson())
summary(mod_pois)
Family: poisson ( log )
Formula: n_accidents ~ 1
Data: dat_pois
AIC BIC logLik deviance df.resid
1716.4 1720.6 -857.2 1714.4 499
Conditional model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.65959 0.03216 20.51 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(fixef(mod_pois)$cond) # Mean and variance estimate
(Intercept)
1.934
Overdispersion and underdispersion
Since the variance is supposed to equal the mean, we don’t get a separate estimate of the variance (like we did with the linear models in module 2). One method of testing whether the Poisson distribution fits the data is to compare the sample variance with the sample mean. If the variance is much higher than the mean, this indicates that we have more variation than we would expect from the Poisson distribution; we have overdispersion. If it’s much lower than the mean, we have underdispersion (a much rarer phenomenon). In both cases, the Poisson model would not fit the data well.
However, if we have more than one predictor, testing for over- and underdispersion is more complicated, as we have to compare the conditional variances (conditional on the predictors) with the conditional means.
One general method, which also works for other distributions than the Poisson, is to compare the sample variance of the residuals with what we would expect the variance to be if the fitted model was correct. To do this, we simulate many (here Poisson-distributed) random variables from the fitted model, i.e., using the estimated mean structure (where the parameters are replaced by their estimates). Then we compare the actual variance of the residuals with the distribution of corresponding variances from the simulation.
The DHARMa
package can do this for us automatically (note that it uses a scaled version of the variances):
library(DHARMa)
This is DHARMa 0.4.6. For overview type '?DHARMa'. For recent changes, type news(package = 'DHARMa')
testDispersion(mod_pois) # 250 simulations by default
DHARMa nonparametric dispersion test via sd of residuals fitted vs.
simulated
data: simulationOutput
dispersion = 1.0523, p-value = 0.408
alternative hypothesis: two.sided
The red line is the (scaled) variance from the model, while the histogram shows the estimated distribution of this variance if the model was correct. The red line is well within the bulk of the distribution (also reflected by the P-value), so there’s little evidence of overdispersion.
Simulated data with overdispersion
Let’s now look at some data that have overdispersion. We let the expected number of accidents in a year differ between riders. This could be the case if some riders have more risky behaviour and others are more risk-averse (or simply are better at steering the e-scooter). However, the mean number of accidents across riders is still 2.
We use the gamma distribution to simulate the 500 mean/expected number of accidents (one for each rider):
set.seed(123)
= rgamma(500, 1.5, 1.5 / 2)
accident_means ggplot(mapping = aes(x = accident_means)) +
geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
And then we simulate the actual number of accidents experienced by each rider:
= data.frame(n_accidents = rpois(500, accident_means)) dat
The average number of accidents is still around 2 (the expected number of accidents is exactly 2), but the variance is much larger; we have overdispersion:
mean(dat$n_accidents)
[1] 1.938
var(dat$n_accidents)
[1] 4.715587
We still get a sensible estimate of the mean using the Poisson distribution:
= glmmTMB(n_accidents ~ 1,
mod family = poisson(), data = dat)
summary(mod)
Family: poisson ( log )
Formula: n_accidents ~ 1
Data: dat
AIC BIC logLik deviance df.resid
2102.4 2106.6 -1050.2 2100.4 499
Conditional model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.66166 0.03212 20.6 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(fixef(mod)$cond) # Exactly equal to the sample mean
(Intercept)
1.938
But the dispersion clearly test shows us that the Poisson model does not fit the data well:
testDispersion(mod)
DHARMa nonparametric dispersion test via sd of residuals fitted vs.
simulated
data: simulationOutput
dispersion = 2.4305, p-value < 2.2e-16
alternative hypothesis: two.sided
What this means in practice, is that we can’t trust P-values and confidence intervals from the model. The confidence intervals will be too narrow, and the P-values will be too small.
A negative binomial model
For data that are overdispersed (compared to the Poisson distribution), a distribution that often works well is the negative binomial distribution. It uses the same default link function as the Poisson.
= glmmTMB(n_accidents ~ 1,
mod_nbin family = nbinom2(), data = dat)
summary(mod_nbin)
Family: nbinom2 ( log )
Formula: n_accidents ~ 1
Data: dat
AIC BIC logLik deviance df.resid
1879.9 1888.3 -937.9 1875.9 498
Dispersion parameter for nbinom2 family (): 1.4
Conditional model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.66166 0.04962 13.34 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note that a Dispersion parameter
line has now appeared. Let’s do a dispersion test.
testDispersion(mod_nbin)
DHARMa nonparametric dispersion test via sd of residuals fitted vs.
simulated
data: simulationOutput
dispersion = 1.0194, p-value = 0.84
alternative hypothesis: two.sided
When 1) a random variable λ has a gamma distribution, and 2) conditional on λ, another random variable Y has a Poisson distribution with mean λ, then the unconditional distribution of Y has a negative binomial distribution. The mean of Y will be equal to the mean of λ, but the variance will be greater than this mean.
This is exactly the situation our data was sampled from, so the negative binomial distribution should fit our data.
The negative binomial distribution can be parametrised in many ways. The glmmTMB
package implements two different parametrisations, nbinom1()
and nbinom2()
. See the slides for this module for information on the parameters involved.
Overdispersion because of missing predictors
Overdispersion in a Poisson model can also indicate that there are important predictors that we have left out of the model. So adding additional predictors can reduce or even eliminate the overdispersion.
Assume we have a ‘risk score’ for each rider (e.g., based on historical data). Here is an example of a (very good!) risk score:
$risk_score = log(accident_means) / 5 dat
Fit a Poisson model with this risk score as a predictor (and of course also include the default intercept). Try to understand why the two coefficient estimates and their corresponding P-values are what they are.
Taking exposure / unequal sampling into account
Another reason that the mean number of accidents varies between riders is that the amount of time they spend riding an electric scooter (i.e., their risk ‘exposure’) varies. So instead of looking at the number of accidents, we should look at the number of accidents per hour (or per day, per week or per year) spent riding an e-scooter. That is, we should look at a rate instead of a count.
For simplicity, let’s assume that the rate is the same for each rider, 2 per hundred hours riding an e-scooter (2 per ‘ride-hours’). (So if you ride a scooter for a hundred hours each year, you will only average experience 2 accidents per year.)
The number of ride-hours varies between riders, from 5 to 500 (in our model, we’ll use the number of hundreds of hours, or ‘hectohours’):
= runif(500, 5, 500)
cycle_hours = cycle_hours / 100 # ‘hectohours’ cycle_hhours
(We here assume that the number of hours is uniformly distributed. This isn’t very realistic, but the distribution used is not important at all.)
We simulate the actual number of accidents:
set.seed(100)
= 2
accident_rate = accident_rate * cycle_hhours
accident_means = rpois(500, accident_means)
y = data.frame(n_accidents = y, time = cycle_hhours) dat
Some people experience very many accidents:
plot(table(dat$n_accidents))
The mean number of accidents for person i is
\[\mu_i = \lambda × \text{number of hundred ride-hours for person $i$} = 2 × t_i.\]
The Poisson regression models the log-mean,
\[\log(\mu_i) = \log(\lambda × t_i) = \log(\lambda) + \log(t_i).\]
There’s no i on the λ since we assume that the accident rate is the same for everyone in this model. But in general, we will have other predictors for the rate, and the \(\log(\lambda)\) will be \(\log(\lambda_i)\) and equal to a linear predictor (\(\pmb{X_i b}\)).
There’s no coefficient in front of \(\log(t_i)\); we assume that the number of accidents is directly proportional to the time spent riding the e-scooter – the exposure time.
To take the exposure time into account, we use the offset
argument in glmmTMB()
:
= glmmTMB(n_accidents ~ 1,
mod_exposure offset = log(time),
family = poisson(), data = dat
)summary(mod_exposure)
Family: poisson ( log )
Formula: n_accidents ~ 1
Data: dat
Offset: log(time)
AIC BIC logLik deviance df.resid
2045.0 2049.2 -1021.5 2043.0 499
Conditional model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.71125 0.01973 36.06 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
There is no evidence of overdispersion:
testDispersion(mod_exposure)
DHARMa nonparametric dispersion test via sd of residuals fitted vs.
simulated
data: simulationOutput
dispersion = 0.95298, p-value = 0.488
alternative hypothesis: two.sided
Other models for over- and underdispersion
The negative binomial distribution only fits overdispersed data. For data that are either over- or underdispersed, we can use the Conway–Maxwell–Poisson model (compois()
family). See the slides for information on this model.
Real data – lung cancer in Denmark
The dataset danishlc
contains information on the number of new lung cancer cases in four cities in Denmark during the period 1968–1971, stratified by age group. It also includes information on the number of inhabitants in each city / age group (the Pop
variable). To avoid small numbers, we’ll express all incidence rates per 1,000 inhabitants:
library(GLMsData)
data(danishlc)
$Pop_1000 = danishlc$Pop / 1000
danishlc danishlc
Cases Pop Age City Pop_1000
1 11 3059 40-54 Fredericia 3.059
2 11 800 55-59 Fredericia 0.800
3 11 710 60-64 Fredericia 0.710
4 10 581 65-69 Fredericia 0.581
5 11 509 70-74 Fredericia 0.509
6 10 605 >74 Fredericia 0.605
7 13 2879 40-54 Horsens 2.879
8 6 1083 55-59 Horsens 1.083
9 15 923 60-64 Horsens 0.923
10 10 834 65-69 Horsens 0.834
11 12 634 70-74 Horsens 0.634
12 2 782 >74 Horsens 0.782
13 4 3142 40-54 Kolding 3.142
14 8 1050 55-59 Kolding 1.050
15 7 895 60-64 Kolding 0.895
16 11 702 65-69 Kolding 0.702
17 9 535 70-74 Kolding 0.535
18 12 659 >74 Kolding 0.659
19 5 2520 40-54 Vejle 2.520
20 7 878 55-59 Vejle 0.878
21 10 839 60-64 Vejle 0.839
22 14 631 65-69 Vejle 0.631
23 8 539 70-74 Vejle 0.539
24 7 619 >74 Vejle 0.619
Fit a Poisson model for the number of cases. Only include an intercept term and an offset (use Pop_1000
instead of Pop
as the population variable).
Now add age and city as additional predictors (don’t include any interactions).
The summary()
of the model contains coefficients for each city and age group (except for the reference city / age group). To test if there are differences between cities (age groups), we can use a likelihood ratio test to compare our ‘full’ model with a model that doesn’t include city (age group) as a predictor. It’s easiest to use the drop1()
function for this:
drop1(mod_lung, test = "Chisq")