Module 7: Zero inflation

Author

Karl Ove Hufthammer

Introduction

Coral is a keen angler and goes fishing every day. Well, except if it’s raining; then she maybe stays at home. Or if she’s ill (which rarely happens).

We’ll simulate the number of fish she catches each day for a year and fit a zero-inflated model.

The parameters

In spring, she will on average catch 2 fish per hour spent fishing. In the summer, only 1, in the fall, 1.5 and in the winter, only 0.3.

Well, this is if she goes fishing in the morning. Midday, she’ll only catch on average ¼ the number of fish. On the other hand, in the evening, she’ll catch 60% more.

She’s targeting fish that are more active when it’s raining, so if she’s out fishing in the rain, she’ll double the (expected) number of fish she’ll catch. But when it’s raining, she actually prefers to stay at home. With probability 70%, she’ll stay home. On any day, she also has a 3% chance of getting ill, and then she’ll spend the day in bed.

The chance of rain also varies between seasons:

  • Spring: 10%
  • Summer: 15%
  • Fall: 40%
  • Winter: 30%

Simulating the data

Let’s set up some data frames for storing the various parameters and effects.

seasons = c("Spring", "Summer", "Fall", "Winter")
d_season = data.frame(
  season = seasons,
  mean_fish_season = c(2, 1, 1.5, 0.3)
)
times = c("Morning", "Midday", "Evening")
d_time = data.frame(
  time = times,
  time_effect_fish = c(1, 1 / 4, 1.6)
)
p_ill = 0.03
p_rain_spring = .1
p_home_if_raining = .7
rain_effect = 2
d_rain = data.frame(
  season = seasons,
  season_effect_rain = c(1, 1.5, 4, 3)
)

We’ll simulate about a year’s worth of data. We’ll assume that she chooses a random time of day to go fishing and that she fishes for 30 minutes to 2.5 hours each time she goes fishing.

(Since she simply loves fishing, and would rather spend the whole day on the sea, the time she actually spends is based on how much time she has available, not on the number of fish she catches.)

set.seed(100)
days_per_season = 90
d_days = data.frame(season = rep(seasons, each = days_per_season),
                   time = sample(times, 4*days_per_season, replace =TRUE),
                   time_spent = round(runif(4*days_per_season, .5, 2.5), 1))
library(dplyr)
d_exp = d_days %>% 
  left_join(d_season) %>% 
  left_join(d_time) %>% 
  left_join(d_rain)
d_sim = d_exp %>% 
  mutate(season=factor(season, levels=seasons),
         time=factor(time, levels=times),
         p_rain = p_rain_spring*season_effect_rain,
         raining = rbinom(nrow(.), 1, p_rain),
         rain_effect_fish = if_else(raining==1, rain_effect, 1),
         p_home = p_ill + (1-p_ill)*raining*p_home_if_raining,
         p_fishing = 1-p_home,
         fishing = rbinom(nrow(.), 1, p_fishing),
         mean_fish = mean_fish_season*time_effect_fish*rain_effect_fish*time_spent,
         n_fish = fishing*rpois(nrow(.), mean_fish)
         )
head(d_sim)
  season    time time_spent mean_fish_season time_effect_fish
1 Spring  Midday        0.7                2             0.25
2 Spring Evening        0.6                2             1.60
3 Spring  Midday        1.5                2             0.25
4 Spring Evening        0.6                2             1.60
5 Spring Morning        2.4                2             1.00
6 Spring  Midday        1.8                2             0.25
  season_effect_rain p_rain raining rain_effect_fish p_home p_fishing fishing
1                  1    0.1       0                1  0.030     0.970       1
2                  1    0.1       0                1  0.030     0.970       1
3                  1    0.1       1                2  0.709     0.291       0
4                  1    0.1       0                1  0.030     0.970       1
5                  1    0.1       0                1  0.030     0.970       1
6                  1    0.1       0                1  0.030     0.970       1
  mean_fish n_fish
1      0.35      0
2      1.92      3
3      1.50      0
4      1.92      1
5      4.80      3
6      0.90      0

A naïve Poisson model

Since the number of fish caught was simulated using the Poisson distribution, we might be tempted to fit a Poisson model:

library(glmmTMB)
mod_pois = glmmTMB(
  n_fish ~ season + time
    + raining + offset(log(time_spent)),
  family = poisson(), data = d_sim
)
summary(mod_pois)
 Family: poisson  ( log )
Formula:          n_fish ~ season + time + raining + offset(log(time_spent))
Data: d_sim

     AIC      BIC   logLik deviance df.resid 
  1080.5   1107.7   -533.2   1066.5      353 


Conditional model:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)   0.63285    0.09089   6.963 3.33e-12 ***
seasonSummer -0.53724    0.11155  -4.816 1.46e-06 ***
seasonFall   -0.28284    0.10507  -2.692  0.00711 ** 
seasonWinter -2.07011    0.19704 -10.506  < 2e-16 ***
timeMidday   -1.46368    0.16377  -8.938  < 2e-16 ***
timeEvening   0.56416    0.09311   6.059 1.37e-09 ***
raining      -0.50005    0.11455  -4.365 1.27e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

But this isn’t a correct model. We have many more zeros (no fish caught) than expected from a Poisson distribution. They are for the days Coral spent at home instead of going fishing.

So we have a mix of structural zeros (stayed at home) and random zeros (went fishing but were unlucky):

count(d_sim, fishing, n_fish)
   fishing n_fish  n
1        0      0 78
2        1      0 97
3        1      1 70
4        1      2 33
5        1      3 23
6        1      4 16
7        1      5 11
8        1      6  9
9        1      7 10
10       1      8  4
11       1      9  1
12       1     10  3
13       1     11  2
14       1     12  1
15       1     13  2

With the extra zeros, we have more variances that would be expected from a Poisson distribution; we have overdispersion:

library(DHARMa)
testDispersion(mod_pois)


    DHARMa nonparametric dispersion test via sd of residuals fitted vs.
    simulated

data:  simulationOutput
dispersion = 1.9157, p-value < 2.2e-16
alternative hypothesis: two.sided

A better Poisson model

Of course, we know if she went fishing or not, so we can simply remove the stay-at-home days:

d_fishing = filter(d_sim, fishing == 1)
mod_pois_fishing = update(mod_pois, data = d_fishing)
summary(mod_pois_fishing)
 Family: poisson  ( log )
Formula:          n_fish ~ season + time + raining + offset(log(time_spent))
Data: d_fishing

     AIC      BIC   logLik deviance df.resid 
   814.1    839.6   -400.0    800.1      275 


Conditional model:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)   0.70544    0.09218   7.653 1.96e-14 ***
seasonSummer -0.53455    0.11145  -4.796 1.62e-06 ***
seasonFall   -0.34512    0.10967  -3.147  0.00165 ** 
seasonWinter -2.14109    0.19912 -10.752  < 2e-16 ***
timeMidday   -1.47082    0.16426  -8.954  < 2e-16 ***
timeEvening   0.52537    0.09493   5.534 3.13e-08 ***
raining       0.61991    0.11983   5.173 2.30e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
testDispersion(mod_pois_fishing)


    DHARMa nonparametric dispersion test via sd of residuals fitted vs.
    simulated

data:  simulationOutput
dispersion = 1.0544, p-value = 0.64
alternative hypothesis: two.sided
Exercise 1

If Coral is fishing in rain, this should in theory double the expected number of fish she‘ll catch (compared to fishing when it’s not raining), i.e., increase the expected number of fish by a factor of 2. What is the estimate of this factor from the model.

A zero-inflated model

But what if Coral had forgotten to record whether she stayed at home or went fishing? Then we can use a zero-inflated model. This basically fits two models:

  • Zero-inflation model – a model for the probability (or odds) of having a structural zero (staying home)
  • Conditional model – a model for the outcome conditional on not having a structural zero (i.e., conditional on going fishing)

We can use the same predictors in both models or different predictors for the two models. The ziformula argument takes a one-sided formula for specifying the predictors in the zero-inflated (staying at home) part.

If we want to use the same predictors as in the conditional model, we can simply write ziformula = ~. (the offset() term is automatically dropped from the zero-inflation model):

mod_zi = glmmTMB(
  n_fish ~ season + time
    + raining + offset(log(time_spent)),
  ziformula = ~.,
  family = poisson(), data = d_sim
)
summary(mod_zi)
 Family: poisson  ( log )
Formula:          n_fish ~ season + time + raining + offset(log(time_spent))
Zero inflation:          ~.
Data: d_sim

     AIC      BIC   logLik deviance df.resid 
   935.3    989.7   -453.7    907.3      346 


Conditional model:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)   0.74433    0.09663   7.703 1.33e-14 ***
seasonSummer -0.54280    0.11505  -4.718 2.38e-06 ***
seasonFall   -0.36527    0.11407  -3.202  0.00136 ** 
seasonWinter -2.25310    0.22115 -10.188  < 2e-16 ***
timeMidday   -1.52500    0.17832  -8.552  < 2e-16 ***
timeEvening   0.49086    0.09942   4.937 7.92e-07 ***
raining       0.60214    0.12735   4.728 2.27e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Zero-inflation model:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)   -2.9478     0.8240  -3.577 0.000347 ***
seasonSummer  -0.4181     0.9587  -0.436 0.662794    
seasonFall    -1.0851     0.9026  -1.202 0.229254    
seasonWinter  -1.6069     1.1866  -1.354 0.175663    
timeMidday    -0.2583     0.8500  -0.304 0.761238    
timeEvening   -0.3238     0.5918  -0.547 0.584294    
raining        4.6145     1.0012   4.609 4.05e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The zero-inflation part uses a logit link (this can’t be changed), so it corresponds to a logistic model (but where we actually don’t observe the outcome!). The coefficients can be interpreted as for a logistic model.

Exercise 2

We see that only the effect of raining is statistically significant. Based on the model, estimate the probability of staying at home on a spring morning:

  1. when it’s not raining
  2. when it’s raining

Hint: You can extract the coefficients using fixef(mod_zi)$zi.

(And now we finally understand why we had to use $cond when extracting coefficients from models in earlier exercises. We were extracting coefficients for a conditional model.)

A simpler zero-inflated model

Exercise 3

We know that only rain affects the chance of staying at home. Fit a model where the zero-inflated part only has raining as a predictor. Store the model using the name mod_zi_rain. We’ll use this model for the rest of the exercises.

(Use the same set of predictors as before for the conditional part of the model.)

Note

Note that even though we didn’t make any changes to the conditional part of the model, the coefficient estimates did (slightly) change.

Comparing nested models

The two models mod_zi and mod_zi_rain are nested (the former can be turned into the latter by setting some of the coefficients to zero). That means that we can use a likelihood ratio test to compare the models. We’re then testing if we need to include season and time (five dummy/indicator variables) or if a model including an intercept and raining is sufficient. We can use the (misleadingly named) anova() function to perform likelihood ratio tests on nested models:

anova(mod_zi, mod_zi_rain)
Data: d_sim
Models:
mod_zi_rain: n_fish ~ season + time + raining + offset(log(time_spent)), zi=~raining, disp=~1
mod_zi: n_fish ~ season + time + raining + offset(log(time_spent)), zi=~., disp=~1
            Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)
mod_zi_rain  9 928.74 963.72 -455.37   910.74                         
mod_zi      14 935.32 989.72 -453.66   907.32 3.4267      5     0.6345

The high P-value (0.63) tells us to not reject the simpler model where only raining has an effect on the probability of staying at home.

Estimating a few more parameters

Exercise 4

Recall the d_season object:

d_season
  season mean_fish_season
1 Spring              2.0
2 Summer              1.0
3   Fall              1.5
4 Winter              0.3

Use the coefficient estimates from your model to estimate the above numbers/parameters.

Estimating expected values

Coral plans to spend the summer fishing in the evening for two hours. If she’s unlucky, it’ll rain. We can estimate the expected number of fish caught for a rainy day:

d_plan = data.frame(
  season = "Summer",
  time = "Evening",
  raining = 1,
  time_spent = 2
)
predict(mod_zi_rain, newdata = d_plan, type = "response")
[1] 2.626541
Exercise 5
  1. Estimate the expected number of fish caught from the conditional model. You can use the same predict() call; just change the type to "conditional".

  2. Interpret the estimate. (What are we estimating?)

  3. Why is this number much higher than the number we got when we used type = "response"?

Exercise 6
  1. Estimate the probability of staying at home in this situation. You can use the same predict() call; just change the type to zprob.

  2. Multiply your estimate in the previous exercise with ‘1 minus the probability’ from this exercise. The result should look familiar. Where have you seen this estimate before?

  3. The result is an estimate of something. Look back to the introduction, where we defined various expected values and probabilities. What is the theoretical number that the result is an estimate of?

Exercise 7

Repeat the calculations for the estimated expected number of fish caught using a) the unconditional model (type = "response") and b) the conditional model (type = "conditional"), for the same season and time of day, but assuming that it’s not raining.

The two expected values are now much more similar. Briefly explain why.

The optimal time to go fishing

For all combinations of seasons, times of day and rain / not rain, we can easily estimate the expected number of fish that Coral will catch if she spends one hour fishing:

library(marginaleffects)
d_pred = predictions(mod_zi_rain,
  newdata = datagrid(
    season = unique,
    time = unique,
    raining = unique,
    time_spent = 1
  ),
  type = "response", vcov = FALSE
)
d_pred

 season    time raining time_spent Estimate
 Spring Morning       0          1   2.0162
 Spring Morning       1          1   1.3710
 Spring Midday        0          1   0.4420
 Spring Midday        1          1   0.3005
 Spring Evening       0          1   3.3251
 Spring Evening       1          1   2.2611
 Summer Morning       0          1   1.1710
 Summer Morning       1          1   0.7963
 Summer Midday        0          1   0.2567
 Summer Midday        1          1   0.1746
 Summer Evening       0          1   1.9313
 Summer Evening       1          1   1.3133
 Fall   Morning       0          1   1.4192
 Fall   Morning       1          1   0.9651
 Fall   Midday        0          1   0.3111
 Fall   Midday        1          1   0.2116
 Fall   Evening       0          1   2.3405
 Fall   Evening       1          1   1.5916
 Winter Morning       0          1   0.2307
 Winter Morning       1          1   0.1569
 Winter Midday        0          1   0.0506
 Winter Midday        1          1   0.0344
 Winter Evening       0          1   0.3805
 Winter Evening       1          1   0.2587

Columns: rowid, estimate, n_fish, season, time, raining, time_spent 
Type:  response 
Exercise 8
  1. When (which season and time of day) and in which weather should she go fishing if she wants to maximise the expected number of fish caught?

  2. What is the expected number of fish caught at this timepoint?

Exercise 9

Coral discovers that she actually enjoys fishing in the rain. And she thinks fishing is good for her health, so from now on, she’ll go fishing even if it’s raining and even if she’s feeling ill …

Answer questions a) and b) from the last exercise based on Coral’s new behaviour.

Hint: To correctly answer the questions, you should rerun predictions() but change one of the arguments.