= c("Spring", "Summer", "Fall", "Winter")
seasons = data.frame(
d_season season = seasons,
mean_fish_season = c(2, 1, 1.5, 0.3)
)= c("Morning", "Midday", "Evening")
times = data.frame(
d_time time = times,
time_effect_fish = c(1, 1 / 4, 1.6)
)= 0.03
p_ill = .1
p_rain_spring = .7
p_home_if_raining = 2
rain_effect = data.frame(
d_rain season = seasons,
season_effect_rain = c(1, 1.5, 4, 3)
)
Module 7: Zero inflation
Introduction
Coral is a keen angler and goes fishing every day. Well, except if it’s raining; then she maybe stays at home. Or if she’s ill (which rarely happens).
We’ll simulate the number of fish she catches each day for a year and fit a zero-inflated model.
The parameters
In spring, she will on average catch 2 fish per hour spent fishing. In the summer, only 1, in the fall, 1.5 and in the winter, only 0.3.
Well, this is if she goes fishing in the morning. Midday, she’ll only catch on average ¼ the number of fish. On the other hand, in the evening, she’ll catch 60% more.
She’s targeting fish that are more active when it’s raining, so if she’s out fishing in the rain, she’ll double the (expected) number of fish she’ll catch. But when it’s raining, she actually prefers to stay at home. With probability 70%, she’ll stay home. On any day, she also has a 3% chance of getting ill, and then she’ll spend the day in bed.
The chance of rain also varies between seasons:
- Spring: 10%
- Summer: 15%
- Fall: 40%
- Winter: 30%
Simulating the data
Let’s set up some data frames for storing the various parameters and effects.
We’ll simulate about a year’s worth of data. We’ll assume that she chooses a random time of day to go fishing and that she fishes for 30 minutes to 2.5 hours each time she goes fishing.
(Since she simply loves fishing, and would rather spend the whole day on the sea, the time she actually spends is based on how much time she has available, not on the number of fish she catches.)
set.seed(100)
= 90
days_per_season = data.frame(season = rep(seasons, each = days_per_season),
d_days time = sample(times, 4*days_per_season, replace =TRUE),
time_spent = round(runif(4*days_per_season, .5, 2.5), 1))
library(dplyr)
= d_days %>%
d_exp left_join(d_season) %>%
left_join(d_time) %>%
left_join(d_rain)
= d_exp %>%
d_sim mutate(season=factor(season, levels=seasons),
time=factor(time, levels=times),
p_rain = p_rain_spring*season_effect_rain,
raining = rbinom(nrow(.), 1, p_rain),
rain_effect_fish = if_else(raining==1, rain_effect, 1),
p_home = p_ill + (1-p_ill)*raining*p_home_if_raining,
p_fishing = 1-p_home,
fishing = rbinom(nrow(.), 1, p_fishing),
mean_fish = mean_fish_season*time_effect_fish*rain_effect_fish*time_spent,
n_fish = fishing*rpois(nrow(.), mean_fish)
)head(d_sim)
season time time_spent mean_fish_season time_effect_fish
1 Spring Midday 0.7 2 0.25
2 Spring Evening 0.6 2 1.60
3 Spring Midday 1.5 2 0.25
4 Spring Evening 0.6 2 1.60
5 Spring Morning 2.4 2 1.00
6 Spring Midday 1.8 2 0.25
season_effect_rain p_rain raining rain_effect_fish p_home p_fishing fishing
1 1 0.1 0 1 0.030 0.970 1
2 1 0.1 0 1 0.030 0.970 1
3 1 0.1 1 2 0.709 0.291 0
4 1 0.1 0 1 0.030 0.970 1
5 1 0.1 0 1 0.030 0.970 1
6 1 0.1 0 1 0.030 0.970 1
mean_fish n_fish
1 0.35 0
2 1.92 3
3 1.50 0
4 1.92 1
5 4.80 3
6 0.90 0
A naïve Poisson model
Since the number of fish caught was simulated using the Poisson distribution, we might be tempted to fit a Poisson model:
library(glmmTMB)
= glmmTMB(
mod_pois ~ season + time
n_fish + raining + offset(log(time_spent)),
family = poisson(), data = d_sim
)summary(mod_pois)
Family: poisson ( log )
Formula: n_fish ~ season + time + raining + offset(log(time_spent))
Data: d_sim
AIC BIC logLik deviance df.resid
1080.5 1107.7 -533.2 1066.5 353
Conditional model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.63285 0.09089 6.963 3.33e-12 ***
seasonSummer -0.53724 0.11155 -4.816 1.46e-06 ***
seasonFall -0.28284 0.10507 -2.692 0.00711 **
seasonWinter -2.07011 0.19704 -10.506 < 2e-16 ***
timeMidday -1.46368 0.16377 -8.938 < 2e-16 ***
timeEvening 0.56416 0.09311 6.059 1.37e-09 ***
raining -0.50005 0.11455 -4.365 1.27e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
But this isn’t a correct model. We have many more zeros (no fish caught) than expected from a Poisson distribution. They are for the days Coral spent at home instead of going fishing.
So we have a mix of structural zeros (stayed at home) and random zeros (went fishing but were unlucky):
count(d_sim, fishing, n_fish)
fishing n_fish n
1 0 0 78
2 1 0 97
3 1 1 70
4 1 2 33
5 1 3 23
6 1 4 16
7 1 5 11
8 1 6 9
9 1 7 10
10 1 8 4
11 1 9 1
12 1 10 3
13 1 11 2
14 1 12 1
15 1 13 2
With the extra zeros, we have more variances that would be expected from a Poisson distribution; we have overdispersion:
library(DHARMa)
testDispersion(mod_pois)
DHARMa nonparametric dispersion test via sd of residuals fitted vs.
simulated
data: simulationOutput
dispersion = 1.9157, p-value < 2.2e-16
alternative hypothesis: two.sided
A better Poisson model
Of course, we know if she went fishing or not, so we can simply remove the stay-at-home days:
= filter(d_sim, fishing == 1)
d_fishing = update(mod_pois, data = d_fishing)
mod_pois_fishing summary(mod_pois_fishing)
Family: poisson ( log )
Formula: n_fish ~ season + time + raining + offset(log(time_spent))
Data: d_fishing
AIC BIC logLik deviance df.resid
814.1 839.6 -400.0 800.1 275
Conditional model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.70544 0.09218 7.653 1.96e-14 ***
seasonSummer -0.53455 0.11145 -4.796 1.62e-06 ***
seasonFall -0.34512 0.10967 -3.147 0.00165 **
seasonWinter -2.14109 0.19912 -10.752 < 2e-16 ***
timeMidday -1.47082 0.16426 -8.954 < 2e-16 ***
timeEvening 0.52537 0.09493 5.534 3.13e-08 ***
raining 0.61991 0.11983 5.173 2.30e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
testDispersion(mod_pois_fishing)
DHARMa nonparametric dispersion test via sd of residuals fitted vs.
simulated
data: simulationOutput
dispersion = 1.0544, p-value = 0.64
alternative hypothesis: two.sided
A zero-inflated model
But what if Coral had forgotten to record whether she stayed at home or went fishing? Then we can use a zero-inflated model. This basically fits two models:
- Zero-inflation model – a model for the probability (or odds) of having a structural zero (staying home)
- Conditional model – a model for the outcome conditional on not having a structural zero (i.e., conditional on going fishing)
We can use the same predictors in both models or different predictors for the two models. The ziformula
argument takes a one-sided formula for specifying the predictors in the zero-inflated (staying at home) part.
If we want to use the same predictors as in the conditional model, we can simply write ziformula = ~.
(the offset()
term is automatically dropped from the zero-inflation model):
= glmmTMB(
mod_zi ~ season + time
n_fish + raining + offset(log(time_spent)),
ziformula = ~.,
family = poisson(), data = d_sim
)summary(mod_zi)
Family: poisson ( log )
Formula: n_fish ~ season + time + raining + offset(log(time_spent))
Zero inflation: ~.
Data: d_sim
AIC BIC logLik deviance df.resid
935.3 989.7 -453.7 907.3 346
Conditional model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.74433 0.09663 7.703 1.33e-14 ***
seasonSummer -0.54280 0.11505 -4.718 2.38e-06 ***
seasonFall -0.36527 0.11407 -3.202 0.00136 **
seasonWinter -2.25310 0.22115 -10.188 < 2e-16 ***
timeMidday -1.52500 0.17832 -8.552 < 2e-16 ***
timeEvening 0.49086 0.09942 4.937 7.92e-07 ***
raining 0.60214 0.12735 4.728 2.27e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Zero-inflation model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.9478 0.8240 -3.577 0.000347 ***
seasonSummer -0.4181 0.9587 -0.436 0.662794
seasonFall -1.0851 0.9026 -1.202 0.229254
seasonWinter -1.6069 1.1866 -1.354 0.175663
timeMidday -0.2583 0.8500 -0.304 0.761238
timeEvening -0.3238 0.5918 -0.547 0.584294
raining 4.6145 1.0012 4.609 4.05e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The zero-inflation part uses a logit link (this can’t be changed), so it corresponds to a logistic model (but where we actually don’t observe the outcome!). The coefficients can be interpreted as for a logistic model.
A simpler zero-inflated model
Note that even though we didn’t make any changes to the conditional part of the model, the coefficient estimates did (slightly) change.
Comparing nested models
The two models mod_zi
and mod_zi_rain
are nested (the former can be turned into the latter by setting some of the coefficients to zero). That means that we can use a likelihood ratio test to compare the models. We’re then testing if we need to include season and time (five dummy/indicator variables) or if a model including an intercept and raining
is sufficient. We can use the (misleadingly named) anova()
function to perform likelihood ratio tests on nested models:
anova(mod_zi, mod_zi_rain)
Data: d_sim
Models:
mod_zi_rain: n_fish ~ season + time + raining + offset(log(time_spent)), zi=~raining, disp=~1
mod_zi: n_fish ~ season + time + raining + offset(log(time_spent)), zi=~., disp=~1
Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
mod_zi_rain 9 928.74 963.72 -455.37 910.74
mod_zi 14 935.32 989.72 -453.66 907.32 3.4267 5 0.6345
The high P-value (0.63) tells us to not reject the simpler model where only raining
has an effect on the probability of staying at home.
Estimating a few more parameters
Estimating expected values
Coral plans to spend the summer fishing in the evening for two hours. If she’s unlucky, it’ll rain. We can estimate the expected number of fish caught for a rainy day:
= data.frame(
d_plan season = "Summer",
time = "Evening",
raining = 1,
time_spent = 2
)predict(mod_zi_rain, newdata = d_plan, type = "response")
[1] 2.626541
The optimal time to go fishing
For all combinations of seasons, times of day and rain / not rain, we can easily estimate the expected number of fish that Coral will catch if she spends one hour fishing:
library(marginaleffects)
= predictions(mod_zi_rain,
d_pred newdata = datagrid(
season = unique,
time = unique,
raining = unique,
time_spent = 1
),type = "response", vcov = FALSE
) d_pred
season time raining time_spent Estimate
Spring Morning 0 1 2.0162
Spring Morning 1 1 1.3710
Spring Midday 0 1 0.4420
Spring Midday 1 1 0.3005
Spring Evening 0 1 3.3251
Spring Evening 1 1 2.2611
Summer Morning 0 1 1.1710
Summer Morning 1 1 0.7963
Summer Midday 0 1 0.2567
Summer Midday 1 1 0.1746
Summer Evening 0 1 1.9313
Summer Evening 1 1 1.3133
Fall Morning 0 1 1.4192
Fall Morning 1 1 0.9651
Fall Midday 0 1 0.3111
Fall Midday 1 1 0.2116
Fall Evening 0 1 2.3405
Fall Evening 1 1 1.5916
Winter Morning 0 1 0.2307
Winter Morning 1 1 0.1569
Winter Midday 0 1 0.0506
Winter Midday 1 1 0.0344
Winter Evening 0 1 0.3805
Winter Evening 1 1 0.2587
Columns: rowid, estimate, n_fish, season, time, raining, time_spent
Type: response