library(GLMsData)
library(glmmTMB)
data(quilpie)
Module 10: Tweedie models
Introduction
In this final module, we will look at a special case of the Tweedie model called the compound Poisson-Gamma mixture model.
We will also look at a few miscellaneous topics: the AIC, autocorrelation and some R-specific details on handling missing data.
The data
We have rainfall data for the town of Quilpie in Australia for the month of July from 1921 to 1988.
See ?quilpie for additional information. Here’s the rainfall plotted over time:
library(ggplot2)
ggplot(quilpie, aes(x=Year, y = Rain)) +
geom_line()
Quilpie has a dry climate, and for some years, there was no rain at all in July:
table(quilpie$Rain == 0)
FALSE TRUE
48 20
A simple Tweedie model
The Tweedie models supported by the glmmTMB package are compound Poisson-Gamma mixture models, where the outcome has a positive probability of taking on zero values, but is otherwise continuous.
We will fit the Tweedie distribution to the data (i.e., an intercept-only Tweedie regression model). It has three parameters, though only two are visible in the model output:
mod = glmmTMB(Rain ~ 1, data = quilpie,
              family = tweedie())
summary(mod)
Family: tweedie ( log )
Formula: Rain ~ 1
Data: quilpie
AIC BIC logLik deviance df.resid
484.5 491.2 -239.3 478.5 65
Dispersion parameter for tweedie family (): 7
Conditional model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.7984 0.1464 19.11 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The Tweedie parameters
We can easily extract the three Tweedie parameters:
- μ (mu): mean parameter
- ϕ (phi): dispersion parameter
- ξ (xi): power parameter
linkinv = mod$modelInfo$family$linkinv # = exp()
mu = linkinv(fixef(mod)$cond)
mu = unname(mu) # Get rid of the ‘(Intercept)’ name
phi = sigma(mod)
xi = family_params(mod)
The variance of the Tweedie distribution is \(\phi\mu^\xi\).
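As a quick sanity check (a sketch, using the parameters extracted above), we can compare the model-implied variance with the sample variance of the rainfall:

phi * mu^xi       # model-implied variance
var(quilpie$Rain) # sample variance (won't match exactly)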
The underlying Poisson and gamma parameters
It’s also possible to recover the parameters of the underlying Poisson and gamma distributions (see the video for details on the model formulation) using the following formulas:
\[\lambda = \frac{\mu^{2-\xi}}{\phi(2-\xi)}\] \[\text{shape} = \frac{2 - \xi}{\xi - 1}\] \[\text{rate} = \frac{1}{\phi (\xi - 1) \mu^{\xi - 1}}\]
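Translating these formulas into R (a sketch, using the mu, phi and xi extracted above):

lambda = mu^(2 - xi) / (phi * (2 - xi))   # Poisson mean: expected number of rain events
shape = (2 - xi) / (xi - 1)               # gamma shape for the size of each event
rate = 1 / (phi * (xi - 1) * mu^(xi - 1)) # gamma rate

lambda * shape / rate # sanity check: equals mu, the overall mean
exp(-lambda)          # P(no events) = P(zero rainfall); compare with the observed 20/68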
Does the Tweedie distribution fit the data?
For a model to fit the data, we require more than having the model just reproduce the mean and variance. The whole distribution of outcome values should be similar to the model-based distribution.
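One simple (sketch) check is to simulate new outcomes from the fitted model and compare a distributional feature, such as the proportion of zeros, with the observed data (the seed is arbitrary):

set.seed(1)
sims = simulate(mod, nsim = 1000) # one column per simulated dataset
mean(quilpie$Rain == 0)           # observed proportion of dry Julys
mean(unlist(sims) == 0)           # model-based proportion of zeros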
And of course we can also use our old friend, the DHARMa package, to confirm that the model fits the data:
library(DHARMa)
sim_resids = simulateResiduals(mod)
plot(sim_resids)
Adding a predictor
We will now examine the relationship between the July average Southern Oscillation Index (SOI) and rainfall. The SOI is an index based on the difference in air pressure between Tahiti and Darwin, Australia. Periods of negative SOI values indicate El Niño episodes, which are associated with reduced rainfall in parts of Australia (including Quilpie).
As always, we first start by plotting our data:
ggplot(quilpie, aes(x = SOI, y = Rain)) +
geom_point() +
geom_smooth(se = FALSE)
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
And indeed, low (and especially negative) values of SOI seem to be associated with little (or even zero) rainfall.
We’ll fit two simple Tweedie models, one with a linear and one with a quadratic term for the SOI predictor:
mod_lin = glmmTMB(Rain ~ SOI, data = quilpie,
                  family = tweedie())
mod_quad = glmmTMB(Rain ~ poly(SOI, 2), data = quilpie,
                   family = tweedie())
A model with the term poly(SOI, 2) is equivalent to SOI + I(SOI^2), but with a parameterisation that gives better numerical accuracy. If you don’t need coefficients that are easy to interpret, it’s best to use the poly() version. (The model fit and the predictions should be identical, except for numerical accuracy issues.)
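As a quick check (a sketch), we can confirm that the raw-polynomial version gives the same fitted values up to numerical accuracy:

mod_quad_raw = glmmTMB(Rain ~ SOI + I(SOI^2), data = quilpie,
                       family = tweedie())
all.equal(fitted(mod_quad), fitted(mod_quad_raw))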
The P-value for the quadratic term indicates that a linear model is not sufficient:
summary(mod_quad)
Family: tweedie ( log )
Formula: Rain ~ poly(SOI, 2)
Data: quilpie
AIC BIC logLik deviance df.resid
465.8 476.9 -227.9 455.8 63
Dispersion parameter for tweedie family (): 6.12
Conditional model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.4963 0.1553 16.071 < 2e-16 ***
poly(SOI, 2)1 7.4702 1.6362 4.566 4.98e-06 ***
poly(SOI, 2)2 -3.1534 1.3782 -2.288 0.0221 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note that the P-value is based on an (asymptotic) Wald test. It’s slightly better to run a likelihood ratio test (which we can do since the two models are nested, i.e., since mod_lin is a special case of mod_quad):
anova(mod_lin, mod_quad)
Data: quilpie
Models:
mod_lin: Rain ~ SOI, zi=~0, disp=~1
mod_quad: Rain ~ poly(SOI, 2), zi=~0, disp=~1
Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
mod_lin 4 469.29 478.17 -230.65 461.29
mod_quad 5 465.85 476.94 -227.92 455.85 5.4437 1 0.01964 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The conclusion is the same. The predictions also look reasonable:
library(marginaleffects)
plot_predictions(mod_quad,
condition = "SOI",
points = 1, vcov = FALSE
)
As do the DHARMa residuals for the quadratic model:
library(DHARMa)
plot(simulateResiduals(mod_quad))
Model comparisons and the likelihood
Whenever we add variables to a model, we will always get a fit that looks better, in that the likelihood of the model is higher (or, strictly speaking, at least as high).
This is because we fit the model by finding the set of parameters (coefficient estimates) that maximise the likelihood (or – equivalently – the log-likelihood). The simpler model is a special case of the model with the extra variables (the coefficients for the extra variables are simply set to zero in the simpler model). So adding extra variables can never decrease the (log) likelihood.
Here’s an example, where we add three noise variables (that have no real predictive power).
set.seed(1919)
quilpie$noise1 = rnorm(nrow(quilpie))
quilpie$noise2 = rnorm(nrow(quilpie))
quilpie$noise3 = rnorm(nrow(quilpie))
mod_noise = update(mod_quad, . ~ . + noise1 + noise2 + noise3)
summary(mod_noise)
Family: tweedie ( log )
Formula: Rain ~ poly(SOI, 2) + noise1 + noise2 + noise3
Data: quilpie
AIC BIC logLik deviance df.resid
468.6 486.4 -226.3 452.6 60
Dispersion parameter for tweedie family (): 5.99
Conditional model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.4744 0.1546 16.006 < 2e-16 ***
poly(SOI, 2)1 7.2087 1.6330 4.414 1.01e-05 ***
poly(SOI, 2)2 -3.0959 1.3673 -2.264 0.0236 *
noise1 0.1574 0.1751 0.899 0.3688
noise2 0.0315 0.1102 0.286 0.7750
noise3 0.1644 0.1284 1.281 0.2003
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The log-likelihood is higher for the noise model (note the minus signs: −226.3 is higher than −227.9):
logLik(mod_noise)
'log Lik.' -226.297 (df=8)
logLik(mod_quad)
'log Lik.' -227.9234 (df=5)
Of course, we typically require the increase in log-likelihood to be statistically significant, which we can assess with a likelihood ratio test:
anova(mod_quad, mod_noise)
Data: quilpie
Models:
mod_quad: Rain ~ poly(SOI, 2), zi=~0, disp=~1
mod_noise: Rain ~ poly(SOI, 2) + noise1 + noise2 + noise3, zi=~0, disp=~1
Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
mod_quad 5 465.85 476.94 -227.92 455.85
mod_noise 8 468.59 486.35 -226.30 452.59 3.2528 3 0.3543
And here, the increase was not statistically significant, so we would prefer the simpler model that doesn’t include the noise variables.
Model comparisons and the AIC
A likelihood ratio test requires one of the models to be nested inside the other – more complex – model, i.e., to be a special case of the more complex model.
But there’s another model comparison method we can use that is based on a different idea (information theory) and does not require nested models – the AIC (Akaike information criterion).
We simply take minus two times the log-likelihood and add two times the number of parameters:
\[AIC = -2\log L(\hat{\theta}) + 2p\] where L(θ) is the likelihood as a function of the parameter (vector) θ, θ̂ is its maximum likelihood estimate, and p is the number of parameters in the model. This can be used to compare a set of pre-specified but not necessarily nested models 1) fitted on the same data, and 2) with the same outcome variable. For the AIC, lower values are better (since there’s a minus sign in front of the log-likelihood).
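We can verify this formula by hand (a sketch) for the mod_quad model fitted above:

ll = logLik(mod_quad)                     # log-likelihood, with a 'df' attribute = number of parameters
-2 * as.numeric(ll) + 2 * attr(ll, "df") # manual AIC
AIC(mod_quad)                             # built-in AIC, should match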
The AIC has the property that asymptotically, if we select the model with the lowest AIC, we will select the best model (for a specific definition of ‘best’, which we’ll not go into).
The AIC can sometimes be negative, which can be confusing. Just to be clear: With ‘the model with the lowest AIC’, we mean the model with the AIC value that is closest to minus infinity (not necessarily closest to zero).
By fitting a more complex model, we will be increasing the likelihood. But the AIC also includes a penalty term for the number of parameters. By fitting a more complex model, the 2p term will also increase, which may overall increase the AIC (unless the added complexity is really justified). The AIC thus gives us a tradeoff between a better fit to the actual data (a higher likelihood) and overfitting (a model that doesn’t generalise to new datasets).
Here’s an example. We fit a tenth-degree polynomial:
mod_ten = glmmTMB(
  Rain ~ poly(SOI, 10),
  data = quilpie,
  family = tweedie()
)
This is clearly an overfitted model:
plot_predictions(mod_ten, condition = "SOI",
points = 1, vcov = FALSE)
The log-likelihood is higher in the complex model, but the AIC indicates that the simple quadratic model is the better of the two, as it has the lower AIC:
logLik(mod_quad)
'log Lik.' -227.9234 (df=5)
logLik(mod_ten)
'log Lik.' -225.5143 (df=13)
AIC(mod_quad, mod_ten)
df AIC
mod_quad 5 465.8468
mod_ten 13 477.0286
While the AIC is lower for the simpler model, this doesn’t mean that the simpler model is a good model. In theory, both models could be bad (ill-fitting). The AIC is only a relative measure of model fit.
Alternative model – zero-inflated gamma
Perhaps an alternative to the Tweedie model could be a zero-inflated gamma model? After all, this should be able to handle both zero values and positive values with a skewed distribution (positive values for the gamma submodel, and zero values for the zero-inflation submodel).
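The fitting code isn’t shown here, but a minimal sketch could look like this (the name mod_zi_simple is the one used in the comparisons below, and the quadratic SOI term from mod_quad is an assumption). ziGamma() is glmmTMB’s gamma family variant that allows zeros in the response when combined with a zero-inflation formula:

mod_zi_simple = glmmTMB(Rain ~ poly(SOI, 2),
                        ziformula = ~ 1,    # constant zero-inflation probability
                        family = ziGamma(), # default inverse link
                        data = quilpie)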
When fitting the models, you might encounter some In (function (start, objective, gradient = NULL, hessian = NULL, ...): NA/NaN function evaluation warnings. They can safely be ignored, as the model does converge. While trying to find the optimal set of parameter estimates, the fitting procedure tried some invalid values, which gave the value NA for the log-likelihood. But eventually, it converged to the optimal ones.
For real convergence problems, which can sometimes occur, the Troubleshooting with glmmTMB vignette and the diagnose() function can be useful.
Comparing the models using the AIC
The mod_quad model is not nested in the mod_zi_simple model (or the other way around), so we can’t use a likelihood ratio test to compare the two models. But we can use the AIC.
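A sketch, assuming the mod_zi_simple model from above:

AIC(mod_quad, mod_zi_simple)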
Comparing the models using DHARMa residuals
The Tweedie model uses a log link by default, while the zero-inflated gamma model uses an inverse link. Perhaps the zero-inflated gamma model would fit the data better if we used a log link?
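A sketch of such a model and its DHARMa residuals (again assuming the quadratic SOI term):

mod_zi_log = glmmTMB(Rain ~ poly(SOI, 2),
                     ziformula = ~ 1,
                     family = ziGamma(link = "log"),
                     data = quilpie)
plot(simulateResiduals(mod_zi_log))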
Now we’re comparing many models. The AIC should preferably only be used for comparing a small set of pre-specified models.
Comparing likelihood ratio tests and the AIC
Assume we have a simple model nested in a more complex model. Then we can use either a likelihood ratio test or model selection using the AIC.
For the complex model to be preferred based on the AIC, the increase in log-likelihood must exceed the increase in the number of parameters. So if the complex model contains just one additional variable (parameter), the increase in twice the log-likelihood must be greater than 2.
Differences in twice the log-likelihoods for models estimated using maximum likelihood have (asymptotically) a chi-squared distribution with degrees of freedom equal to the difference in the number of parameters for the two models. (This is what we use in likelihood ratio tests.) We see that when comparing our two models using AIC (where the complex model has one extra variable), this corresponds to a likelihood ratio test with the following significance level:
pchisq(2, df = 1, lower.tail = FALSE)
[1] 0.1572992
So choosing between the two models using the AIC is equivalent to using a likelihood ratio test with a significance level of 0.16 instead of the traditional 0.05. So the AIC might very well choose a more complex model than the likelihood ratio test.
(For complex models with even more variables, the corresponding ‘significance level’ will be lower, and will eventually drop below even 0.05.)
AIC can be used to compare models with different predictors, different distribution families and different link functions. However, all models have to be fitted on the same dataset and have the same outcome variable.
Take the smoking and lung capacity dataset of modules 2 and 3 as an example:
- We can’t use the AIC to compare a model fitted on the data for girls to a model fitted on the data for boys (to check if the models fit better for girls than for boys).
- And we can’t use it to compare the model that used FEV as the outcome with a similar model that used log-FEV as the outcome.
- And if we in one model include an additional variable that has some missing values (NA values), the observations with missing values will automatically be excluded in the model-fitting process, so the model will be fitted on a different subset of observations. This again means that we can’t compare this model to other models fitted without the missing data. (We can fix this latter issue by first creating a dataset that only includes variables that have complete data for all the models that we want to compare, as sketched below.)
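A minimal sketch of that fix, with a hypothetical data frame dat and variables y, x1 and x2:

dat_complete = na.omit(dat[, c("y", "x1", "x2")]) # keep only the modelling variables, drop incomplete rows
# All candidate models are then fitted on dat_complete and can be compared using the AIC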
Different model-fitting packages might also define the AIC differently (by dropping different sets of constants from the likelihood function). This means that in general, we can’t compare AIC values for models fitted using different model-fitting functions. But it’s always safe to compare models fitted using glmmTMB().
A different predictor
We said earlier that periods of negative SOI values might ‘predict’ reduced rainfall. The quilpie dataset also includes a variable Phase that summarises the SOI pattern seen during the previous few months (including July itself).
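A sketch of a model using this predictor (treating Phase as categorical is an assumption; see ?quilpie for how it’s coded):

mod_phase = glmmTMB(Rain ~ factor(Phase), data = quilpie,
                    family = tweedie())
summary(mod_phase)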
Temporal correlation (autocorrelation)
Our data have been recorded over time, so they are really time series data. This means that the important assumption that the individual observations are independent might not be fulfilled.
This will be a problem even for the simple model without any predictors! We can test for temporal correlation (autocorrelation) using a different function in the DHARMa package:
sim_resids = simulateResiduals(mod)
testTemporalAutocorrelation(sim_resids, quilpie$Year)
Durbin-Watson test
data: simulationOutput$scaledResiduals ~ 1
DW = 2.4628, p-value = 0.05212
alternative hypothesis: true autocorrelation is not 0
The low P-value indicates that we may have some autocorrelation (mostly at lag 1, which means that the July data for one year is correlated with July data for the previous year, but not really with data from two years ago).
If the autocorrelation is caused by a trend (e.g., increasing or decreasing rainfall over time), it might help to just add Year as a predictor (here, it doesn’t). Adding other predictors might also help.
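A sketch of that check: add Year to the intercept-only model and re-run the test:

mod_year = update(mod, . ~ . + Year)
testTemporalAutocorrelation(simulateResiduals(mod_year), quilpie$Year)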
If adding more predictors doesn’t help, we might have to add a correlation structure that takes autocorrelation into account. This can get complicated, but one way of doing this is briefly mentioned in the videos for module 6.
The take-home message is: whenever we have data that is measured over time, autocorrelation might be a problem. Test for it using the testTemporalAutocorrelation() function. And if there’s autocorrelation, try to fix it.
The End
Thanks for finishing all of these exercises – phew! 😃
Now go out and fit glmmTMB models to your own data!