Module 10: Tweedie models

Author

Karl Ove Hufthammer

Introduction

In this final module, we will look at a special case of the Tweedie model called the compound Poisson-Gamma mixture model.

We will also look at a few miscellaneous topics: the AIC, autocorrelation and some R-specific details on handling missing data.

The data

We have rainfall data for the town of Quilpie in Australia for the month of July from 1921 to 1988.

library(GLMsData)
library(glmmTMB)
data(quilpie)

See ?quilpie for additional information. Here’s the rainfall plotted over time:

library(ggplot2)
ggplot(quilpie, aes(x = Year, y = Rain)) +
  geom_line()

Quilpie has a dry climate, and for some years, there was no rain at all in July:

table(quilpie$Rain == 0)

FALSE  TRUE 
   48    20 

A simple Tweedie model

The Tweedie models supported by the glmmTMB package are compound Poisson-Gamma mixture models, where the outcome has a positive probability of taking on zero values, but is otherwise continuous.
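
We can see this point mass at zero directly in our data; as the table above showed, the July rainfall was exactly zero in 20 of the 68 years:

mean(quilpie$Rain == 0)

[1] 0.2941176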

We will fit the Tweedie distribution to the data (i.e., an intercept-only Tweedie regression model). It has three parameters, though only two are visible in the model output:

mod = glmmTMB(Rain ~ 1, data = quilpie,
              family = tweedie())
summary(mod)
 Family: tweedie  ( log )
Formula:          Rain ~ 1
Data: quilpie

     AIC      BIC   logLik deviance df.resid 
   484.5    491.2   -239.3    478.5       65 


Dispersion parameter for tweedie family ():    7 

Conditional model:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   2.7984     0.1464   19.11   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The Tweedie parameters

We can easily extract the three Tweedie parameters:

  • μ (mu): mean parameter
  • ϕ (phi): dispersion parameter
  • ξ (xi): power parameter

linkinv = mod$modelInfo$family$linkinv # = exp()
mu = linkinv(fixef(mod)$cond)
mu = unname(mu) # Get rid of the ‘(Intercept)’ name
phi = sigma(mod)
xi = family_params(mod)

The variance of the Tweedie distribution is \(\phi\mu^\xi\).

Exercise 1

Calculate the sample mean and sample variance of the July rainfall values using mean() and var().

Calculate the model-based estimates of the mean and variance and confirm that they are close to their sample counterparts.

The underlying Poisson and gamma parameters

It’s also possible to recover the parameters for the underlying Poisson and gamma distributions (see the video for details on the model formulation) using the following formulas:

\[\lambda = \frac{\mu^{2-\xi}}{\phi(2-\xi)}\] \[\text{shape} = \frac{2 - \xi}{\xi - 1}\] \[\text{rate} = \frac{1}{\phi (\xi - 1) \mu^{\xi - 1}}\]

Exercise 2

Use the above formulas to calculate the three parameters and store them as the variables lambda, shape and rate. You can then use the following R code to simulate 1,000 observations from the distribution:

sim_tweedie = function(lambda, shape, rate) {
  n_pois = rpois(1, lambda)
  x_gamma = rgamma(n_pois, shape, rate)
  sum(x_gamma)
}

set.seed(1919) # Reproducible random numbers
y = replicate(1000, sim_tweedie(lambda, shape, rate))

Exercise 3

Confirm that the sample mean and sample variance of the simulated values are close to their theoretical values (the mean and variance of the Tweedie distribution we simulated data from).

Does the Tweedie distribution fit the data?

For a model to fit the data, it must do more than just reproduce the mean and variance: the whole distribution of outcome values should be similar to the model-based distribution.

Exercise 4

Plot a histogram of the Quilpie rainfall data and of the simulated rainfall data to confirm that the whole distribution is similar.

And of course we can also use our old friend, the DHARMa package, to confirm that the model fits the data:

library(DHARMa)
sim_resids = simulateResiduals(mod)
plot(sim_resids)

Adding a predictor

We will now examine the relationship between the July average Southern Oscillation Index (SOI) and rainfall. The SOI is an index based on the difference in air pressure between Tahiti and Darwin, Australia. Periods of negative SOI values indicate El Niño episodes, which are associated with reduced rainfall in parts of Australia (including Quilpie).

As always, we first start by plotting our data:

ggplot(quilpie, aes(x = SOI, y = Rain)) +
  geom_point() +
  geom_smooth(se = FALSE)
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

And indeed, low (and especially negative) values of SOI seem to be associated with little (or even zero) rainfall.

We’ll fit two simple Tweedie models, one with a linear and one with a quadratic term for the SOI predictor:

mod_lin = glmmTMB(Rain ~ SOI, data = quilpie,
                  family = tweedie())
mod_quad = glmmTMB(Rain ~ poly(SOI, 2), data = quilpie,
                   family = tweedie())

Note

A model with the term poly(SOI, 2) is equivalent to SOI + I(SOI^2), but with a parameterisation that gives better numerical accuracy. If you don’t need coefficients that are easy to interpret, it’s best to use the poly() version. (The model fit and the predictions should be identical, except for numerical accuracy issues.)
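
A quick check of this equivalence (a sketch; mod_raw is a helper model not used elsewhere):

mod_raw = glmmTMB(Rain ~ SOI + I(SOI^2), data = quilpie,
                  family = tweedie())
all.equal(fitted(mod_quad), fitted(mod_raw)) # should be TRUE (up to tolerance)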

The P-value for the quadratic term indicates that a linear model is not sufficient:

summary(mod_quad)
 Family: tweedie  ( log )
Formula:          Rain ~ poly(SOI, 2)
Data: quilpie

     AIC      BIC   logLik deviance df.resid 
   465.8    476.9   -227.9    455.8       63 


Dispersion parameter for tweedie family (): 6.12 

Conditional model:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)     2.4963     0.1553  16.071  < 2e-16 ***
poly(SOI, 2)1   7.4702     1.6362   4.566 4.98e-06 ***
poly(SOI, 2)2  -3.1534     1.3782  -2.288   0.0221 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Note that the P-value is based on an (asymptotic) Wald test. It’s slightly better to run a likelihood ratio test (which we can do since the two models are nested, i.e., since mod_lin is a special case of mod_quad):

anova(mod_lin, mod_quad)
Data: quilpie
Models:
mod_lin: Rain ~ SOI, zi=~0, disp=~1
mod_quad: Rain ~ poly(SOI, 2), zi=~0, disp=~1
         Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)  
mod_lin   4 469.29 478.17 -230.65   461.29                           
mod_quad  5 465.85 476.94 -227.92   455.85 5.4437      1    0.01964 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The conclusion is the same. The predictions also look reasonable:

library(marginaleffects)
plot_predictions(mod_quad,
  condition = "SOI",
  points = 1, vcov = FALSE
)

As do the DHARMa residuals for the quadratic model:

library(DHARMa)
plot(simulateResiduals(mod_quad))

Exercise 5

Do the DHARMa residuals for the linear (mod_lin) model also look reasonable?

Model comparisons and the likelihood

Whenever we add variables to a model, we will always get a fit that looks better, in that the likelihood of the model is higher.

This is because we fit the model by finding the set of parameters (coefficient estimates) that maximise the likelihood (or – equivalently – the log-likelihood). The simpler model is a special case of the model with the extra variables (the coefficients for the extra variables are simply set to zero in the simpler model). So adding extra variables can never decrease the (log) likelihood.

Here’s an example where we add three noise variables (which have no real predictive power):

set.seed(1919)
quilpie$noise1 = rnorm(nrow(quilpie))
quilpie$noise2 = rnorm(nrow(quilpie))
quilpie$noise3 = rnorm(nrow(quilpie))
mod_noise = update(mod_quad, . ~ . + noise1 + noise2 + noise3)
summary(mod_noise)
 Family: tweedie  ( log )
Formula:          Rain ~ poly(SOI, 2) + noise1 + noise2 + noise3
Data: quilpie

     AIC      BIC   logLik deviance df.resid 
   468.6    486.4   -226.3    452.6       60 


Dispersion parameter for tweedie family (): 5.99 

Conditional model:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)     2.4744     0.1546  16.006  < 2e-16 ***
poly(SOI, 2)1   7.2087     1.6330   4.414 1.01e-05 ***
poly(SOI, 2)2  -3.0959     1.3673  -2.264   0.0236 *  
noise1          0.1574     0.1751   0.899   0.3688    
noise2          0.0315     0.1102   0.286   0.7750    
noise3          0.1644     0.1284   1.281   0.2003    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The log-likelihood is higher for the noise model (note that both log-likelihoods are negative, so −226.3 is the higher value):

logLik(mod_noise)
'log Lik.' -226.297 (df=8)
logLik(mod_quad)
'log Lik.' -227.9234 (df=5)

Of course, we typically require the increase in likelihood to be statistically significant, which we can check with a likelihood ratio test:

anova(mod_quad, mod_noise)
Data: quilpie
Models:
mod_quad: Rain ~ poly(SOI, 2), zi=~0, disp=~1
mod_noise: Rain ~ poly(SOI, 2) + noise1 + noise2 + noise3, zi=~0, disp=~1
          Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)
mod_quad   5 465.85 476.94 -227.92   455.85                         
mod_noise  8 468.59 486.35 -226.30   452.59 3.2528      3     0.3543

And here, the increase was not statistically significant, so we would prefer the simpler model that doesn’t include the noise variables.

Model comparisons and the AIC

A likelihood ratio test requires one of the models to be nested inside the other, i.e., to be a special case of the more complex model.

But there’s another model comparison method we can use that is based on a different idea (information theory) and does not require nested models – the AIC (Akaike information criterion).

We simply take minus two times the maximised log-likelihood and add two times the number of parameters:

\[AIC = -2\log L(\hat{\theta}) + 2p\] where L(θ̂) is the likelihood evaluated at the estimated parameter vector θ̂, and p is the number of parameters in the model. This can be used to compare a set of pre-specified but not necessarily nested models 1) fitted on the same data, and 2) with the same outcome variable. For AIC, lower values are better (since there’s a minus in front of the log-likelihood).
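
As a sanity check, we can compute the AIC ‘by hand’ for the mod_quad model fitted above (the number of parameters p is the ‘df’ value reported by logLik()):

ll = logLik(mod_quad)
-2 * as.numeric(ll) + 2 * attr(ll, "df")

[1] 465.8468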

The AIC has the property that asymptotically, if we select the model with the lowest AIC, we will select the best model (for a specific definition of ‘best’, which we’ll not go into).

Note

The AIC can sometimes be negative, which can be confusing. Just to be clear: With ‘the model with the lowest AIC’, we mean the model with the AIC value that is closest to minus infinity (not necessarily closest to zero).

Fitting a more complex model increases the likelihood, but the AIC also includes a penalty term for the number of parameters: the 2p term increases as well, which may increase the AIC overall (unless the added complexity is really justified). The AIC thus gives us a tradeoff between a better fit to the actual data (a higher likelihood) and overfitting (a model that doesn’t generalise to new datasets).

Here’s an example. We fit a tenth-degree polynomial:

mod_ten = glmmTMB(
  Rain ~ poly(SOI, 10),
  data = quilpie,
  family = tweedie()
)

This is clearly an overfitted model:

plot_predictions(mod_ten, condition = "SOI",
                 points = 1, vcov = FALSE)

The log-likelihood is higher in the complex model, but the AIC indicates that the simple quadratic model is the better of the two, as it has the lower AIC:

logLik(mod_quad)
'log Lik.' -227.9234 (df=5)
logLik(mod_ten)
'log Lik.' -225.5143 (df=13)
AIC(mod_quad, mod_ten)
         df      AIC
mod_quad  5 465.8468
mod_ten  13 477.0286

Note

While the AIC is lower for the simpler model, this doesn’t mean that the simpler model is a good model. In theory, both models could be bad (ill-fitting). The AIC is only a relative measure of model fit.

Alternative model – zero-inflated gamma

Perhaps an alternative to the Tweedie model could be a zero-inflated gamma model? After all, this should be able to handle both zero values and positive values with a skewed distribution (positive values for the gamma submodel, and zero values for the zero-inflation submodel).

Exercise 6

Fit two such models, one with just an intercept (~1) for the zero-inflation submodel, and one with a second-degree polynomial. (For the mean/continuous submodels, use the same formula as for the Tweedie model mod_quad.) Call the two models mod_zi_simple and mod_zi_complex. Use the summary() function to show the parameter estimates.

Hint: See the exercises in module 7 for details on how to fit zero-inflated models. For a zero-inflated gamma model, you have to use the ziGamma() family function (not the Gamma() function).

Note

When fitting the models, you might encounter warnings of the form In (function (start, objective, gradient = NULL, hessian = NULL, ... : NA/NaN function evaluation. These can safely be ignored, as the model does converge.

While trying to find the optimal set of parameter estimates, the fitting procedure tried some invalid values, which gave the value NA for the log-likelihood. But eventually, it converged to the optimal ones.

For real convergence problems, which can sometimes occur, the Troubleshooting with glmmTMB vignette and the diagnose() function can be useful.
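
For example (a sketch, assuming a recent glmmTMB version):

diagnose(mod_quad) # prints possible causes of convergence problems, if any
vignette("troubleshooting", package = "glmmTMB") # opens the vignette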

Comparing the models using the AIC

The mod_quad model is not nested in the mod_zi_simple model (or the other way around), so we can’t use a likelihood ratio test to compare the two models. But we can use the AIC.

Exercise 7

Use the AIC values for mod_quad, mod_zi_simple and mod_zi_complex to decide which of the three models you should prefer.

Comparing the models using DHARMa residuals

Exercise 8

Also compare the DHARMa residuals. Based on them, which model fits the data adequately?

The Tweedie model uses a log link by default, while the zero-inflated gamma model uses an inverse link. Perhaps the zero-inflated gamma model would fit the data better if we used a log link?

Exercise 9

Refit the two zero-inflated gamma models, but now using the log link.

  1. Based on the AIC values, which link function seems to fit the data best?

  2. Based on the AIC values, which of the five models (mod_quad and the four zero-inflated models) seem to fit the data best?

Note

Now we’re comparing many models. The AIC should preferably only be used for comparing a small set of pre-specified models.

Comparing likelihood ratio tests and the AIC

Assume we have a simple model nested in a more complex model. Then we can use either a likelihood ratio test or model selection using the AIC.

For the complex model to be preferred based on the AIC, the increase in log-likelihood must exceed the increase in the number of parameters. So if the complex model contains just one additional variable (parameter), the increase in twice the log-likelihood must be greater than 2.

Differences in twice the log-likelihoods for models estimated using maximum likelihood have (asymptotically) a chi-squared distribution with degrees of freedom equal to the difference in the number of parameters for the two models. (This is what we use in likelihood ratio tests.) We see that when comparing our two models using AIC (where the complex model has one extra variable), this corresponds to a likelihood ratio test with the following significance level:

pchisq(2, df = 1, lower.tail = FALSE)
[1] 0.1572992

So choosing between the two models using the AIC is equivalent to using a likelihood ratio test with a significance level of 0.16 instead of the traditional 0.05. So the AIC might very well choose a more complex model than the likelihood ratio test.

(For complex models with even more variables, the corresponding ‘significance level’ will be lower, and will eventually drop below even 0.05.)
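
We can calculate these implied ‘significance levels’ for various numbers of extra parameters:

dp = c(1, 2, 5, 8, 10) # number of extra parameters in the complex model
round(pchisq(2 * dp, df = dp, lower.tail = FALSE), 3)

[1] 0.157 0.135 0.075 0.042 0.029

So with about eight extra parameters, the implied ‘significance level’ drops below the traditional 0.05.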

Exercise 10

Let’s fit a model with the year as a predictor.

mod_year = glmmTMB(
  Rain ~ Year,
  data = quilpie,
  family = tweedie()
)

We then compare the model with the intercept-only model:

anova(mod, mod_year)
Data: quilpie
Models:
mod: Rain ~ 1, zi=~0, disp=~1
mod_year: Rain ~ Year, zi=~0, disp=~1
         Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)
mod       3 484.54 491.20 -239.27   478.54                         
mod_year  4 484.44 493.32 -238.22   476.44 2.1029      1      0.147
AIC(mod, mod_year)
         df      AIC
mod       3 484.5442
mod_year  4 484.4413

The likelihood ratio test doesn’t reject the intercept-only model, so based on this test, we should prefer the simpler intercept-only model.

However, based on the AIC, we should prefer the more complex model that includes year as a predictor. Use the P-value from the likelihood ratio test to explain why this conclusion for the AIC-based model comparison was to be expected.

Warning

AIC can be used to compare models with different predictors, different distribution families and different link functions. However, all models have to be fitted on the same dataset and have the same outcome variable.

Take the smoking and lung capacity dataset of modules 2 and 3 as an example:

  • We can’t use the AIC to compare a model fitted on the data for girls to a model fitted on the data for boys (to check if the models fit better for girls than for boys).
  • And we can’t use it to compare the model that used FEV as the outcome with a similar model that used log-FEV as the outcome.
  • And if we in one model include an additional variable that has some missing values (NA values), the observations with missing values will automatically be excluded in the model-fitting process, so the model will be fitted on a different subset of observations. This again means that we can’t compare this model to the other models. (We can fix this latter issue by first creating a dataset that only includes observations with complete data on all the variables used in the models we want to compare – see the sketch below.)
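
Here’s a minimal sketch of that fix for the lung capacity data (the variable names follow ?lungcap; adjust the list to the variables your models actually use):

library(GLMsData)
data(lungcap)
model_vars = c("FEV", "Age", "Ht", "Gender", "Smoke")
lungcap_complete = na.omit(lungcap[model_vars]) # same rows for all models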

Different model-fitting packages might also define the AIC differently (by dropping different sets of constants from the likelihood function). This means that in general, we can’t compare AIC values for models fitted using different model-fitting functions. But it’s always safe to compare models fitted using glmmTMB().

A different predictor

We said earlier that periods of negative SOI values might ‘predict’ reduced rainfall. The quilpie dataset also includes a variable Phase that summarises the SOI pattern seen during the previous few months (including July itself).

Exercise 11

Plot the July rainfall against the SOI phase. Does the SOI phase seem to predict the July rainfall?

(We here use the term ‘predict’ loosely, as the SOI is partly based on data from the same month.)

Hint: To avoid overplotting, you might find the position_jitter() function from the module 2 exercises useful.

Exercise 12

Fit a Tweedie model with SOI phase as the only predictor variable. Call the model mod_phase. Show the summary() output.

Hint: Since Phase is numeric, you have to convert it to a factor first. See earlier modules (e.g., module 2) for the syntax.

Exercise 13

Based on the AIC, does the model using the SOI in July (mod_quad) or the SOI trend (mod_phase) best fit the data?

Exercise 14

Plot the DHARMa residuals for the mod_phase model. Does the model seem to fit the data?

Temporal correlation (autocorrelation)

Our data have been recorded over time, so they are really time series data. This means that the important assumption that the individual observations are independent might not be fulfilled.

This will be a problem even for the simple model without any predictors! We can test for temporal correlation (autocorrelation) using a different function in the DHARMa package:

sim_resids = simulateResiduals(mod)
testTemporalAutocorrelation(sim_resids, time = quilpie$Year)


    Durbin-Watson test

data:  simulationOutput$scaledResiduals ~ 1
DW = 2.4628, p-value = 0.05212
alternative hypothesis: true autocorrelation is not 0

The low P-value (here only borderline significant) indicates that we may have some autocorrelation, mostly at lag 1, which means that the July data for one year is correlated with the July data for the previous year, but not really with data from two years ago.

If the autocorrelation is caused by a trend (e.g., increasing or decreasing rainfall over time), simply adding Year as a predictor might help (here, it doesn’t). Adding other predictors might also help.

If adding more predictors doesn’t help, we might have to add a correlation structure that takes autocorrelation into account. This can get complicated, but one way of doing this is briefly mentioned in the videos for module 6.
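
For reference, here’s a sketch of one such model in glmmTMB, using its AR(1) covariance structure (the time variable must be a factor with equally spaced levels, and we add a dummy grouping factor since we have a single time series):

quilpie$Year_f = factor(quilpie$Year)
quilpie$series = factor(1) # one group covering the whole series
mod_ar1 = glmmTMB(Rain ~ 1 + ar1(Year_f + 0 | series),
                  data = quilpie, family = tweedie())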

Note

The take-home message is: Whenever we have data that are measured over time, autocorrelation might be a problem. Test for it using the testTemporalAutocorrelation() function. And if there’s autocorrelation, try to fix it.

The End

Thanks for finishing all of these exercises – phew! 😃

Now go out and fit glmmTMB models to your own data!