Chapter 9 Negative Binomial and Zero-Inflation

9.1 Dataset

The NMES1988 data are a sample of 4,406 individuals, aged 66 and over, who were covered by Medicare in 1988. One of the variables provided is the number of physician office visits. The dataset ships with the AER package (Kleiber and Zeileis 2008).

Our goal is to model the number of visits given the other attributes (chronic conditions, health status, gender, etc.). For the sake of simplicity, we select only a subset of attributes considered to be relevant.

variable description
visits Number of physician office visits (integer outcome)
nvisits Number of non-physician office visits (integer)
ovisits Number of physician hospital outpatient visits (integer)
novisits Number of non-physician hospital outpatient visits (integer)
emergency Number of emergency room visits (integer)
hospital Number of hospital stays (integer)
health Self-perceived health status (poor, average, excellent)
chronic Number of chronic conditions (integer)
adl Limitation in activities of daily living (limited, normal)
region Region of residence (northeast, midwest, west, other)
age Age in years, divided by 10
afam African American indicator (no, yes)
gender Gender (female, male)
married Marital status indicator (no, yes)
school Number of years of education (integer)
income Family income, in 10,000 USD
employed Employment indicator (no, yes)
insurance Private insurance indicator (no, yes)
medicaid Medicaid coverage indicator (no, yes)

(Variable descriptions follow the NMES1988 documentation in the AER package.)
# load data from package
# install.package("AER")
library(AER)
## Loading required package: car
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:purrr':
## 
##     some
## The following object is masked from 'package:dplyr':
## 
##     recode
## Loading required package: lmtest
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
## Loading required package: survival
data("NMES1988")
nmes <- NMES1988[, c(1, 6:8, 13, 15, 18)] # select variables of interest
summary(nmes)
##      visits          hospital           health        chronic     
##  Min.   : 0.000   Min.   :0.000   poor     : 554   Min.   :0.000  
##  1st Qu.: 1.000   1st Qu.:0.000   average  :3509   1st Qu.:1.000  
##  Median : 4.000   Median :0.000   excellent: 343   Median :1.000  
##  Mean   : 5.774   Mean   :0.296                    Mean   :1.542  
##  3rd Qu.: 8.000   3rd Qu.:0.000                    3rd Qu.:2.000  
##  Max.   :89.000   Max.   :8.000                    Max.   :8.000  
##     gender         school      insurance 
##  female:2628   Min.   : 0.00   no : 985  
##  male  :1778   1st Qu.: 8.00   yes:3421  
##                Median :11.00             
##                Mean   :10.29             
##                3rd Qu.:12.00             
##                Max.   :18.00
head(nmes)
##   visits hospital  health chronic gender school insurance
## 1      5        1 average       2   male      6       yes
## 2      1        0 average       2 female     10       yes
## 3     13        3    poor       4 female     10        no
## 4     16        1    poor       2   male      3       yes
## 5      3        0 average       2 female      6       yes
## 6     17        0    poor       5 female      7        no

9.2 EDA

As always, let’s start with some descriptive plots.

library(dplyr)
library(ggplot2)
library(ggpubr)

# analyse response
visits_scatter <- nmes %>%
  arrange(visits) %>%
  ggplot() +
  geom_point(aes(1:nrow(nmes), visits), alpha = .3)
visits_bar <- nmes %>%
  ggplot() +
  geom_bar(aes(visits))
ggarrange(visits_scatter, visits_bar, ncol = 2)

The response is highly variable and strongly right-skewed; we can compress the scale by taking the log (adding 0.5 to handle the zero counts).

logvisits_scatter <- nmes %>%
  mutate(visits = log(visits + .5)) %>%
  arrange(visits) %>%
  ggplot() +
  geom_point(aes(1:nrow(nmes), visits), alpha = .3)
logvisits_bar <- nmes %>%
  mutate(visits = log(visits + .5)) %>%
  ggplot() +
  geom_histogram(aes(visits), binwidth = .1)

ggarrange(logvisits_scatter, logvisits_bar, ncol = 2)

Now we analyse the relationship between the number of visits and the number of chronic conditions.

# saturate: pool values above sat_point into a single top level
sat <- function(x, sat_point) {
  x[x > sat_point] <- sat_point # saturate values above sat_point
  lev <- as.character(0:sat_point) # define levels
  lev[length(lev)] <- paste0(lev[length(lev)], "+") # label the top level, e.g. "3+"
  x <- as.factor(x) # switch from numeric to factor
  levels(x) <- lev
  return(x)
}
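
For example (a toy check of the helper, unrelated to the data), counts above the saturation point are pooled into the top level:

sat(c(0, 1, 2, 5, 7), 3) # returns the factor 0 1 2 3+ 3+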

# analyse predictor "chronic"
unbalanced_plt <- nmes %>%
  ggplot() +
  geom_bar(aes(chronic))
unbalanced_plt

This variable has very little support for values of 4 and above, which would lead to imprecise estimates of the number of visits for patients with many chronic conditions. We can mitigate the issue by assuming that a patient with 4 or more conditions sees the physician about as often as a patient with 3 chronic conditions. In other words, we set a saturation point at 3 and transform the data accordingly.

balanced_plt <- nmes %>%
  mutate(chronic_sat = sat(chronic, 3)) %>%
  ggplot() +
  geom_bar(aes(chronic_sat))

ggarrange(unbalanced_plt, balanced_plt)

To summarize, here are the two data transformations discussed so far (which we can compare against the original data): the log-transform addresses the highly skewed distribution of the response, while the saturation point on the chronic predictor creates more balanced classes.

# let's add two columns
nmes <- nmes %>%
  mutate(
    chronic_sat = sat(chronic, 3),
    log_visits = log(visits + .5)
  )

# bivariate analysis
biv <- nmes %>%
  ggplot() +
  geom_boxplot(aes(as.factor(chronic), visits))
tr_biv <- nmes %>%
  ggplot() +
  geom_boxplot(aes(chronic_sat, log_visits))

ggarrange(biv, tr_biv, ncol = 2)

There are far fewer outliers and less variability within classes, which suggests that the transformed variables are easier to model.
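
We can quantify this impression with a quick group-wise summary (a sketch using dplyr; the groups are the saturated chronic classes created above):

nmes %>%
  group_by(chronic_sat) %>%
  summarise(mean_log_visits = mean(log_visits), sd_log_visits = sd(log_visits))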

Here are some other bivariate plots which can be useful for further inspection of the available data.

library(tidyr)

plts <- list()

plts$health <- nmes %>%
  ggplot(aes(health, log_visits)) +
  geom_boxplot()
plts$chronic <- nmes %>%
  ggplot(aes(chronic_sat, log_visits)) +
  geom_boxplot()
plts$insurance <- nmes %>%
  ggplot(aes(insurance, log_visits)) +
  geom_boxplot()
plts$hospital <- nmes %>%
  mutate(hospital_sat = sat(hospital, 3)) %>%
  ggplot(aes(hospital_sat, log_visits)) +
  geom_boxplot()
plts$gender <- nmes %>%
  ggplot(aes(gender, log_visits)) +
  geom_boxplot()
plts$school <- nmes %>%
  mutate(school = sat(school, 6)) %>%
  ggplot(aes(school, log_visits)) +
  geom_boxplot()

do.call(ggarrange, plts)

9.3 Negative binomial regression

The overdispersion of the data can be captured by a Negative Binomial model, which differs from the Poisson model in that the variance is allowed to exceed the mean. It can therefore account for overdispersed count variates.
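
As a rough check (ignoring covariates), the raw response already hints at overdispersion: under a Poisson model the sample variance would be close to the mean.

# sample mean and variance of the response
mean(nmes$visits)
var(nmes$visits)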

First we try a simple Poisson regression:

# define a formula (select the relevant/interesting predictors)
fml <- visits ~ hospital + health + chronic_sat + gender + school + insurance

pois_model <- glm(
  formula = fml, family = poisson(link = "log"), # family object "poisson"
  data = nmes
)
summary(pois_model)
## 
## Call:
## glm(formula = fml, family = poisson(link = "log"), data = nmes)
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      0.848055   0.027418  30.930  < 2e-16 ***
## hospital         0.171588   0.005950  28.841  < 2e-16 ***
## healthpoor       0.277053   0.017401  15.922  < 2e-16 ***
## healthexcellent -0.312714   0.030443 -10.272  < 2e-16 ***
## chronic_sat1     0.361870   0.020591  17.574  < 2e-16 ***
## chronic_sat2     0.580241   0.021405  27.108  < 2e-16 ***
## chronic_sat3+    0.694679   0.021736  31.960  < 2e-16 ***
## gendermale      -0.104541   0.012963  -8.065 7.33e-16 ***
## school           0.025902   0.001842  14.062  < 2e-16 ***
## insuranceyes     0.191394   0.016902  11.323  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 26943  on 4405  degrees of freedom
## Residual deviance: 22928  on 4396  degrees of freedom
## AIC: 35723
## 
## Number of Fisher Scoring iterations: 5

Note: if we print out the coefficients and exponentiate them to bring them back to the scale of the counts, we see, for instance, that excellent health is associated with a decrease in visits, while three or more chronic conditions roughly doubles the expected visit count.

coef(pois_model)
##     (Intercept)        hospital      healthpoor healthexcellent    chronic_sat1 
##      0.84805502      0.17158798      0.27705323     -0.31271426      0.36186975 
##    chronic_sat2   chronic_sat3+      gendermale          school    insuranceyes 
##      0.58024070      0.69467856     -0.10454072      0.02590154      0.19139428
exp(coef(pois_model))
##     (Intercept)        hospital      healthpoor healthexcellent    chronic_sat1 
##       2.3351007       1.1871886       1.3192366       0.7314589       1.4360119 
##    chronic_sat2   chronic_sat3+      gendermale          school    insuranceyes 
##       1.7864684       2.0030651       0.9007381       1.0262399       1.2109368

Then we compare the results with a regression fit on a GLM with Negative Binomial family.

# fit the equivalent NB model
# check
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
nb_model <- glm.nb(formula = fml, data = nmes)
summary(nb_model)
## 
## Call:
## glm.nb(formula = fml, data = nmes, init.theta = 1.218804554, 
##     link = log)
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      0.792055   0.059558  13.299  < 2e-16 ***
## hospital         0.220185   0.020042  10.986  < 2e-16 ***
## healthpoor       0.318920   0.047833   6.667 2.60e-11 ***
## healthexcellent -0.317113   0.061183  -5.183 2.18e-07 ***
## chronic_sat1     0.372721   0.042733   8.722  < 2e-16 ***
## chronic_sat2     0.575107   0.047135  12.201  < 2e-16 ***
## chronic_sat3+    0.721231   0.049283  14.635  < 2e-16 ***
## gendermale      -0.118742   0.031166  -3.810 0.000139 ***
## school           0.027295   0.004381   6.230 4.67e-10 ***
## insuranceyes     0.207695   0.039357   5.277 1.31e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(1.2188) family taken to be 1)
## 
##     Null deviance: 5782.5  on 4405  degrees of freedom
## Residual deviance: 5044.7  on 4396  degrees of freedom
## AIC: 24330
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  1.2188 
##           Std. Err.:  0.0340 
## 
##  2 x log-likelihood:  -24308.2680

Among the fit information we can see that the glm.nb function also estimates the dispersion parameter (theta) of the Negative Binomial. Keep in mind that there are several parametrizations of this distribution; one of them uses mean \(\mu\) and dispersion \(r\), i.e. \(NB(\mu, r)\), such that

\[ \sigma^2 = \mu + \frac{\mu^2}{r}\,,\qquad p = \frac{\mu}{\sigma^2} \]
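
With the estimated dispersion we can compute the variance implied by the model for each observation (a small sketch, assuming the nb_model fitted above; glm.nb stores the estimated dispersion in the theta component):

theta_hat <- nb_model$theta # estimated dispersion r
mu_hat <- fitted(nb_model) # fitted means
implied_var <- mu_hat + mu_hat^2 / theta_hat # sigma^2 = mu + mu^2 / r
head(cbind(mu_hat, implied_var))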

# coefficients again
exp(coef(nb_model))
##     (Intercept)        hospital      healthpoor healthexcellent    chronic_sat1 
##       2.2079292       1.2463073       1.3756414       0.7282484       1.4516787 
##    chronic_sat2   chronic_sat3+      gendermale          school    insuranceyes 
##       1.7773208       2.0569629       0.8880371       1.0276705       1.2308380

Let’s compare the coefficients and the confidence intervals:

rbind(exp(coef(pois_model)), exp(coef(nb_model)))
##      (Intercept) hospital healthpoor healthexcellent chronic_sat1 chronic_sat2
## [1,]    2.335101 1.187189   1.319237       0.7314589     1.436012     1.786468
## [2,]    2.207929 1.246307   1.375641       0.7282484     1.451679     1.777321
##      chronic_sat3+ gendermale   school insuranceyes
## [1,]      2.003065  0.9007381 1.026240     1.210937
## [2,]      2.056963  0.8880371 1.027671     1.230838
cbind(confint.default(pois_model), confint.default(nb_model))
##                      2.5 %      97.5 %       2.5 %      97.5 %
## (Intercept)      0.7943160  0.90179405  0.67532364  0.90878650
## hospital         0.1599271  0.18324884  0.18090387  0.25946612
## healthpoor       0.2429475  0.31115895  0.22516928  0.41267092
## healthexcellent -0.3723819 -0.25304665 -0.43703008 -0.19719596
## chronic_sat1     0.3215120  0.40222748  0.28896513  0.45647604
## chronic_sat2     0.5382882  0.62219320  0.48272398  0.66749012
## chronic_sat3+    0.6520769  0.73728020  0.62463832  0.81782282
## gendermale      -0.1299469 -0.07913454 -0.17982625 -0.05765723
## school           0.0222914  0.02951169  0.01870754  0.03588172
## insuranceyes     0.1582661  0.22452241  0.13055767  0.28483287
cbind(
  confint.default(pois_model)[, 2] - confint.default(pois_model)[, 1],
  confint.default(nb_model)[, 2] - confint.default(nb_model)[, 1]
)
##                        [,1]       [,2]
## (Intercept)     0.107478074 0.23346286
## hospital        0.023321722 0.07856225
## healthpoor      0.068211444 0.18750164
## healthexcellent 0.119335218 0.23983412
## chronic_sat1    0.080715454 0.16751091
## chronic_sat2    0.083904993 0.18476614
## chronic_sat3+   0.085203273 0.19318450
## gendermale      0.050812349 0.12216902
## school          0.007220284 0.01717418
## insuranceyes    0.066256263 0.15427520

The Negative Binomial confidence intervals are wider, an effect of the model allowing for more uncertainty (overdispersion) than the Poisson.
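
Overdispersion in the Poisson fit can also be assessed more formally, for instance with the dispersiontest function provided by AER (a quick sketch on the pois_model fitted above; by default the test considers an alternative where the variance grows linearly with the mean):

# score test of equidispersion against overdispersion
dispersiontest(pois_model)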

9.4 Zero-Inflation

With a regression fit, we only obtain the mean of the Poisson, or Negative Binomial, distribution for each observation; these are the fitted values. However, many observations have a response of 0 visits. How many zeros does our model predict?

mu <- predict(pois_model, type = "response") # get the Poisson means
expected_zero_count <- sum(dpois(x = 0, lambda = mu)) # sum_i P(Y_i = 0 | mu_i)
round(expected_zero_count)
## [1] 60

How many are actually 0? Many more…

# actual 0 visits
sum(nmes$visits == 0)
## [1] 683
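
For comparison, the same check can be run on the Negative Binomial fit, whose zero probability follows the NB pmf (a sketch assuming the nb_model from above; dnbinom is parametrized here with mu and size = theta):

mu_nb <- predict(nb_model, type = "response") # NB means
round(sum(dnbinom(x = 0, mu = mu_nb, size = nb_model$theta))) # expected zeros under the NB model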

For this reason we introduce a composite model, the Zero-Inflated Poisson (ZIP): a mixture of a Poisson distribution and a point mass at zero.

\[ P(Y = y) = \pi \mathrm{1}(y = 0) + (1 - \pi) \frac{\lambda^{y} e^{-\lambda}}{y!} \]
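
This pmf translates directly into R (a small helper for illustration only; it is not used in the model fitting below):

# ZIP pmf: point mass at zero mixed with a Poisson
dzip <- function(y, p0, lambda) {
  p0 * (y == 0) + (1 - p0) * dpois(y, lambda)
}
dzip(0:5, p0 = .3, lambda = 5)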

Let’s have a look at the generative model and the distribution of ZIP variates. It can be seen as a two-step process:

  1. Sample a Bernoulli variable which states whether the observation comes from the zero component or from the count component
  2. If it comes from the count component, sample a Poisson variable with a given mean
n <- 100
pp <- .3 # probability of the zero component
ll <- 5 # Poisson mean of the count component
zi_samples <- rbinom(n, 1, 1 - pp) # 1 = count component, 0 = structural zero
zi_samples[zi_samples == 1] <- rpois(sum(zi_samples), lambda = ll) # sample Poisson for the count component
tibble(y = zi_samples) %>%
  ggplot() +
  geom_histogram(aes(y))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Our data do not look exactly like a zero-inflated sample, but a ZI model can still give us more insight into the excess of zeros.

nmes %>%
  ggplot() +
  geom_histogram(aes(visits)) +
  xlim(0, 30)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 47 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_bar()`).

nmes %>%
  mutate(gtz = as.factor(ifelse(visits > 0, ">0", "0"))) %>%
  ggplot() +
  geom_bar(aes(gtz))

Let’s try to fit a zero inflation model.

Notice how the zeroinfl function fits two models: the count model (Poisson by default) and the zero model (a logistic regression).

library(pscl)
## Classes and Methods for R originally developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University (2002-2015),
## by and under the direction of Simon Jackman.
## hurdle and zeroinfl functions by Achim Zeileis.
zip_model <- zeroinfl(formula = fml, data = nmes)
summary(zip_model)
## 
## Call:
## zeroinfl(formula = fml, data = nmes)
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5758 -1.1488 -0.4766  0.5484 24.6115 
## 
## Count model coefficients (poisson with log link):
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      1.307611   0.028121  46.499  < 2e-16 ***
## hospital         0.162911   0.006033  27.004  < 2e-16 ***
## healthpoor       0.278539   0.017354  16.051  < 2e-16 ***
## healthexcellent -0.283549   0.031288  -9.063  < 2e-16 ***
## chronic_sat1     0.212773   0.020937  10.163  < 2e-16 ***
## chronic_sat2     0.361709   0.021714  16.658  < 2e-16 ***
## chronic_sat3+    0.440948   0.021973  20.067  < 2e-16 ***
## gendermale      -0.057361   0.013071  -4.388 1.14e-05 ***
## school           0.019133   0.001873  10.215  < 2e-16 ***
## insuranceyes     0.077189   0.017174   4.494 6.98e-06 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      0.04724    0.15010   0.315  0.75294    
## hospital        -0.30074    0.09149  -3.287  0.00101 ** 
## healthpoor       0.01370    0.16175   0.085  0.93249    
## healthexcellent  0.18472    0.15263   1.210  0.22619    
## chronic_sat1    -0.74134    0.10483  -7.072 1.53e-12 ***
## chronic_sat2    -1.31875    0.13834  -9.533  < 2e-16 ***
## chronic_sat3+   -1.86784    0.17077 -10.938  < 2e-16 ***
## gendermale       0.40582    0.08994   4.512 6.41e-06 ***
## school          -0.05698    0.01232  -4.624 3.77e-06 ***
## insuranceyes    -0.75192    0.10316  -7.289 3.13e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 27 
## Log-likelihood: -1.611e+04 on 20 Df
round(sum(predict(zip_model, type = "zero"))) # expected structural zeros: much closer to the observed number of zeros
## [1] 669
# can account also for overdispersion
zinb_model <- zeroinfl(formula = fml, dist = "negbin", data = nmes)
summary(zinb_model)
## 
## Call:
## zeroinfl(formula = fml, data = nmes, dist = "negbin")
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1938 -0.7112 -0.2775  0.3293 17.1983 
## 
## Count model coefficients (negbin with log link):
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      1.096144   0.063810  17.178  < 2e-16 ***
## hospital         0.202671   0.020602   9.837  < 2e-16 ***
## healthpoor       0.303138   0.046057   6.582 4.65e-11 ***
## healthexcellent -0.304768   0.063342  -4.812 1.50e-06 ***
## chronic_sat1     0.260039   0.045197   5.753 8.75e-09 ***
## chronic_sat2     0.419531   0.048339   8.679  < 2e-16 ***
## chronic_sat3+    0.542364   0.049509  10.955  < 2e-16 ***
## gendermale      -0.072537   0.031212  -2.324   0.0201 *  
## school           0.022117   0.004415   5.010 5.44e-07 ***
## insuranceyes     0.106747   0.042311   2.523   0.0116 *  
## Log(theta)       0.401063   0.036208  11.077  < 2e-16 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -0.21623    0.29103  -0.743 0.457493    
## hospital        -0.82495    0.52315  -1.577 0.114820    
## healthpoor       0.14127    0.41427   0.341 0.733091    
## healthexcellent  0.05502    0.33699   0.163 0.870316    
## chronic_sat1    -1.06685    0.22995  -4.639 3.49e-06 ***
## chronic_sat2    -2.22231    0.45963  -4.835 1.33e-06 ***
## chronic_sat3+   -3.42831    0.96764  -3.543 0.000396 ***
## gendermale       0.70380    0.20796   3.384 0.000714 ***
## school          -0.07604    0.02789  -2.726 0.006412 ** 
## insuranceyes    -1.25027    0.23645  -5.288 1.24e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Theta = 1.4934 
## Number of iterations in BFGS optimization: 36 
## Log-likelihood: -1.209e+04 on 21 Df
exp_coeff <- exp(coef(zip_model))
exp_coeff <- matrix(exp_coeff, ncol = 2)
colnames(exp_coeff) <- c("count", "zero")
exp_coeff
##           count      zero
##  [1,] 3.6973299 1.0483782
##  [2,] 1.1769320 0.7402724
##  [3,] 1.3211986 1.0137962
##  [4,] 0.7531066 1.2028785
##  [5,] 1.2371035 0.4764769
##  [6,] 1.4357804 0.2674684
##  [7,] 1.5541806 0.1544569
##  [8,] 0.9442534 1.5005307
##  [9,] 1.0193167 0.9446120
## [10,] 1.0802458 0.4714595

We can also specify different predictor sets for the two parts.

E.g. from the previous summary it seems that health does not affect the zero part (the probability of a structural zero). We can remove it from the zero model:

zinb_2model <- zeroinfl(
  formula = visits ~
    hospital + health + chronic_sat + gender + school + insurance | # count
      hospital + chronic_sat + gender + school + insurance, # zero
  dist = "negbin",
  data = nmes
)
summary(zinb_2model)
## 
## Call:
## zeroinfl(formula = visits ~ hospital + health + chronic_sat + gender + 
##     school + insurance | hospital + chronic_sat + gender + school + insurance, 
##     data = nmes, dist = "negbin")
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1930 -0.7129 -0.2774  0.3325 17.2308 
## 
## Count model coefficients (negbin with log link):
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      1.095975   0.063712  17.202  < 2e-16 ***
## hospital         0.203030   0.020520   9.894  < 2e-16 ***
## healthpoor       0.299896   0.045025   6.661 2.73e-11 ***
## healthexcellent -0.307842   0.060352  -5.101 3.38e-07 ***
## chronic_sat1     0.261129   0.045084   5.792 6.95e-09 ***
## chronic_sat2     0.420119   0.048342   8.691  < 2e-16 ***
## chronic_sat3+    0.542827   0.049521  10.962  < 2e-16 ***
## gendermale      -0.073013   0.031222  -2.339   0.0194 *  
## school           0.022174   0.004399   5.041 4.64e-07 ***
## insuranceyes     0.105967   0.042156   2.514   0.0119 *  
## Log(theta)       0.399783   0.036029  11.096  < 2e-16 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -0.20007    0.28703  -0.697 0.485788    
## hospital      -0.81416    0.49144  -1.657 0.097584 .  
## chronic_sat1  -1.05197    0.22352  -4.706 2.52e-06 ***
## chronic_sat2  -2.21569    0.46239  -4.792 1.65e-06 ***
## chronic_sat3+ -3.48962    1.06920  -3.264 0.001099 ** 
## gendermale     0.69879    0.20783   3.362 0.000773 ***
## school        -0.07574    0.02732  -2.772 0.005564 ** 
## insuranceyes  -1.26793    0.23136  -5.480 4.25e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Theta = 1.4915 
## Number of iterations in BFGS optimization: 33 
## Log-likelihood: -1.209e+04 on 19 Df

And since the regression model is composed of two parts, the predictions can also be split accordingly:

# two models predictions and combined
predict(zip_model, type = "zero")[1:5]
##          1          2          3          4          5 
## 0.09447018 0.06957463 0.03630164 0.11149199 0.08585591
predict(zip_model, type = "count")[1:5]
##         1         2         3         4         5 
##  7.148150  6.943690 14.986611  8.917319  6.432114
predict(zip_model, type = "response")[1:5]
##         1         2         3         4         5 
##  6.472863  6.460585 14.442573  7.923110  5.879879
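
For the zero-inflated model, the response is the count mean scaled by the probability of not coming from the zero component, and we can verify this directly (a quick check on the zip_model fitted above):

p_zero <- predict(zip_model, type = "zero") # probability of a structural zero
mu_count <- predict(zip_model, type = "count") # mean of the count component
head((1 - p_zero) * mu_count) # matches predict(zip_model, type = "response")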

Finally, we can compare the models against each other, using a likelihood ratio test or the AIC score.

library(lmtest)
# likelihood test
lmtest::lrtest(zip_model, zinb_2model)
## Likelihood ratio test
## 
## Model 1: visits ~ hospital + health + chronic_sat + gender + school + 
##     insurance
## Model 2: visits ~ hospital + health + chronic_sat + gender + school + 
##     insurance | hospital + chronic_sat + gender + school + insurance
##   #Df LogLik Df Chisq Pr(>Chisq)    
## 1  20 -16107                        
## 2  19 -12085 -1  8044  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# the p-value tells whether the log-likelihood difference is significant
# (note: these models are not strictly nested, so treat the test as indicative)

# AIC for model selection based on complexity/performance tradeoff
AIC(pois_model, nb_model, zip_model, zinb_2model)
##             df      AIC
## pois_model  10 35723.15
## nb_model    11 24330.27
## zip_model   20 32254.65
## zinb_2model 19 24208.69

9.4.1 Hurdle v. ZI

In a zero-inflated model, zeros can come both from the zero component and from the count component. Sometimes we might want to model a zero as a separate event. The Hurdle model, as an alternative to the ZI model, makes a clean distinction between counts that are zero and counts that are 1 or more: positive counts follow a zero-truncated distribution. Here is the Hurdle-Poisson model, for instance:

\[ P(Y = y) = \pi\, \mathrm{1}(y = 0) + (1 - \pi)\, \frac{\lambda^{y} e^{-\lambda}}{y!\,\left(1 - e^{-\lambda}\right)}\, \mathrm{1}(y > 0) \]

This is particularly useful when the data consist of large counts on average, but for some reason (e.g. reading errors) many values are 0 instead. In those cases it probably doesn't make sense to account for the probability of a zero being drawn from the same Poisson as all the other, larger counts.
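
As with the ZIP pmf above, the hurdle-Poisson pmf translates into a small helper (for illustration only):

# hurdle-Poisson pmf: zero probability p0, zero-truncated Poisson for y > 0
dhurdle_pois <- function(y, p0, lambda) {
  p0 * (y == 0) + (1 - p0) * dpois(y, lambda) / (1 - dpois(0, lambda)) * (y > 0)
}
dhurdle_pois(0:5, p0 = .3, lambda = 5)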

In R, this can be done with the hurdle function from the same package (pscl).

hurdle_model <- hurdle(formula = fml, data = nmes)
summary(hurdle_model)
## 
## Call:
## hurdle(formula = fml, data = nmes)
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5965 -1.1507 -0.4768  0.5482 24.5851 
## 
## Count model coefficients (truncated poisson with log link):
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      1.308658   0.028104  46.565  < 2e-16 ***
## hospital         0.162841   0.006034  26.988  < 2e-16 ***
## healthpoor       0.278497   0.017355  16.047  < 2e-16 ***
## healthexcellent -0.282804   0.031274  -9.043  < 2e-16 ***
## chronic_sat1     0.212190   0.020935  10.136  < 2e-16 ***
## chronic_sat2     0.361344   0.021708  16.646  < 2e-16 ***
## chronic_sat3+    0.440479   0.021973  20.046  < 2e-16 ***
## gendermale      -0.057295   0.013071  -4.383 1.17e-05 ***
## school           0.019050   0.001871  10.180  < 2e-16 ***
## insuranceyes     0.077519   0.017166   4.516 6.31e-06 ***
## Zero hurdle model coefficients (binomial with logit link):
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -0.0981863  0.1468254  -0.669 0.503669    
## hospital         0.3107238  0.0913541   3.401 0.000671 ***
## healthpoor      -0.0001956  0.1611027  -0.001 0.999031    
## healthexcellent -0.2419271  0.1442703  -1.677 0.093562 .  
## chronic_sat1     0.7585597  0.1024341   7.405 1.31e-13 ***
## chronic_sat2     1.3396653  0.1359419   9.855  < 2e-16 ***
## chronic_sat3+    1.8896854  0.1686629  11.204  < 2e-16 ***
## gendermale      -0.4052651  0.0881205  -4.599 4.25e-06 ***
## school           0.0589871  0.0120460   4.897 9.74e-07 ***
## insuranceyes     0.7449223  0.1012575   7.357 1.88e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 16 
## Log-likelihood: -1.611e+04 on 20 Df
z_pred <- predict(hurdle_model, type = "zero")[1:5]
c_pred <- predict(hurdle_model, type = "count")[1:5]
predict(hurdle_model, type = "response")[1:5]
##         1         2         3         4         5 
##  6.472374  6.460350 14.453157  7.923691  5.879604
# here the composite prediction is merely the product of the
# two models' predictions
z_pred * c_pred
##         1         2         3         4         5 
##  6.472374  6.460350 14.453157  7.923691  5.879604
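
Since the hurdle model also provides a log-likelihood, we can add it to the AIC comparison from before (a quick sketch):

AIC(zip_model, zinb_2model, hurdle_model)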

References

Kleiber, Christian, and Achim Zeileis. 2008. Applied Econometrics with R. New York: Springer-Verlag. https://CRAN.R-project.org/package=AER.