\[ \DeclareMathOperator{\E}{E} \DeclareMathOperator{\mean}{mean} \DeclareMathOperator{\Var}{Var} \DeclareMathOperator{\Cov}{Cov} \DeclareMathOperator{\Cor}{Cor} \DeclareMathOperator{\Bias}{Bias} \DeclareMathOperator{\MSE}{MSE} \DeclareMathOperator{\RMSE}{RMSE} \DeclareMathOperator{\sd}{sd} \DeclareMathOperator{\se}{se} \DeclareMathOperator{\tr}{tr} \DeclareMathOperator{\median}{median} \DeclareMathOperator{\rank}{rank} \DeclareMathOperator*{\argmin}{arg\,min} \DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator{\logistic}{Logistic} \DeclareMathOperator{\logit}{Logit} \newcommand{\mat}[1]{\boldsymbol{#1}} \newcommand{\vec}[1]{\boldsymbol{#1}} \newcommand{\T}{'} % This follows BDA \newcommand{\dunif}{\mathsf{Uniform}} \newcommand{\dnorm}{\mathsf{Normal}} \newcommand{\dhalfnorm}{\mathrm{HalfNormal}} \newcommand{\dlnorm}{\mathsf{LogNormal}} \newcommand{\dmvnorm}{\mathsf{Normal}} \newcommand{\dgamma}{\mathsf{Gamma}} \newcommand{\dinvgamma}{\mathsf{InvGamma}} \newcommand{\dchisq}{\mathsf{ChiSquared}} \newcommand{\dinvchisq}{\mathsf{InvChiSquared}} \newcommand{\dexp}{\mathsf{Exponential}} \newcommand{\dlaplace}{\mathsf{Laplace}} \newcommand{\dweibull}{\mathsf{Weibull}} \newcommand{\dwishart}{\mathsf{Wishart}} \newcommand{\dinvwishart}{\mathsf{InvWishart}} \newcommand{\dlkj}{\mathsf{LkjCorr}} \newcommand{\dt}{\mathsf{StudentT}} \newcommand{\dhalft}{\mathsf{HalfStudentT}} \newcommand{\dbeta}{\mathsf{Beta}} \newcommand{\ddirichlet}{\mathsf{Dirichlet}} \newcommand{\dlogistic}{\mathsf{Logistic}} \newcommand{\dllogistic}{\mathsf{LogLogistic}} \newcommand{\dpois}{\mathsf{Poisson}} \newcommand{\dBinom}{\mathsf{Binomial}} \newcommand{\dmultinom}{\mathsf{Multinom}} \newcommand{\dnbinom}{\mathsf{NegativeBinomial}} \newcommand{\dnbinomalt}{\mathsf{NegativeBinomial2}} \newcommand{\dbetabinom}{\mathsf{BetaBinomial}} \newcommand{\dcauchy}{\mathsf{Cauchy}} \newcommand{\dhalfcauchy}{\mathsf{HalfCauchy}} \newcommand{\dbernoulli}{\mathsf{Bernoulli}} \newcommand{\R}{\mathbb{R}} \newcommand{\Reals}{\R} \newcommand{\RealPos}{\R^{+}} \newcommand{\N}{\mathbb{N}} \newcommand{\Nats}{\N} \newcommand{\cia}{\perp\!\!\!\perp} \DeclareMathOperator*{\plim}{plim} \DeclareMathOperator{\invlogit}{Inv-Logit} \DeclareMathOperator{\logit}{Logit} \DeclareMathOperator{\diag}{diag} \]

14 Separation

Prerequisites

library("rstan")
library("rstanarm")
library("tidyverse")
library("recipes")
library("jrnold.bayes.notes")

14.1 Introduction

Separation occurs when a predictor perfectly predicts a binary response variable (Rainey 2016; Zorn 2005).

For classification problems, there are three cases.

  • complete separation: the predictor perfectly predicts both 0’s and 1’s.
  • quasi-complete separation: the predictor perfectly predicts either 0’s or 1’s.
  • overlap: the predictor does not perfectly predict either 0’s or 1’s.

Both complete separation and quasi-complete separation cause problems for binomial maximum likelihood estimators.

14.2 Complete Separation

The following data are an example of complete separation.

CompleteSeparation
#> # A tibble: 8 x 3
#>       y    x1    x2
#>   <dbl> <dbl> <dbl>
#> 1     0     1     3
#> 2     0     2     2
#> 3     0     3    -1
#> 4     0     3    -1
#> 5     1     5     2
#> 6     1     6     4
#> # ... with 2 more rows
count(CompleteSeparation, y, x1) %>%
  group_by(x1) %>%
  mutate(p = n / sum(n)) %>%
  select(-n) %>%
  spread(y, p, fill = 0)
#> # A tibble: 7 x 3
#> # Groups:   x1 [7]
#>      x1   `0`   `1`
#>   <dbl> <dbl> <dbl>
#> 1     1     1     0
#> 2     2     1     0
#> 3     3     1     0
#> 4     5     0     1
#> 5     6     0     1
#> 6    10     0     1
#> # ... with 1 more row

The variable x1 perfectly separates y, since when x1 <= 3, y = 0, and when x1 > 3, y = 1.

Under complete separation, the binomial likelihood with a logistic link has no finite maximizer: the likelihood keeps increasing as the fitted curve approaches a step function, which pushes the coefficient on the separating variable toward \(\infty\).
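A quick numerical sketch makes this concrete (the cutpoint at x1 = 4 and the grid of slopes are arbitrary choices for illustration): as the slope grows, the log-likelihood climbs toward its supremum of zero but never attains it at any finite coefficient.

loglik <- function(b) {
  # log-likelihood of a logistic model with slope b and decision boundary at x1 = 4
  eta <- b * (CompleteSeparation$x1 - 4)
  sum(dbinom(CompleteSeparation$y, size = 1, prob = plogis(eta), log = TRUE))
}
sapply(c(1, 5, 10, 50), loglik)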

ggplot(CompleteSeparation, aes(x = x1, y = y)) +
  geom_point() +
  geom_smooth(method = "glm", formula = y ~ x, se = FALSE,
              method.args = list(family = binomial()), colour = "gray")
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

If we estimate a binomial model on these data, glm() warns that some fitted probabilities are numerically zero or one.

fit_cs1 <- glm(y ~ x1 + x2, data = CompleteSeparation, family = binomial())
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(fit_cs1)
#> 
#> Call:
#> glm(formula = y ~ x1 + x2, family = binomial(), data = CompleteSeparation)
#> 
#> Deviance Residuals: 
#>         1          2          3          4          5          6  
#> -2.10e-08  -1.40e-05  -2.52e-06  -2.52e-06   1.56e-05   2.10e-08  
#>         7          8  
#>  2.10e-08   2.10e-08  
#> 
#> Coefficients:
#>              Estimate Std. Error z value Pr(>|z|)
#> (Intercept)    -66.10  183471.72       0        1
#> x1              15.29   27362.84       0        1
#> x2               6.24   81543.72       0        1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 1.1090e+01  on 7  degrees of freedom
#> Residual deviance: 4.5454e-10  on 5  degrees of freedom
#> AIC: 6
#> 
#> Number of Fisher Scoring iterations: 24

Additionally, both the coefficient estimates and their standard errors are implausibly large.
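One way to see this directly, using only base R, is to inspect the fitted values: every fitted probability is numerically zero or one, and the fitted log-odds are enormous.

round(fitted(fit_cs1), 6)
range(predict(fit_cs1, type = "link"))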

14.3 Quasi-Separation

The following simulated data are an example of quasi-complete separation.

knitr::kable(QuasiSeparation)
 y  x1  x2
 0   1   3
 0   2   0
 0   3  -1
 0   3   4
 1   3   1
 1   4   0
 1   5   2
 1   6   7
 1  10   3
 1  11   4

The variable x1 almost separates y. When x1 < 3, y = 0, and when x1 > 3, y = 1. Only when x1 = 3 does y take both values, 0 and 1.

count(QuasiSeparation, y, x1) %>%
  group_by(x1) %>%
  mutate(p = n / sum(n)) %>%
  select(-n) %>%
  spread(y, p, fill = 0)
#> # A tibble: 8 x 3
#> # Groups:   x1 [8]
#>      x1   `0`   `1`
#>   <dbl> <dbl> <dbl>
#> 1     1 1     0    
#> 2     2 1     0    
#> 3     3 0.667 0.333
#> 4     4 0     1    
#> 5     5 0     1    
#> 6     6 0     1    
#> # ... with 2 more rows

In the quasi-separation case, as in the complete separation case, the best-fitting curve is close to a step function. Unlike the complete separation case, glm() reports a finite, though very large, coefficient for the separating variable; its value is determined largely by where the Fisher scoring iterations stop.

ggplot(QuasiSeparation, aes(x = x1, y = y)) +
  geom_point() +
  geom_smooth(method = "glm", formula = y ~ x, se = FALSE,
              method.args = list(family = binomial()), colour = "gray")
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

fit_qs1 <- glm(y ~ x1 + x2, data = QuasiSeparation, family = binomial())
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(fit_qs1)
#> 
#> Call:
#> glm(formula = y ~ x1 + x2, family = binomial(), data = QuasiSeparation)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -1.0042  -0.0001   0.0000   0.0000   1.4689  
#> 
#> Coefficients:
#>              Estimate Std. Error z value Pr(>|z|)
#> (Intercept)   -58.076  17511.903     0.0     1.00
#> x1             19.178   5837.301     0.0     1.00
#> x2             -0.121      0.610    -0.2     0.84
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 13.4602  on 9  degrees of freedom
#> Residual deviance:  3.7792  on 7  degrees of freedom
#> AIC: 9.779
#> 
#> Number of Fisher Scoring iterations: 21
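The reported estimate of roughly 19 for x1 is largely an artifact of where Fisher scoring stops. A small sketch (the tolerance values below are arbitrary): loosening or tightening the convergence tolerance moves the estimate, which would not happen at a genuine interior maximum of the likelihood.

# a looser tolerance typically stops earlier, with a smaller coefficient
coef(glm(y ~ x1 + x2, data = QuasiSeparation, family = binomial(),
         control = glm.control(epsilon = 1e-4)))["x1"]
# a tighter tolerance lets the coefficient drift further out
coef(glm(y ~ x1 + x2, data = QuasiSeparation, family = binomial(),
         control = glm.control(epsilon = 1e-14, maxit = 100)))["x1"]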

14.4 Weak Priors

Although the likelihood alone does not identify finite coefficients under separation, weakly informative priors on the regression coefficients are enough to deal with it, for example \[ \beta_k \sim \dnorm(0, 2.5), \] where the columns of the design matrix are assumed to have unit variance (or to be otherwise standardized). The Cauchy prior, \(\dcauchy(0, 2.5)\), suggested by Gelman et al. (2008) is insufficiently informative to deal with separation (J. Ghosh, Li, and Mitra 2015), but finite-variance weakly informative Student-t or Normal distributions will work.

These are the priors suggested in the Stan prior-choice recommendations and used by default in rstanarm.
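As a sketch (this mirrors the rstanarm defaults at the time of writing; verify with prior_summary()), the prior can also be written out explicitly. With autoscale = TRUE the scale of 2.5 is rescaled by each predictor's standard deviation, which corresponds to the standardization assumption above.

fit_cs_prior <- stan_glm(y ~ x1 + x2, data = CompleteSeparation,
                         family = binomial(),
                         prior = normal(location = 0, scale = 2.5, autoscale = TRUE),
                         prior_intercept = normal(location = 0, scale = 2.5,
                                                  autoscale = TRUE))
prior_summary(fit_cs_prior)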

When estimated with stan_glm(), the coefficients for both the complete-separation and the quasi-separation data are finite.

fit_cs2 <- stan_glm(y ~ x1 + x2, data = CompleteSeparation,
                    family = binomial())
summary(fit_cs2)
#> 
#> Model Info:
#> 
#>  function:     stan_glm
#>  family:       binomial [logit]
#>  formula:      y ~ x1 + x2
#>  algorithm:    sampling
#>  priors:       see help('prior_summary')
#>  sample:       4000 (posterior sample size)
#>  observations: 8
#>  predictors:   3
#> 
#> Estimates:
#>                 mean   sd    2.5%   25%   50%   75%   97.5%
#> (Intercept)    -6.2    2.7 -12.1   -7.8  -5.9  -4.3  -1.6  
#> x1              1.1    0.4   0.3    0.8   1.0   1.3   2.1  
#> x2              0.9    0.8  -0.5    0.3   0.8   1.3   2.5  
#> mean_PPD        0.5    0.1   0.2    0.4   0.5   0.6   0.8  
#> log-posterior  -8.4    1.4 -11.8   -9.1  -8.1  -7.4  -6.9  
#> 
#> Diagnostics:
#>               mcse Rhat n_eff
#> (Intercept)   0.1  1.0  2090 
#> x1            0.0  1.0  2114 
#> x2            0.0  1.0  2412 
#> mean_PPD      0.0  1.0  3796 
#> log-posterior 0.0  1.0  1641 
#> 
#> For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
fit_qs2 <- stan_glm(y ~ x1 + x2, data = QuasiSeparation, family = binomial())
summary(fit_qs2)
#> 
#> Model Info:
#> 
#>  function:     stan_glm
#>  family:       binomial [logit]
#>  formula:      y ~ x1 + x2
#>  algorithm:    sampling
#>  priors:       see help('prior_summary')
#>  sample:       4000 (posterior sample size)
#>  observations: 10
#>  predictors:   3
#> 
#> Estimates:
#>                 mean   sd    2.5%   25%   50%   75%   97.5%
#> (Intercept)    -3.7    1.9  -7.6   -4.9  -3.6  -2.3  -0.2  
#> x1              1.1    0.5   0.2    0.7   1.1   1.4   2.2  
#> x2              0.0    0.4  -0.7   -0.2   0.0   0.3   0.9  
#> mean_PPD        0.6    0.2   0.3    0.5   0.6   0.7   0.9  
#> log-posterior -10.5    1.3 -13.8  -11.1 -10.2  -9.5  -9.0  
#> 
#> Diagnostics:
#>               mcse Rhat n_eff
#> (Intercept)   0.0  1.0  2746 
#> x1            0.0  1.0  1596 
#> x2            0.0  1.0  2469 
#> mean_PPD      0.0  1.0  3688 
#> log-posterior 0.0  1.0  1509 
#> 
#> For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).

14.5 Example: Support of ACA Medicaid Expansion

This example is from Rainey (2016), based on the original analysis in Barrilleaux and Rainey (2014). Load the data included in the jrnold.bayes.notes package:

data("politics_and_need", package = "jrnold.bayes.notes")

The observations are the governors of the US states. The outcome variable, oppose_expansion, indicates whether a governor opposed the Affordable Care Act (ACA) Medicaid expansion. The dataset includes multiple predictors, including whether the governor is a Republican (gop_governor). All Democratic governors (gop_governor == 0) supported the expansion, and only Republican governors (gop_governor == 1) opposed it (though not all of them did).

count(politics_and_need, gop_governor, oppose_expansion)

This is a case of quasi-complete separation: no Democratic governor opposed the expansion, so gop_governor == 0 perfectly predicts oppose_expansion == 0.

What happens when this model is estimated with MLE by glm()?

aca_fmla <-
  oppose_expansion ~ gop_governor + percent_favorable_aca + gop_leg +
         percent_uninsured + bal2012 + multiplier + percent_nonwhite +
         percent_metro
fit_aca1 <- glm(aca_fmla, data = politics_and_need, family = binomial())
summary(fit_aca1)
#> 
#> Call:
#> glm(formula = aca_fmla, family = binomial(), data = politics_and_need)
#> 
#> Deviance Residuals: 
#>    Min      1Q  Median      3Q     Max  
#> -1.738  -0.455   0.000   0.591   2.350  
#> 
#> Coefficients:
#>                        Estimate Std. Error z value Pr(>|z|)
#> (Intercept)           -1.94e+01   3.22e+03   -0.01     1.00
#> gop_governor           2.03e+01   3.22e+03    0.01     0.99
#> percent_favorable_aca  7.31e-03   8.88e-02    0.08     0.93
#> gop_leg                2.43e+00   1.48e+00    1.64     0.10
#> percent_uninsured      1.12e-01   2.72e-01    0.41     0.68
#> bal2012               -7.12e-04   1.14e-02   -0.06     0.95
#> multiplier            -3.22e-01   1.08e+00   -0.30     0.77
#> percent_nonwhite       4.52e-02   8.25e-02    0.55     0.58
#> percent_metro         -7.75e-02   4.74e-02   -1.64     0.10
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 62.687  on 49  degrees of freedom
#> Residual deviance: 31.710  on 41  degrees of freedom
#> AIC: 49.71
#> 
#> Number of Fisher Scoring iterations: 19

As in the toy examples, the quasi-separation inflates both the coefficient on gop_governor and its standard error (roughly 20 and 3,200 on the logit scale), while the remaining coefficients stay at plausible magnitudes. Now estimate the same model with rstanarm, using its default weakly informative priors.

fit_aca2 <- stan_glm(aca_fmla, data = politics_and_need,
                     family = "binomial",
                     show_messages = FALSE, refresh = -1,
                     verbose = FALSE)
summary(fit_aca2)
#> 
#> Model Info:
#> 
#>  function:     stan_glm
#>  family:       binomial [logit]
#>  formula:      oppose_expansion ~ gop_governor + percent_favorable_aca + gop_leg + 
#>     percent_uninsured + bal2012 + multiplier + percent_nonwhite + 
#>     percent_metro
#>  algorithm:    sampling
#>  priors:       see help('prior_summary')
#>  sample:       4000 (posterior sample size)
#>  observations: 50
#>  predictors:   9
#> 
#> Estimates:
#>                         mean   sd    2.5%   25%   50%   75%   97.5%
#> (Intercept)            -3.6    5.1 -13.6   -6.9  -3.6  -0.2   6.5  
#> gop_governor            3.9    1.6   1.2    2.8   3.8   4.9   7.2  
#> percent_favorable_aca   0.0    0.1  -0.2   -0.1   0.0   0.0   0.1  
#> gop_leg                 2.4    1.3   0.0    1.5   2.3   3.2   5.1  
#> percent_uninsured       0.1    0.2  -0.2    0.0   0.1   0.2   0.5  
#> bal2012                 0.0    0.0   0.0    0.0   0.0   0.0   0.0  
#> multiplier             -0.3    1.0  -2.3   -0.9  -0.3   0.4   1.8  
#> percent_nonwhite        0.0    0.1  -0.1    0.0   0.0   0.1   0.1  
#> percent_metro          -0.1    0.0  -0.1   -0.1  -0.1   0.0   0.0  
#> mean_PPD                0.3    0.1   0.2    0.3   0.3   0.4   0.4  
#> log-posterior         -33.3    2.4 -38.9  -34.6 -32.9 -31.6 -29.8  
#> 
#> Diagnostics:
#>                       mcse Rhat n_eff
#> (Intercept)           0.1  1.0  2754 
#> gop_governor          0.0  1.0  2598 
#> percent_favorable_aca 0.0  1.0  3106 
#> gop_leg               0.0  1.0  3302 
#> percent_uninsured     0.0  1.0  2520 
#> bal2012               0.0  1.0  2774 
#> multiplier            0.0  1.0  3142 
#> percent_nonwhite      0.0  1.0  2253 
#> percent_metro         0.0  1.0  3371 
#> mean_PPD              0.0  1.0  4000 
#> log-posterior         0.1  1.0  1385 
#> 
#> For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).

14.6 Questions

  • Estimate the model using stan_glm() with a flat prior and with a Student-t prior with df = 3. Compare the coefficient estimates and the sampling efficiency (\(\hat{R}\), ESS). A sketch of these prior specifications, and of the Cauchy prior in the next question, appears after this list.

  • J. Ghosh, Li, and Mitra (2015) suggest that a Cauchy prior distribution is insufficient for dealing with separation. Try estimating this model with a Cauchy prior with a scale of 2.5. Compare the coefficient estimates and the sampling efficiency (\(\hat{R}\), ESS).

  • See the other application in Rainey (2016) on nuclear proliferation and war. Replicate the analysis with the informative, skeptical, and enthusiastic priors. The data can be found at carlislerainey/priors-for-separation.
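A sketch of the prior specifications referenced above, using rstanarm's prior constructors (the scale values are illustrative; prior = NULL requests a flat, improper prior):

fit_aca_flat <- stan_glm(aca_fmla, data = politics_and_need, family = binomial(),
                         prior = NULL)
fit_aca_t3 <- stan_glm(aca_fmla, data = politics_and_need, family = binomial(),
                       prior = student_t(df = 3, location = 0, scale = 2.5))
fit_aca_cauchy <- stan_glm(aca_fmla, data = politics_and_need, family = binomial(),
                           prior = cauchy(location = 0, scale = 2.5))

Compare the fits with summary(), paying particular attention to the estimate, \(\hat{R}\), and n_eff for gop_governor.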

14.7 References

See Albert and Anderson (1984), Heinze and Schemper (2002), and Heinze (2006) for discussion about separation.

Rainey (2016) provides a mixed MLE/Bayesian, simulation-based approach that places a prior on the coefficient of the variable causing the separation while keeping the other coefficients at their MLE values. Since the results are highly sensitive to this prior, multiple priors should be tried (informative, skeptical, and enthusiastic).

Firth (1993) suggests penalizing the likelihood with the data-driven Jeffreys invariant prior, which keeps the estimates finite under separation; Zorn (2005) also recommends this approach.
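As an aside, and assuming the logistf package (not used elsewhere in these notes) is installed, Firth's penalized-likelihood estimator can be tried on the quasi-separated toy data:

# Firth's bias-reduced logistic regression; the Jeffreys-prior penalty keeps
# the coefficient on x1 finite even under quasi-separation.
library("logistf")
fit_firth <- logistf(y ~ x1 + x2, data = QuasiSeparation)
fit_firth$coefficients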

Greenland and Mansournia (2015) suggest a log-F prior distribution, which has an intuitive interpretation in terms of adding prior observations to the data.