Gaps across social categories like race, class, and gender are important to understand. We would like to know whether there is anything we can do to close these gaps. What if we intervened to reduce incarceration or increase access to education? Would those interventions close gaps across categories of race, class, or gender?
These questions are at the core of a growing literature in epidemiology that addresses them with techniques for causal decomposition analysis (VanderWeele and Robinson 2014; Jackson and VanderWeele 2018; Jackson 2020). The paper that accompanies this software package introduces these methods to a broader set of social scientists.
Lundberg, Ian. Forthcoming. “The gap-closing estimand: A causal approach to study interventions that close disparities across social categories.” Sociological Methods and Research. Preprint available at https://doi.org/10.31235/osf.io/gx4y3.
This package provides software to support inquiry into gap-closing estimands. A guiding principle is to partition the tasks the human user must do from the tasks that the package can automate.
The human user must carry out these tasks:

- Define the intervention: the counterfactual assignment of the treatment
- Make causal assumptions sufficient to identify the gap-closing estimand
- Specify a treatment model and/or an outcome model for estimation

Given those inputs, the package will automatically:

- Estimate by treatment modeling, outcome modeling, or doubly robust estimation, with optional sample splitting and cross-fitting
- Estimate standard errors and confidence intervals by the bootstrap
- Summarize and visualize the results through `summary`, `print`, and `plot` methods
The structure of the vignette is as follows: a quick-start example on simulated data, a detailed walk-through of the three researcher tasks and the estimation options of a `gapclosing()` call, and extensions to machine learning estimators and stochastic interventions.
In a data frame `data`, we have a gap-defining `category` such as race, gender, or class. We have a binary treatment variable that could have counterfactually been different for any individual. We want to know the degree to which an intervention to change the treatment would close gaps across the categories.
These boxes will present an example.
Example. Suppose we have the following data.
- X (`category`): Category of interest, taking values {A, B, C}
- T (`treatment`): Binary treatment variable, taking values 0 and 1
- L (`confounder`): A continuous confounding variable, Uniform(-1, 1)
- Y (`outcome`): A continuous outcome variable, conditionally normal
## category confounder treatment outcome
## 1 C 1.640509 1 -0.7970377
## 2 B 1.032373 0 1.9304909
## 3 C 2.217630 1 -0.2009803
## 4 A -1.642914 0 -0.5015702
## 5 A -1.844897 0 -1.2027025
## 6 A -2.415100 0 -3.9288188
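For concreteness, data with this structure could be simulated along the following lines. This is an illustrative sketch: the coefficients and the exact data-generating process are assumptions for exposition, not the simulation that produced `simulated_data` above.

# Illustrative data-generating process (assumed, not the vignette's own)
set.seed(8)
n <- 1000
category <- sample(c("A", "B", "C"), n, replace = TRUE)
confounder <- runif(n, -1, 1)  # Uniform(-1, 1) per the description above
# Treatment probability depends on the confounder and category (assumed form)
treatment <- rbinom(n, 1, plogis(confounder + (category == "C")))
# Outcome is conditionally normal given confounder, category, and treatment
outcome <- confounder + (category == "B") + 2 * (category == "C") +
  treatment * (1.5 - (category == "C")) + rnorm(n)
simulated_data <- data.frame(category, confounder, treatment, outcome)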
The `gapclosing()` function estimates gaps across categories and the degree to which they would close under the specified `counterfactual_assignments` of the treatment.
estimate <- gapclosing(
data = simulated_data,
counterfactual_assignments = 1,
outcome_formula = formula(outcome ~ confounder + category*treatment),
treatment_formula = formula(treatment ~ confounder + category),
category_name = "category",
se = TRUE,
# You can process the bootstrap in parallel with as many cores as available
parallel_cores = 2
)
By default, this function will do the following:

- Estimate by doubly robust estimation, since both a treatment formula and an outcome formula are provided (the separate treatment-modeling and outcome-modeling estimates remain accessible in the returned object)
- Estimate standard errors from 1000 bootstrap samples (set by the `bootstrap_samples` argument)
- Return a `gapclosing` object which supports `summary`, `print`, and `plot` functions

In this example, the `plot(estimate)` function produces the following visualization. The factual outcomes are unequal across categories, but the counterfactual outcomes are roughly equal. In this simulated setting, the intervention almost entirely closes the gaps across the categories.
Figure 1 produced by plot() function
The `disparityplot()` function lets us zoom in on the factual and counterfactual disparity between two categories of interest. In this case, we see that the intervention lifts outcomes in category A to be more comparable to category B. A `disparityplot` is a `ggplot2` object and can be customized by passing additional layers, as the sketch below shows.
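For example, a call along the following lines produces the figure below. The argument names `category_A` and `category_B` follow the package documentation; check `?disparityplot` in your installed version.

library(ggplot2)
# Zoom in on the A vs. B disparity and customize with an added layer
disparityplot(estimate, category_A = "A", category_B = "B") +
  ggtitle("Disparity between categories A and B")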
Figure 2 produced by disparityplot() function
The `summary` function will print estimates, standard errors, and confidence intervals for all of these results.
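The output below was produced by a call of the form:

summary(estimate)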
## Gap-closing estimates using doubly_robust estimation on one sample.
##
## Treatment model was glm estimation with model formula:
## formula(treatment ~ confounder + category)
##
## Outcome model was lm estimation with model formula:
## formula(outcome ~ confounder + category * treatment)
##
## Factual estimates are means within and disparities across category.
## Counterfactual estimates are under an intervention to set to 1.
## Standard errors are calculated from 1000 bootstrap samples.
##
## Factual mean outcomes:
## # A tibble: 3 × 5
## category estimate se ci.min ci.max
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 A -1.09 0.0919 -1.27 -0.911
## 2 B -0.0384 0.0754 -0.186 0.109
## 3 C 0.495 0.0787 0.341 0.650
##
## Counterfactual mean outcomes (post-intervention means):
## # A tibble: 3 × 5
## category estimate se ci.min ci.max
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 A -0.102 0.155 -0.406 0.202
## 2 B 0.0409 0.100 -0.156 0.238
## 3 C 0.0805 0.100 -0.116 0.277
##
## Factual disparities:
## # A tibble: 6 × 5
## category estimate se ci.min ci.max
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 A - B -1.05 0.122 -1.29 -0.814
## 2 A - C -1.59 0.119 -1.82 -1.35
## 3 B - A 1.05 0.122 0.814 1.29
## 4 B - C -0.534 0.107 -0.745 -0.323
## 5 C - A 1.59 0.119 1.35 1.82
## 6 C - B 0.534 0.107 0.323 0.745
##
## Counterfactual disparities (gap-closing estimands):
## # A tibble: 6 × 5
## category estimate se ci.min ci.max
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 A - B -0.143 0.187 -0.510 0.224
## 2 A - C -0.183 0.186 -0.548 0.183
## 3 B - A 0.143 0.187 -0.224 0.510
## 4 B - C -0.0396 0.140 -0.313 0.234
## 5 C - A 0.183 0.186 -0.183 0.548
## 6 C - B 0.0396 0.140 -0.234 0.313
##
## Additive gap closed: Counterfactual - Factual
## # A tibble: 6 × 5
## category estimate se ci.min ci.max
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 A - B -0.909 0.147 -1.20 -0.621
## 2 A - C -1.40 0.147 -1.69 -1.11
## 3 B - A 0.909 0.147 0.621 1.20
## 4 B - C -0.494 0.0882 -0.667 -0.321
## 5 C - A 1.40 0.147 1.11 1.69
## 6 C - B 0.494 0.0882 0.321 0.667
##
## Proportional gap closed: (Counterfactual - Factual) / Factual
## # A tibble: 6 × 5
## category estimate se ci.min ci.max
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 A - B 0.864 0.173 0.525 1.20
## 2 A - C 0.885 0.115 0.661 1.11
## 3 B - A 0.864 0.173 0.525 1.20
## 4 B - C 0.926 0.293 0.351 1.50
## 5 C - A 0.885 0.115 0.661 1.11
## 6 C - B 0.926 0.293 0.351 1.50
##
## Type plot(name_of_this_object) to visualize results.
This section provides a more detailed overview of the use of `gapclosing()`. It is structured around the three key tasks for the researcher: define the intervention, make causal assumptions for identification, and specify a treatment and/or outcome model. Along the way, this section introduces many of the possible arguments in a `gapclosing()` call.
To answer a gap-closing question, we first need to define the intervention: to what treatment value would units be counterfactually assigned? There are several options.
- Set `counterfactual_assignments = 0` or `counterfactual_assignments = 1`. In this case, we are studying the disparity we would expect if a random person of each category were assigned to control or to treatment (respectively).
- Set `counterfactual_assignments` = π for some π between 0 and 1. In this case, we are studying the disparity we would expect if a random person of each category were assigned to treatment with probability π and to control with probability 1 − π.
- Allow the assignment probability to differ across people by passing a vector of length `nrow(data)` to the `counterfactual_assignments` argument (see the section on stochastic interventions below).

Example. We are interested in the disparity across populations defined by `category` that would persist under a counterfactual assignment setting `treatment` to the value 1.

counterfactual_assignments = 1
The package cannot help you with this step. Gap-closing estimands involve unobserved potential outcomes. Because they are unobserved, the data cannot tell us which variables are needed for estimation. Instead, this is a conceptual choice to be made with tools like Directed Acyclic Graphs (DAGs). See the accompanying Lundberg (Forthcoming) paper for more on identification.
Example. Assume that the set of variables {X, L} is a sufficient conditioning set to identify the gap-closing estimand. Formally, this requires us to assume that within each stratum of X and L, the expected value of the potential outcome Y(1) equals the expected value of the outcome among units who factually have T = 1 within those strata:

𝔼(Y(1) ∣ X, L) = 𝔼(Y ∣ X, L, T = 1)

DAGs are a good way to reason about this assumption: in this example, conditioning (depicted by boxes) on `category` and `confounder` is sufficient to identify the causal effect of `treatment` (blue edge in the DAG), because doing so blocks all backdoor paths between the treatment and the outcome. Notably, the gap-closing estimand makes no claims about the causal effect of `category`, since the counterfactual is defined over `treatment` only.
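As a sketch of how one might encode this DAG and check the conditioning set in code (the `dagitty` package is an illustrative choice here, not a dependency of this package):

library(dagitty)
dag <- dagitty("dag {
  category -> treatment ; category -> outcome
  confounder -> treatment ; confounder -> outcome
  treatment -> outcome
}")
# Returns { category, confounder }: conditioning on both blocks all
# backdoor paths from treatment to outcome
adjustmentSets(dag, exposure = "treatment", outcome = "outcome")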
Once we select a sufficient conditioning set, those predictors will appear in the treatment and/or outcome model used for estimation.
We need to estimate one or both of (1) the probability of treatment given confounders and (2) the conditional mean of the outcome given treatment and confounders. We do that by providing one or both of the following.
- A `treatment_formula` with the binary treatment variable on the left side and all confounders on the right side. This formula will be used by the `treatment_algorithm` (see next step) to estimate the probability of treatment given confounders for each unit. Then, the sample-average outcome under counterfactual assignment to any given treatment can be estimated through inverse probability weighting among units who factually received that treatment.

Example.

treatment_formula = formula(treatment ~ confounder + category)

- An `outcome_formula` with the outcome variable on the left side and the treatment and all confounders on the right side. This formula will be used by the `outcome_algorithm` (see next step) to estimate the conditional mean of the outcome given treatment and confounders. Then, it can be used to predict the potential outcome each unit would expect under any treatment of interest, given confounding variables.

Example.

outcome_formula = formula(outcome ~ confounder + category*treatment)
Whether `treatment_formula` or `outcome_formula` is left `NULL` determines the estimation procedure.

- If only `treatment_formula` is provided, estimation will be by inverse probability of treatment weighting.
- If only `outcome_formula` is provided, estimation will be by prediction of unobserved potential outcomes (the g-formula; see Hernán and Robins (2021)).
- If both `treatment_formula` and `outcome_formula` are provided, the primary estimate will be doubly robust, with separate treatment and outcome modeling estimates accessible through the returned object.

The `treatment_formula` and `outcome_formula` are handed to `treatment_algorithm` and `outcome_algorithm`, which can take the following values.
- `glm` (for `treatment_algorithm` only): A logistic regression model estimated by the `glm` function. If there are important interactions among treatment, category, and covariates, these should be included explicitly.
- `lm` (for `outcome_algorithm` only): An OLS regression model estimated by the `lm` function. If there are important interactions among treatment, category, and covariates, these should be included explicitly.
- `ridge`: A ridge (i.e. L2-penalized) logistic regression model (for `treatment_algorithm`) or linear regression model (for `outcome_algorithm`) estimated by the `glmnet` function in the `glmnet` package (Friedman, Hastie, and Tibshirani 2010), with `alpha = 0` to indicate the ridge penalty. Elastic net and lasso regression are not supported because those approaches could regularize some coefficients to exactly zero, and if you have chosen the needed confounders (step 2) then you would not want their coefficients to be regularized to zero. The penalty term is chosen by `cv.glmnet` and is set to the value `lambda.min` (see the `glmnet` documentation). If there are important interactions among treatment, category, and covariates, these should be included explicitly.
- `gam`: A Generalized Additive Model (logistic if used as `treatment_algorithm`, linear if used as `outcome_algorithm`) estimated by the `gam` function in the `mgcv` package (Wood 2017). The model formula should be specified as in the `mgcv` documentation and may include smooth terms `s()` for continuous covariates. If there are important interactions among treatment, category, and covariates, these should be included explicitly.
- `ranger`: A random forest estimated by the `ranger` function in the `ranger` package (Wright and Ziegler 2017). If used as `treatment_algorithm`, one forest will be fit and predicted treatment probabilities will be truncated to the [.001, .999] range to avoid extreme inverse probability of treatment weights. If used as `outcome_algorithm`, the forest will be estimated separately on treated and control units; the treatment variable does not need to be included in `outcome_formula` in this case.

Example.
treatment_algorithm = "glm"
outcome_algorithm = "lm"
If the data are a sample from a population selected with unequal probabilities, you can also use the `weight_name` option to pass the estimation functions the name of the sampling weight (a variable in `data` proportional to the inverse probability of sample inclusion). If omitted, a simple random sample is assumed.
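For instance, if the data contained a sampling weight column (hypothetically named `sampling_weight` here), the call would look like this sketch:

estimate_weighted <- gapclosing(
  data = simulated_data,
  counterfactual_assignments = 1,
  outcome_formula = formula(outcome ~ confounder + category*treatment),
  treatment_formula = formula(treatment ~ confounder + category),
  category_name = "category",
  weight_name = "sampling_weight"  # hypothetical column in data
)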
Doubly-robust estimation yields advantages that can be conceptualized in two ways. First, the doubly robust estimator is consistent if either the treatment model or the outcome model is correctly specified. Second, it can be viewed as a bias correction: the treatment model is used to estimate and remove the average bias of the outcome model's predictions.
Although double robustness has strong mathematical properties, in any given application with a finite sample it is possible that treatment or outcome modeling could outperform doubly-robust estimation. Therefore, the package supports all three approaches.
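To make the bias-correction view concrete, one textbook form of a doubly robust estimator of the post-intervention mean under assignment to treatment 1 is the following (the package's internal implementation may differ in details):

Ê(Y(1)) = (1/n) Σᵢ [ m̂₁(Xᵢ, Lᵢ) + (Tᵢ / π̂(Xᵢ, Lᵢ)) (Yᵢ − m̂₁(Xᵢ, Lᵢ)) ]

where m̂₁ is the outcome model's predicted outcome under treatment and π̂ is the treatment model's estimated probability of treatment. The first term is the outcome-model estimate; the second term uses inverse probability weights among the treated to estimate and subtract that estimate's average bias.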
Taking the bias-correction view of double robustness above, it is clear that sample splitting affords a further opportunity for improvement: if you learn an outcome model and estimate its average bias on the same sample, you might get a poor estimate of the bias. For this reason, one should consider using one sample (which I call `data_learn`) to learn the prediction functions and another sample (which I call `data_estimate`) to estimate the bias and aggregate to an estimate of the estimand.
In particular, the option `sample_split = "cross_fit"` allows the user to specify that estimation should proceed by a cross-fitting procedure which is analogous to cross-validation.

- Split the data into `n_folds` folds (default here is `n_folds = 2`)
- For a held-out fold f′, learn the prediction functions on all other folds and estimate on f′
- Repeat `n_folds` times with each fold playing the role of f′ in turn, and aggregate the fold-specific estimates

This is the procedure that Chernozhukov et al. (2018) argue is critical to double machine learning for causal estimation, although this type of sample splitting is not new (Bickel 1982).
If `sample_split = "cross_fit"`, the default is to conduct 2-fold cross-fitting, but this can be changed with the `n_folds` argument. The user can also specify their own vector `folds` of fold assignments of length `nrow(data)`, if there is something about the particular setting that would make a manual fold assignment preferable.
Example. (this is the default and can be left implicit)
sample_split = "one_sample"
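Alternatively, a cross-fitting call with manual fold assignment might look like the following sketch. The fold vector here is random and purely illustrative; in practice, a manual assignment would encode setting-specific structure, such as keeping related units in the same fold.

# Assign each row to one of two folds (illustrative random assignment)
my_folds <- sample(rep(1:2, length.out = nrow(simulated_data)))
estimate_manual_folds <- gapclosing(
  data = simulated_data,
  counterfactual_assignments = 1,
  outcome_formula = formula(outcome ~ confounder + category*treatment),
  treatment_formula = formula(treatment ~ confounder + category),
  category_name = "category",
  sample_split = "cross_fit",
  folds = my_folds  # vector of fold assignments, length nrow(data)
)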
The package supports bootstrapped standard error estimation. The
procedure with bootstrap_method = "simple"
(the default) is
valid when the data are a simple random sample from the target
population. In this case, each bootstrap iteration conducts estimation
on a resampled dataset selected with replacement with equal
probabilities. The standard error is calculated as the standard
deviation of the estimate across bootstrap samples, and confidence
intervals are calculated by a normal approximation.
Example. (line 2 is the default and can be left implicit)

se = TRUE
bootstrap_samples = 1000
In some settings, the sample size may be small and categories or treatments of interest may be rare. In these cases, it is possible for one or more simple bootstrap samples to contain zero cases in some (treatment × category) cell of interest. To avoid this problem, `bootstrap_method = "stratified"` conducts bootstrap resampling within blocks defined by (treatment × category). This procedure is valid if you assume that the data are selected at random from the population within these strata, so that across repeated samples from the true population the proportion in each stratum would remain the same.
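For example (a sketch; all other arguments as in the earlier call):

estimate_stratified <- gapclosing(
  data = simulated_data,
  counterfactual_assignments = 1,
  outcome_formula = formula(outcome ~ confounder + category*treatment),
  treatment_formula = formula(treatment ~ confounder + category),
  category_name = "category",
  se = TRUE,
  bootstrap_method = "stratified"  # resample within treatment x category blocks
)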
Many samples are not simple random samples. In complex sample settings, users should implement their own standard error procedures to accurately capture sampling variation related to how their data were collected. The way the data were collected could motivate a resampling strategy that mimics the sources of variation in that sampling process, which the user can implement manually by calling `gapclosing` with `se = FALSE` to calculate a point estimate on each resampled dataset.
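As a sketch, a cluster bootstrap for data sampled by cluster (a hypothetical `cluster_id` variable) could proceed as follows. The accessor used to pull a scalar from the fitted object is an assumption; inspect `str()` of a fitted object to find the relevant element in your version of the package.

cluster_bootstrap_estimates <- replicate(100, {
  # Resample whole clusters with replacement
  chosen <- sample(unique(simulated_data$cluster_id), replace = TRUE)
  resampled <- do.call(rbind, lapply(chosen, function(id) {
    simulated_data[simulated_data$cluster_id == id, ]
  }))
  fit <- gapclosing(
    data = resampled,
    counterfactual_assignments = 1,
    outcome_formula = formula(outcome ~ confounder + category*treatment),
    treatment_formula = formula(treatment ~ confounder + category),
    category_name = "category",
    se = FALSE
  )
  # Hypothetical accessor: extract one counterfactual disparity estimate
  fit$counterfactual_disparities$estimate[1]
})
sd(cluster_bootstrap_estimates)  # bootstrap standard error for that quantity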
Suppose you want to relax parametric functional form assumptions by plugging in a machine learning estimator. As the user, this simply involves changing the arguments to the `gapclosing()` function.
mgcv
Perhaps you are concerned about linearity assumptions: the continuous confounder, for instance, might actually have a nonlinear association with the outcome. We know the truth is linear in this simulated example, but in practice you would never know. You can address this concern by estimating with a GAM, using the `s()` operator from `mgcv` for smooth terms (see Wood (2017)).
estimate_gam <- gapclosing(
data = simulated_data,
counterfactual_assignments = 1,
outcome_formula = formula(outcome ~ s(confounder) + category*treatment),
treatment_formula = formula(treatment ~ s(confounder) + category),
category_name = "category",
treatment_algorithm = "gam",
outcome_algorithm = "gam",
sample_split = "cross_fit"
# Note: Standard errors with `se = TRUE` are supported.
# They are omitted here only to speed vignette build time.
)
ranger
Perhaps you are concerned that the true treatment probability and expected outcome functions have many interactions among the predictors. You can set `treatment_algorithm` and `outcome_algorithm` to `"ranger"` to estimate via the `ranger` function in the `ranger` package (Wright and Ziegler 2017).
One aspect of the way `gapclosing()` operationalizes `ranger()` is unique among the estimation algorithm options. When you choose a random forest, it is because you believe there are many important interactions. Some of the most important interactions may be between the treatment and the other predictors. Therefore, `outcome_algorithm = "ranger"` enforces those interactions by estimating the outcome model separately for treated and control units. For this reason, when `outcome_algorithm = "ranger"` there is no need to include the treatment variable explicitly in the `outcome_formula`.
estimate_ranger <- gapclosing(
data = simulated_data,
counterfactual_assignments = 1,
outcome_formula = formula(outcome ~ confounder + category),
treatment_formula = formula(treatment ~ confounder + category),
category_name = "category",
treatment_algorithm = "ranger",
outcome_algorithm = "ranger",
sample_split = "cross_fit"
# Note: Standard errors with `se = TRUE` are supported.
# They are omitted here only to speed vignette build time.
)
In this simulation, the GLM models are correctly specified and there are no nonlinearities or interactions for the machine learning approaches to learn. In this case, the sample size is large enough that those approaches correctly learn the linear functional form, and all three estimation strategies yield similar estimates.
Note that confidence intervals for GAM and random forest can also be generated with `se = TRUE`, which is turned off here only to speed vignette build time.
The assumptions of a parametric model are always doubtful, leading to a common question: should one always use a more flexible machine learning approach like `ranger`? In a very large sample, a flexible learner would likely be the correct choice. At the sample sizes typical of social science settings, the amount of data may sometimes be insufficient for these algorithms to discover a complex functional form. When the parametric assumptions are approximately true, the parametric estimators may perform better in small samples. What counts as "small" and "large" is difficult to say outside of any specific setting.
The examples above focused on estimation under a fixed treatment assignment: assign treatment 1 with probability 1. But we might also want to know about the gap-closing estimand if we assigned people to treatment stochastically, with some probability between 0 and 1. The `counterfactual_assignments` argument can handle this possibility.
For example, consider the gap-closing estimand if each person were assigned to treatment 1 with probability .75.
estimate_stochastic <- gapclosing(
data = simulated_data,
counterfactual_assignments = .75,
outcome_formula = formula(outcome ~ confounder + category*treatment),
treatment_formula = formula(treatment ~ confounder + category),
category_name = "category"
)
The disparity between categories A and B under that stochastic
intervention (0.75 probability of treatment = 1) is estimated to be
0.44, whereas under the previous deterministic intervention to assign
treatment to the value 1 the disparity would be 0.14. This illustrates
an important point: the gap-closing estimand can be different depending
on the counterfactual assignment rule, as the figure below shows for
counterfactuals in which treatment is assigned with probabilities
ranging from 0 to 1.
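A sketch of how one might compute the points in such a figure follows. The accessor for the disparity estimate is an assumption; inspect `str()` of a fitted object for the element names in your version of the package.

probs <- seq(0, 1, by = .25)
disparity_by_prob <- sapply(probs, function(p) {
  fit <- gapclosing(
    data = simulated_data,
    counterfactual_assignments = p,
    outcome_formula = formula(outcome ~ confounder + category*treatment),
    treatment_formula = formula(treatment ~ confounder + category),
    category_name = "category"
  )
  # Hypothetical accessor for the B - A counterfactual disparity
  with(fit$counterfactual_disparities, estimate[category == "B - A"])
})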
Your stochastic assignments can also differ across people. For example, suppose we assign those in category A to treatment 1 with probability .5, those in category B with probability .4, and those in category C with probability .3. In this case, `counterfactual_assignments` will be set to a vector of length `nrow(data)`.
library(dplyr)  # provides case_when()
our_assignments <- case_when(simulated_data$category == "A" ~ .5,
                             simulated_data$category == "B" ~ .4,
                             simulated_data$category == "C" ~ .3)

estimate_stochastic <- gapclosing(
  data = simulated_data,
  counterfactual_assignments = our_assignments,
  outcome_formula = formula(outcome ~ confounder + category*treatment),
  treatment_formula = formula(treatment ~ confounder + category),
  category_name = "category"
)
That intervention would close the B - A gap by 31%.
The `gapclosing` package is designed to support inquiry into gap-closing estimands, thus promoting new understanding about interventions that can close gaps across social categories. The goal of the package is to automate technical tasks (sample splitting, aggregation to doubly robust estimates, visualization), thus freeing the researcher to devote more attention to scientific tasks like defining the intervention and making causal assumptions.
If you use this package and find a bug, it would be most helpful if you would create an issue on GitHub. Suggestions for additional features are also welcome.
This vignette was compiled on 2025-03-10 02:19:55.078165.