import arviz as az
import numpy as np
import pandas as pd
import pymc as pm
from matplotlib import pyplot as plt
from patsy import dmatrix
import causalpy as cp
## Sampler settings: use the "spawn" multiprocessing context for macOS (Apple silicon) defaults
sampler_kwargs = {
"tune": 2000,
"draws": 2000,
"target_accept": 0.95,
"mp_ctx": "spawn",
"random_seed": 1040,
}
The Paradox of Propensity Scores in Bayesian Inference#
In causal inference the role of the propensity score is often seen to be central. For instance, we’ve seen how the propensity score can be used with the cp.InversePropensityWeighting
class to correct for a species of selection bias by re-weighting our outcome variable and calculating a causal contrast on the re-weighted scale. Additionally we can use the propensity score to visualise and diagnose problems of overlap or covariate balancing across treatment and control groups.
These properties give the propensity score a large role to play in design-based approaches to causal inference. The focus there is on assessing aspects of the treatment allocation to ensure we have identifiability assurances for estimands of interest. What then is their role in model-based or analysis-focused Bayesian methods?
When we use cp.InversePropensityWeighting
to apply various re-weighting techniques we perform a two-step manoeuvre: (1) we estimate the propensity score and (2) we apply the inverse weighting of the score to transform our outcome variable and assess causal contrasts. But being good Bayesians, we might wonder: why go to all this trouble? Can we not simply estimate a fully Bayesian model of treatment and outcome simultaneously?
In this notebook we’ll show why we should be careful when attempting to model the joint distribution of the propensity score and the outcome variable, while still making good use of the propensity score.
Brief Digression on the Mathematics#
Consider that we have the following three variables:

\[ (Y, T, X) \]

where \(Y\) is our outcome variable, \(T\) is our treatment variable, and \(X\) stands in for all other control variables in scope. Now define the propensity score

\[ e(X) = p(T = 1 \mid X) \]

and our outcome model

\[ p(Y \mid T, X) \]

Because \(e(X)\) is a deterministic function of \(X\), we have \(p(Y \mid T, X, e(X)) = p(Y \mid T, X)\), so it is now clearer to see how the propensity score just cancels out: when we’re already conditioning on \(X, T\), the information in the propensity score is technically redundant in the Bayesian setting. Add to this the assumption of unconfoundedness or ignorability used in causal inference: we are arguing that there is no unmeasured confounding, so conditioning on \(X, T\) should be sufficient to identify the causal contrast of interest.
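To make the modularity point explicit, here is a sketch of the standard factorisation argument, with \(\theta\) denoting the outcome-model parameters and \(\phi\) the treatment-model parameters. With independent priors \(p(\theta, \phi) = p(\theta)\,p(\phi)\), and provided no function of \(\phi\) (such as the propensity score) enters the outcome likelihood, the joint posterior factorises as

\[ p(\theta, \phi \mid Y, T, X) \;\propto\; \Big[ p(Y \mid T, X, \theta)\, p(\theta) \Big]\; \Big[ p(T \mid X, \phi)\, p(\phi) \Big] \]

so inference about \(\theta\) ignores the treatment model entirely. It is precisely when the estimated propensity score is fed back into the outcome mean that this factorisation breaks, which is the feedback problem explored below.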
Structure of the Presentation#
We will first consider a simple simulated data set where we know the true values of the data generating process, and then we’ll demonstrate how to fit the propensity score and the outcome in a single joint distribution using PyMC. To contrast this, we’ll show how to fit a two-stage version of the same model. In both models we assume independent priors between the regression for the treatment and the regression for the outcome variable. Nevertheless, we will show that the joint model exhibits a bias due to feedback when there is unmeasured confounding, i.e. when the outcome model is misspecified.
Model Specification
Specifying the Joint Model
Specifying the 2 Stage Model
Application to Simulated Outcome
Application to Mosquito Net Data
Application to Lalonde Data
Application to NHEFS data.
Note the presentation here owes a debt to the work of Fan Li in [Fan et al., 2023] and her presentation here. Additionally we drew on the work and data of Andrew Heiss here and Jordan Nafa and Andrew Heiss here.
Generate Some Data#
N = 4000
np.random.seed(1043)
def inv_logit(z):
"""Compute the inverse logit (sigmoid) of z."""
return 1 / (1 + np.exp(-z))
df1 = pd.DataFrame(
{
"x1": np.random.normal(0, 1, N),
"x2": np.random.normal(0, 1, N),
"x3": np.random.normal(0, 1, N),
}
)
TREATMENT_EFFECT = 2
df1["trt"] = np.random.binomial(1, inv_logit(df1["x1"] + 6 * df1["x2"] + 7 * df1["x3"]))
df1["outcome"] = (
1
+ TREATMENT_EFFECT * df1["trt"]
+ df1["x1"]
+ 0.2 * df1["x2"]
+ -3 * df1["x3"]
+ np.random.normal(0, 1, N)
)
df1.head()
| | x1 | x2 | x3 | trt | outcome |
|---|---|---|---|---|---|
| 0 | 0.570753 | -1.372992 | 0.247824 | 0 | 1.114661 |
| 1 | 0.063973 | 0.904728 | 0.322222 | 1 | 2.952142 |
| 2 | -0.356653 | -0.188922 | -1.369041 | 0 | 5.809954 |
| 3 | -1.372285 | -0.611158 | -1.755242 | 0 | 5.509733 |
| 4 | 0.029955 | 1.773686 | -0.455371 | 1 | 6.058045 |
Specifying the Joint Model#
Now we define a model context that fits our data simultaneously for treatment and outcome. We allow the propensity score estimated in the treatment model to directly inform the outcome variable (in the two-stage models below we will additionally pass it through a non-parametric spline for extra flexibility). Note that we will allow for a richer model specification of the treatment than we do for the outcome model. This is to try and push the models to use the propensity score information to adjust for unmeasured confounding in the outcome model.
coords = {
"betas": ["trt", "x1"],
"betas_trt": ["x1", "x2", "x3"],
"obs": range(df1.shape[0]),
}
N = df1.shape[0]
X_trt = df1[["x1", "x2", "x3"]].values
X_outcome = df1[["trt", "x1"]].values
T_data = df1["trt"].values
Y_data = df1["outcome"].values
def make_joint_model(
X_trt,
X_outcome,
T_data,
Y_data,
coords,
priors={
"beta_": [0, 1],
"beta_trt": [0, 1],
"alpha_trt": [0, 1],
"alpha_outcome": [0, 1],
"sigma": 1,
"beta_ps": [0, 1],
},
noncentred=True,
normal_outcome=True,
):
with pm.Model(coords=coords) as model:
X_data_trt = pm.Data("X", X_trt, dims=("obs", "beta_trt"))
X_data_outcome = pm.Data("X_outcome", X_outcome, dims=("obs", "betas"))
T_data_ = pm.Data("T", T_data, dims="obs")
Y_data_ = pm.Data("Y", Y_data, dims="obs")
if noncentred:
mu_beta_trt, sigma_beta_trt = priors["beta_trt"]
beta_trt_std = pm.Normal("beta_trt_std", 0, 1, dims="betas_trt")
beta_trt = pm.Deterministic(
"beta_trt_",
mu_beta_trt + sigma_beta_trt * beta_trt_std,
dims="betas_trt",
)
mu_beta, sigma_beta = priors["beta_"]
beta_std = pm.Normal("beta_std", 0, 1, dims="betas")
beta = pm.Deterministic(
"beta_", mu_beta + sigma_beta * beta_std, dims="betas"
)
else:
beta_trt = pm.Normal(
"beta_trt_",
priors["beta_trt"][0],
priors["beta_trt"][1],
dims="betas_trt",
)
beta = pm.Normal(
"beta_", priors["beta_"][0], priors["beta_"][1], dims="betas"
)
beta_ps = pm.Normal("beta_ps", priors["beta_ps"][0], priors["beta_ps"][1])
alpha_trt = pm.Normal(
"alpha_trt", priors["alpha_trt"][0], priors["alpha_trt"][1]
)
mu_trt = alpha_trt + pm.math.dot(X_data_trt, beta_trt)
p = pm.Deterministic("p", pm.math.invlogit(mu_trt), dims="obs")
pm.Bernoulli("t_pred", p=p, observed=T_data_, dims="obs")
alpha_outcome = pm.Normal(
"alpha_outcome", priors["alpha_outcome"][0], priors["alpha_outcome"][1]
)
mu_outcome = alpha_outcome + pm.math.dot(X_data_outcome, beta) + beta_ps * p
sigma = pm.HalfNormal("sigma", priors["sigma"])
if normal_outcome:
_ = pm.Normal("like", mu_outcome, sigma, observed=Y_data_, dims="obs")
else:
nu = pm.Exponential("nu", lam=1 / 10)
_ = pm.StudentT(
"like", nu=nu, mu=mu_outcome, sigma=sigma, observed=Y_data_, dims="obs"
)
return model
model = make_joint_model(X_trt, X_outcome, T_data, Y_data, coords)
pm.model_to_graphviz(model)
Note how the two likelihood terms are fit simultaneously.
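In symbols, writing \(e_{\phi}(x_i) = \text{logit}^{-1}(\alpha_{trt} + x_{trt,i}^{\top}\beta_{trt})\) for the model-internal propensity score, the joint likelihood sampled above is (sketching the normal-outcome case)

\[ p(Y, T \mid X, \theta, \phi) = \prod_{i} \text{Normal}\big(y_i \mid \alpha + x_{i}^{\top}\beta + \beta_{ps}\, e_{\phi}(x_i),\ \sigma\big)\ \text{Bernoulli}\big(t_i \mid e_{\phi}(x_i)\big) \]

Because \(e_{\phi}(x_i)\) appears in both factors, the outcome data \(Y\) can inform the treatment-model parameters \(\phi\); this is the feedback channel examined below.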
Note
We are specifying the models in raw PyMC
code for clarity, but we will bundle these methods into the neater CausalPy
API below.
Specifying the 2 Stage Model#
Here we allow for a function that takes the same inputs but fits two separate models. First we fit the treatment model and store idata_treatment; this xarray-backed object stores the posterior estimates of the propensity score. We pass it through to a second outcome model, where we take a random draw from the posterior propensity score and feed it into the outcome regression via a spline component. This allows us to express any non-linearity in the relationship between the propensity score and the outcome. Additionally, it can be seen as a way to augment the outcome model.

While theoretically the propensity score contains no extra information if we are already conditioning on \(X\), practically the literature reports that the propensity score improves the stability of the causal estimates achievable in Bayesian causal modelling. Additionally, we might want to use separate covariates for predicting the outcome and the treatment. In this case, there may be extra information derived in the treatment model that can be used to inform the outcome model.
def make_treatment_model(
X_trt,
T_data,
coords,
priors={
"beta_": [0, 1],
"beta_trt": [0, 1],
"alpha_trt": [0, 1],
"alpha_outcome": [0, 1],
},
noncentred=True,
):
with pm.Model(coords=coords) as model_trt:
X_data_trt = pm.Data("X", X_trt, dims=("obs", "betas_trt"))
T_data_ = pm.Data("T", T_data, dims="obs")
if noncentred:
mu_beta_trt, sigma_beta_trt = priors["beta_trt"]
beta_trt_std = pm.Normal("beta_trt_std", 0, 1, dims="betas_trt")
beta_trt = pm.Deterministic(
"beta_trt_",
mu_beta_trt + sigma_beta_trt * beta_trt_std,
dims="betas_trt",
)
else:
beta_trt = pm.Normal(
"beta_trt_",
priors["beta_trt"][0],
priors["beta_trt"][1],
dims="betas_trt",
)
alpha_trt = pm.Normal(
"alpha_trt", priors["alpha_trt"][0], priors["alpha_trt"][1]
)
mu_trt = alpha_trt + pm.math.dot(X_data_trt, beta_trt)
p = pm.Deterministic("p", pm.math.invlogit(mu_trt), dims="obs")
pm.Bernoulli("t_pred", p=p, observed=T_data_, dims="obs")
return model_trt
def make_outcome_model(
X_outcome,
Y_data,
coords,
priors={
"beta_": [0, 1],
"beta_trt": [0, 1],
"alpha_trt": [0, 1],
"alpha_outcome": [0, 1],
"sigma": 1,
"beta_ps": [0, 1],
},
noncentred=True,
spline_component=False,
propensity_score_idata=None,
normal_outcome=True,
winsorize_boundary=0.0,
):
propensity_scores = az.extract(propensity_score_idata)["p"]
with pm.Model(coords=coords) as model_outcome:
X_data_outcome = pm.Data("X_outcome", X_outcome, dims=("obs", "betas"))
Y_data_ = pm.Data("Y", Y_data, dims="obs")
if noncentred:
mu_beta, sigma_beta = priors["beta_"]
beta_std = pm.Normal("beta_std", 0, 1, dims="betas")
beta = pm.Deterministic(
"beta_", mu_beta + sigma_beta * beta_std, dims="betas"
)
else:
beta = pm.Normal(
"beta_", priors["beta_"][0], priors["beta_"][1], dims="betas"
)
beta_ps = pm.Normal("beta_ps", priors["beta_ps"][0], priors["beta_ps"][1])
chosen = np.random.choice(range(propensity_scores.shape[1]))
p = propensity_scores[:, chosen].values
p = np.clip(p, winsorize_boundary, 1 - winsorize_boundary)
alpha_outcome = pm.Normal(
"alpha_outcome", priors["alpha_outcome"][0], priors["alpha_outcome"][1]
)
mu_outcome = alpha_outcome + pm.math.dot(X_data_outcome, beta) + beta_ps * p
if spline_component:
beta_ps_spline = pm.Normal(
"beta_ps_spline", priors["beta_ps"][0], priors["beta_ps"][1], size=14
)
B = dmatrix(
"bs(ps, knots=knots, degree=3, include_intercept=True, lower_bound=0, upper_bound=1) - 1",
{"ps": p, "knots": np.linspace(0, 1, 10)},
)
B_f = np.asarray(B, order="F")
splines_summed = pm.Deterministic(
"spline_features", pm.math.dot(B_f, beta_ps_spline.T), dims="obs"
)
mu_outcome = (
alpha_outcome + pm.math.dot(X_data_outcome, beta) + splines_summed
)
sigma = pm.HalfNormal("sigma", priors["sigma"])
if normal_outcome:
_ = pm.Normal("like", mu_outcome, sigma, observed=Y_data_, dims="obs")
else:
nu = pm.Exponential("nu", lam=1 / 10)
_ = pm.StudentT(
"like", nu=nu, mu=mu_outcome, sigma=sigma, observed=Y_data_, dims="obs"
)
return model_outcome
def make_2step_model(
X_trt,
X_outcome,
T_data,
Y_data,
coords,
priors,
spline_component=False,
normal_outcome=True,
winsorize_boundary=0.0,
):
treatment_model = make_treatment_model(X_trt, T_data, coords, priors)
with treatment_model:
idata_treatment = pm.sample_prior_predictive()
idata_treatment.extend(pm.sample(**sampler_kwargs))
outcome_model = make_outcome_model(
X_outcome,
Y_data,
coords,
priors,
spline_component=spline_component,
propensity_score_idata=idata_treatment,
normal_outcome=normal_outcome,
winsorize_boundary=winsorize_boundary,
)
with outcome_model:
idata_outcome = pm.sample_prior_predictive()
idata_outcome.extend(pm.sample(**sampler_kwargs))
return idata_treatment, idata_outcome, treatment_model, outcome_model
model_treatment = make_treatment_model(X_trt, T_data, coords)
pm.model_to_graphviz(model_treatment)
Specifying a Simple Regression Model without Propensity Scores#
Now we specify a simple regression model which does not make use of the propensity score information. This model will be used to assess how much extra information is gleaned from the presence of the propensity score as a covariate in our outcome model.
def make_reg_model(
X_outcome,
Y_data,
coords,
priors={
"beta_": [0, 1],
"alpha_outcome": [0, 1],
"sigma": 1,
},
noncentred=True,
):
with pm.Model(coords=coords) as reg_model:
X_data_outcome = pm.Data("X_outcome", X_outcome, dims=("obs", "betas"))
Y_data_ = pm.Data("Y", Y_data, dims="obs")
if noncentred:
mu_beta, sigma_beta = priors["beta_"]
beta_std = pm.Normal("beta_std", 0, 1, dims="betas")
beta = pm.Deterministic(
"beta_", mu_beta + sigma_beta * beta_std, dims="betas"
)
else:
beta = pm.Normal(
"beta_", priors["beta_"][0], priors["beta_"][1], dims="betas"
)
alpha_outcome = pm.Normal(
"alpha_outcome", priors["alpha_outcome"][0], priors["alpha_outcome"][1]
)
mu_outcome = alpha_outcome + pm.math.dot(X_data_outcome, beta)
sigma = pm.HalfNormal("sigma", priors["sigma"])
_ = pm.Normal("like", mu_outcome, sigma, observed=Y_data_, dims="obs")
idata = pm.sample(**sampler_kwargs)
return reg_model, idata
Putting it all Together#
We are now in a position to fit both joint and modular (2-stage) models to our simulated data. We are seeking to assess how the different approaches to incorporating propensity score information impact the accuracy of the treatment effect estimate.
Note
We are using this snippet of code to pass the posterior distribution of the propensity score through to the outcome model, by plugging a single posterior draw of the score into the MCMC process of the outcome model.
chosen = np.random.choice(range(propensity_scores.shape[1]))
p = propensity_scores[:, chosen].values
This allows us to modularise the fitting process but retain the useful information stored in the propensity score.
priors = {
"beta_": [0, 3],
"beta_trt": [0, 1],
"alpha_trt": [0, 1],
"alpha_outcome": [0, 1],
"sigma": 1,
"beta_ps": [0, 1],
}
joint_model = make_joint_model(X_trt, X_outcome, T_data, Y_data, coords, priors=priors)
with joint_model:
idata_joint = pm.sample(**sampler_kwargs)
(
idata_treatment_2s_joint,
idata_outcome_2s_joint,
treatment_model_joint,
outcome_model_joint,
) = make_2step_model(
X_trt, X_outcome, T_data, Y_data, coords, priors, spline_component=False
)
(
idata_treatment_2s_joint_spline,
idata_outcome_2s_joint_spline,
treatment_model_joint_spline,
outcome_model_joint_spline,
) = make_2step_model(
X_trt, X_outcome, T_data, Y_data, coords, priors, spline_component=True
)
reg_model, idata_outcome_simple_reg = make_reg_model(X_outcome, Y_data, coords)
Show code cell output
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta_trt_std, beta_std, beta_ps, alpha_trt, alpha_outcome, sigma]
/Users/nathanielforde/mambaforge/envs/CausalPy/lib/python3.13/site-packages/pymc/step_methods/hmc/quadpotential.py:316: RuntimeWarning: overflow encountered in dot
return 0.5 * np.dot(x, v_out)
Sampling 4 chains for 2_000 tune and 2_000 draw iterations (8_000 + 8_000 draws total) took 80 seconds.
Sampling: [alpha_trt, beta_trt_std, t_pred]
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta_trt_std, alpha_trt]
Sampling 4 chains for 2_000 tune and 2_000 draw iterations (8_000 + 8_000 draws total) took 20 seconds.
Sampling: [alpha_outcome, beta_ps, beta_std, like, sigma]
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta_std, beta_ps, alpha_outcome, sigma]
/Users/nathanielforde/mambaforge/envs/CausalPy/lib/python3.13/site-packages/pymc/step_methods/hmc/quadpotential.py:316: RuntimeWarning: overflow encountered in dot
return 0.5 * np.dot(x, v_out)
Sampling 4 chains for 2_000 tune and 2_000 draw iterations (8_000 + 8_000 draws total) took 16 seconds.
Sampling: [alpha_trt, beta_trt_std, t_pred]
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta_trt_std, alpha_trt]
Sampling 4 chains for 2_000 tune and 2_000 draw iterations (8_000 + 8_000 draws total) took 18 seconds.
Sampling: [alpha_outcome, beta_ps, beta_ps_spline, beta_std, like, sigma]
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta_std, beta_ps, alpha_outcome, beta_ps_spline, sigma]
Sampling 4 chains for 2_000 tune and 2_000 draw iterations (8_000 + 8_000 draws total) took 37 seconds.
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta_std, alpha_outcome, sigma]
/Users/nathanielforde/mambaforge/envs/CausalPy/lib/python3.13/site-packages/pymc/step_methods/hmc/quadpotential.py:316: RuntimeWarning: overflow encountered in dot
return 0.5 * np.dot(x, v_out)
Sampling 4 chains for 2_000 tune and 2_000 draw iterations (8_000 + 8_000 draws total) took 12 seconds.
ax = az.plot_forest(
[
idata_outcome_2s_joint,
idata_outcome_2s_joint_spline,
idata_joint,
idata_outcome_simple_reg,
],
var_names=["beta_"],
model_names=["2 Stage", "2 Stage + Spline", "1 Stage", "Simple Regression"],
combined=True,
figsize=(10, 5),
)
ax[0].axvline(2, label="True Treatment Value", color="k")
ax[0].set_title("Comparing Joint and 2 Stage Propensity Score Parameter Fits");

Show code cell source
compare_estimate = pd.concat(
{
"1-stage-model": az.summary(
idata_joint, var_names=["alpha_trt", "beta_", "beta_ps", "alpha_outcome"]
),
"2-stage-model": az.summary(
idata_outcome_2s_joint, var_names=["beta_", "alpha_outcome", "beta_ps"]
),
"2-stage-model_spline": az.summary(
idata_outcome_2s_joint_spline,
var_names=["beta_", "alpha_outcome", "beta_ps"],
),
"Simple Regression": az.summary(
idata_outcome_simple_reg,
var_names=["beta_", "alpha_outcome"],
),
}
)
compare_estimate[["mean", "sd", "hdi_3%", "hdi_97%", "r_hat"]]
| | | mean | sd | hdi_3% | hdi_97% | r_hat |
|---|---|---|---|---|---|---|
| 1-stage-model | alpha_trt | 0.000 | 0.021 | -0.040 | 0.041 | 1.0 |
| | beta_[trt] | 2.314 | 0.057 | 2.206 | 2.419 | 1.0 |
| | beta_[x1] | 1.296 | 0.070 | 1.167 | 1.428 | 1.0 |
| | beta_ps | -15.297 | 0.294 | -15.840 | -14.738 | 1.0 |
| | alpha_outcome | 8.507 | 0.161 | 8.216 | 8.821 | 1.0 |
| 2-stage-model | beta_[trt] | 2.058 | 0.189 | 1.696 | 2.398 | 1.0 |
| | beta_[x1] | 1.155 | 0.041 | 1.079 | 1.235 | 1.0 |
| | alpha_outcome | 3.067 | 0.060 | 2.953 | 3.176 | 1.0 |
| | beta_ps | -4.147 | 0.210 | -4.541 | -3.759 | 1.0 |
| 2-stage-model_spline | beta_[trt] | 1.824 | 0.179 | 1.490 | 2.161 | 1.0 |
| | beta_[x1] | 1.162 | 0.040 | 1.087 | 1.236 | 1.0 |
| | alpha_outcome | 1.007 | 0.293 | 0.441 | 1.544 | 1.0 |
| | beta_ps | -0.009 | 0.983 | -1.797 | 1.880 | 1.0 |
| Simple Regression | beta_[trt] | -1.308 | 0.085 | -1.464 | -1.144 | 1.0 |
| | beta_[x1] | 1.107 | 0.043 | 1.029 | 1.190 | 1.0 |
| | alpha_outcome | 2.665 | 0.060 | 2.549 | 2.773 | 1.0 |
Here the models fail to recover substantially similar and correct results; in particular, the simple regression model is wildly off. However, the 2-stage models seem to perform better than the joint model specification. This is interesting and demonstrates a key property of propensity scores in the Bayesian setting: propensity scores are useful correctives within a regression context, but we need to be careful how the model is specified.
The Problem of Feedback#
The issue here is sometimes called Bayesian feedback or “collider bias via the likelihood” [Griffith et al., 2020], and it’s a key issue when trying to build joint models for causal inference in the Bayesian paradigm. Because we have fit the outcome and the treatment models simultaneously, the outcome can influence the posterior distribution of the parameters \(\beta\) in the treatment model, which violates the idea of design-before-analysis. We have here an apparent example of a slight bias due to this effect. The two-stage modular approach better recovers the true treatment effect and avoids the risk of collider bias, i.e. in the modular implementation we are able to use the propensity score to adjust for accuracy and compensate for the missing variables x2 and x3.
💡 Key Take-away: With an underspecified outcome model, we may use a well specified propensity score to adjust the model and retrieve accurate treatment effect estimates. However, this tends to break down if we have estimated both propensity score and outcome in a joint Bayesian model, due to feedback effects. The solution is to use the propensity score in a 2-stage fashion.
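One quick, informal way to inspect the feedback directly is to compare the treatment-model coefficients estimated jointly with those estimated in isolation (a sketch assuming the idata objects fitted above are still in scope); any systematic difference between the two is being driven by the outcome data.

# Compare the treatment-model coefficients under joint vs. two-stage estimation.
pd.concat(
    {
        "joint": az.summary(idata_joint, var_names=["beta_trt_"]),
        "two_stage": az.summary(idata_treatment_2s_joint, var_names=["beta_trt_"]),
    }
)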
pm.model_to_graphviz(outcome_model_joint_spline)
Propensity Score Quantiles - Joint Model#
We can see how the different model specifications have yielded distinct propensity score estimates: the joint specification seems to compensate for missing covariates in the outcome model by adjusting the propensity score latent in the treatment model too. The differences in the quantiles highlight a numeric difference in the propensity score distributions between the two models.
idata_joint["posterior"]["p"].quantile(
[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
).round(3).values
array([0.224, 0.308, 0.38 , 0.442, 0.502, 0.563, 0.624, 0.691, 0.769])
Propensity Score Quantiles - 2 Stage Modular Model#
The range and position of these quantiles are vastly different from those of the joint model specification.
idata_treatment_2s_joint["posterior"]["p"].quantile(
[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
).round(3).values
array([0. , 0.001, 0.014, 0.11 , 0.506, 0.886, 0.986, 0.999, 1. ])
Comparing Propensity Score Skew#
We have seen that the treatment effects reported by the two models differ and that the joint model exhibits a bias away from the true treatment effect. But we might want to see how this bias manifests in the propensity score distribution.
def compare_propensity_dists(idata_2s, idata_1s):
fig, ax = plt.subplots(figsize=(10, 4))
for i in range(100):
s2 = idata_2s["posterior"].stack(z=("chain", "draw"))["p"][:, i]
s1 = idata_1s["posterior"].stack(z=("chain", "draw"))["p"][:, i]
if i == 0:
ax.hist(
s2,
alpha=0.1,
color="orange",
label="Propensity Scores 2-stage Estimation",
)
# Pivoted to compare shape
ax.hist(
1 - s1,
alpha=0.1,
color="blue",
label="Propensity Scores 1-stage Estimation",
)
else:
ax.hist(s2, alpha=0.01, color="orange")
# Pivoted to compare shape
ax.hist(1 - s1, alpha=0.01, color="blue")
ax.legend()
ax.set_title(
"Comparing Propensity Distributions \n Two-stage and Joint Estimation routines"
)
compare_propensity_dists(idata_treatment_2s_joint, idata_joint)

These are radically different models due to the feedback mechanism. How does this phenomenon play out in some real world examples?
spline_features = az.extract(idata_outcome_2s_joint_spline)["spline_features"]
propensity_scores = az.extract(idata_treatment_2s_joint_spline)["p"]
fig, ax = plt.subplots(figsize=(10, 5))
for i in range(100):
temp = pd.DataFrame(
{"prop_score": propensity_scores[:, i], "spline": spline_features[:, i]}
)
temp.sort_values("prop_score", inplace=True)
ax.plot(temp["prop_score"], temp["spline"], alpha=0.01, color="blue")
temp = pd.DataFrame(
{
"prop_score": propensity_scores.mean(axis=1),
"spline": spline_features.mean(axis=1),
}
)
temp.sort_values("prop_score", inplace=True)
ax.plot(temp["prop_score"], temp["spline"], color="k", label="Expected Value")
ax.set_title("Additive Spline Effect on the Propensity Score")
ax.legend()
ax.set_ylabel("Spline Feature Contribution")
ax.set_xlabel("Propensity Score");

Note
The idea of allowing a spline component to contribute to the mu_outcome, or expectation of our model, is that in some cases there might be a non-linear relationship between the propensity for treatment and the outcome of interest. Incorporating the spline function of p grants our model this kind of flexibility, accounting for different responses across levels of the propensity score.
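As a minimal, illustrative sketch of the basis being constructed (ps_demo and B_demo are made-up names, not part of the analysis above): with 10 interior knots, degree 3 and an intercept term, the patsy basis has 14 columns, matching the size=14 of beta_ps_spline.

# Illustrative only: build the same spline basis on simulated propensity scores.
ps_demo = np.random.uniform(0, 1, 500)
B_demo = dmatrix(
    "bs(ps, knots=knots, degree=3, include_intercept=True, lower_bound=0, upper_bound=1) - 1",
    {"ps": ps_demo, "knots": np.linspace(0, 1, 10)},
)
np.asarray(B_demo).shape  # (500, 14): 10 interior knots + degree 3 + intercept
# The spline contribution to mu_outcome is then the weighted sum dot(B, beta_ps_spline).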
Causal Estimate with Do-Operator#
We can also confirm the model implications through counterfactual imputation. This ties the Bayesian setting back to the potential outcomes framework. The fundamental problem of causal inference, when seen as a missing data problem, allows us to derive causal estimands through imputation of the potential outcomes. Here we “push forward” the posterior predictive distribution for \(Y\) under different treatment settings.
X_outcome_trt = X_outcome.copy()
X_outcome_trt[:, 0] = 1
X_outcome_ntrt = X_outcome.copy()
X_outcome_ntrt[:, 0] = 0
First we specify our counterfactual input data. Then we push them through the joint model distribution using the do-operator in PyMC
to sample from the posterior predictive distribution, giving us samples of the potential outcomes \(Y(1), Y(0)\)
with pm.do(
joint_model,
{"T": np.ones(len(df1), dtype=np.int32), "X_outcome": X_outcome_trt},
prune_vars=True,
) as treatment_model:
idata_trt = pm.sample_posterior_predictive(idata_joint, var_names=["like", "p"])
with pm.do(
joint_model,
{"T": np.zeros(len(df1), dtype=np.int32), "X_outcome": X_outcome_ntrt},
prune_vars=True,
) as ntreatment_model:
idata_ntrt = pm.sample_posterior_predictive(idata_joint, var_names=["like", "p"])
Sampling: [like]
Sampling: [like]
For the sake of illustration we calculate the mean value for \(Y(1)\)
idata_trt["posterior_predictive"]["like"].mean().item()
3.1565468847153166
The mean value for \(Y(0)\)
idata_ntrt["posterior_predictive"]["like"].mean().item()
0.8421930047703092
and their difference, which is the causal estimand of interest.
(
idata_trt["posterior_predictive"]["like"].mean().item()
- idata_ntrt["posterior_predictive"]["like"].mean().item()
)
2.314353879945007
In this way we can impute potential outcomes and gain insight into any variety of causal estimands we care to calculate. In this case, we have re-derived the insight that the joint model yields a biased treatment effect estimate.
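Rather than collapsing to point means, we can also keep the full posterior of the average treatment effect by differencing the two imputed outcomes within each posterior draw. A sketch, assuming the observation dimension is labelled obs as in the model definitions above:

# Draw-wise ATE: average over units within each posterior draw, leaving (chain, draw) samples.
ate_draws = (
    idata_trt["posterior_predictive"]["like"].mean(dim="obs")
    - idata_ntrt["posterior_predictive"]["like"].mean(dim="obs")
)
az.plot_posterior(ate_draws.values.flatten(), ref_val=TREATMENT_EFFECT);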
Nets Example#
Next we’ll assess a data set used by Andrew Heiss to demonstrate propensity score methods with brms.
nets_df = cp.load_data("nets")
nets_df["trt"] = nets_df["net_num"]
nets_df["outcome"] = nets_df["malaria_risk"]
nets_df.head()
| | id | net | net_num | malaria_risk | income | health | household | eligible | temperature | resistance | trt | outcome |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | True | 1 | 33 | 781 | 56 | 2 | False | 21.1 | 59 | 1 | 33 |
| 1 | 2 | False | 0 | 42 | 974 | 57 | 4 | False | 26.5 | 73 | 0 | 42 |
| 2 | 3 | False | 0 | 80 | 502 | 15 | 3 | False | 25.6 | 65 | 0 | 80 |
| 3 | 4 | True | 1 | 34 | 671 | 20 | 5 | True | 21.3 | 46 | 1 | 34 |
| 4 | 5 | False | 0 | 44 | 728 | 17 | 5 | False | 19.2 | 54 | 0 | 44 |
coords = {
"betas": ["trt", "income"],
"betas_trt": ["income", "temperature", "health"],
"obs": range(nets_df.shape[0]),
}
# Process and Standardise Inputs
N = nets_df.shape[0]
X_trt = nets_df[["income", "temperature", "health"]].values
X_trt = (X_trt - X_trt.mean(axis=0)) / X_trt.std(axis=0)
X_outcome = nets_df[["trt", "income"]].values
X_outcome = (X_outcome - X_outcome.mean(axis=0)) / X_outcome.std(axis=0)
T_data = nets_df["trt"].values
X_outcome[:, 0] = T_data
Y_data = nets_df["outcome"].values
priors = {
"beta_": [0, 1],
"beta_trt": [0, 1],
"alpha_trt": [0, 1],
"alpha_outcome": [40, 30],
"sigma": 15,
"beta_ps": [0, 1],
}
net_model = make_joint_model(
X_trt, X_outcome, T_data, Y_data, coords, priors=priors, normal_outcome=False
)
with net_model:
idata_net = pm.sample(tune=2000, target_accept=0.98)
idata_treatment_2s_net, idata_outcome_2s_net, treatment_model_net, outcome_model_net = (
make_2step_model(
X_trt, X_outcome, T_data, Y_data, coords, priors=priors, normal_outcome=False
)
)
reg_model_nets, idata_outcome_simple_reg_nets = make_reg_model(
X_outcome, Y_data, coords
)
Show code cell output
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta_trt_std, beta_std, beta_ps, alpha_trt, alpha_outcome, sigma, nu]
Sampling 4 chains for 2_000 tune and 1_000 draw iterations (8_000 + 4_000 draws total) took 17 seconds.
Sampling: [alpha_trt, beta_trt_std, t_pred]
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta_trt_std, alpha_trt]
Sampling 4 chains for 2_000 tune and 2_000 draw iterations (8_000 + 8_000 draws total) took 14 seconds.
Sampling: [alpha_outcome, beta_ps, beta_std, like, nu, sigma]
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta_std, beta_ps, alpha_outcome, sigma, nu]
Sampling 4 chains for 2_000 tune and 2_000 draw iterations (8_000 + 8_000 draws total) took 15 seconds.
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta_std, alpha_outcome, sigma]
/Users/nathanielforde/mambaforge/envs/CausalPy/lib/python3.13/site-packages/pymc/step_methods/hmc/quadpotential.py:316: RuntimeWarning: overflow encountered in dot
return 0.5 * np.dot(x, v_out)
Sampling 4 chains for 2_000 tune and 2_000 draw iterations (8_000 + 8_000 draws total) took 11 seconds.
ax = az.plot_forest(
[idata_outcome_2s_net, idata_net, idata_outcome_simple_reg_nets],
var_names=["beta_"],
model_names=["2 Stage", "1 Stage", "Simple Regression"],
combined=True,
figsize=(10, 4),
)
ax[0].axvline(-10, label="True Treatment Value", color="k")
ax[0].set_title("Comparing Joint and 2 Stage Propensity Score Parameter Fits");

As before we have used both model specifications to derive estimates of the treatment effect. In this case we have allowed the outcome model access to only a single predictor, income. Both models seem to approximately recover the reported -10 treatment effect, with a large degree of uncertainty, but the modular 2-stage estimate pulls away from the joint model estimate.
Show code cell source
compare_estimate = pd.concat(
{
"1-stage-model": az.summary(
idata_net,
var_names=["alpha_trt", "beta_", "beta_ps", "alpha_outcome"],
),
"2-stage-model": az.summary(
idata_outcome_2s_net, var_names=["beta_", "alpha_outcome", "beta_ps"]
),
"Simple Regression": az.summary(
idata_outcome_simple_reg_nets, var_names=["beta_", "alpha_outcome"]
),
}
)
compare_estimate[["mean", "sd", "hdi_3%", "hdi_97%", "r_hat"]]
| | | mean | sd | hdi_3% | hdi_97% | r_hat |
|---|---|---|---|---|---|---|
| 1-stage-model | alpha_trt | -0.261 | 0.052 | -0.358 | -0.165 | 1.0 |
| | beta_[trt] | -9.933 | 0.262 | -10.410 | -9.440 | 1.0 |
| | beta_[income] | -10.097 | 0.221 | -10.503 | -9.680 | 1.0 |
| | beta_ps | -16.194 | 0.654 | -17.454 | -15.004 | 1.0 |
| | alpha_outcome | 46.580 | 0.397 | 45.841 | 47.317 | 1.0 |
| 2-stage-model | beta_[trt] | -10.053 | 0.316 | -10.640 | -9.475 | 1.0 |
| | beta_[income] | -10.593 | 0.192 | -10.947 | -10.232 | 1.0 |
| | alpha_outcome | 42.202 | 0.415 | 41.418 | 42.976 | 1.0 |
| | beta_ps | -7.146 | 0.938 | -8.913 | -5.427 | 1.0 |
| Simple Regression | beta_[trt] | -8.534 | 0.338 | -9.198 | -7.920 | 1.0 |
| | beta_[income] | -11.515 | 0.169 | -11.831 | -11.193 | 1.0 |
| | alpha_outcome | 37.832 | 0.216 | 37.439 | 38.255 | 1.0 |
This kind of difference need not be very concerning, but we should check whether the bias stems from a difference in latent propensity scores, as before.
idata_net["posterior"]["p"].quantile(
[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
).round(3).values
array([0.181, 0.249, 0.31 , 0.369, 0.43 , 0.5 , 0.571, 0.643, 0.727])
The quantiles of the propensity distributions across both models seem different.
idata_treatment_2s_net["posterior"]["p"].quantile(
[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
).round(3).values
array([0.239, 0.283, 0.317, 0.349, 0.379, 0.412, 0.449, 0.492, 0.556])
compare_propensity_dists(idata_treatment_2s_net, idata_net)

Here we see some divergence between the propensity score estimates, indicating that the regression adjustment approach in the 2-stage outcome model is correcting for bias in the joint distribution. However, both credible intervals contain the reported treatment effect of -10, so the skew is perhaps less concerning. This is despite the fact that we removed the useful predictors temperature and health from the outcome model specification. The model leans on the information contained in the propensity score and weights the beta_ps coefficient appropriately.
LaLonde Example#
The LaLonde data set is famous because it highlights a problem with naive causal contrasts. It is discussed by Angrist and Pischke in Mostly Harmless Econometrics as an example of how regression controls can tolerably address selection effects in a way similar to propensity score weighting. So we should hope that a well-specified outcome model can identify the treatment effects plausibly here too.
lalonde = df = cp.load_data("lalonde")
lalonde[["hispan", "white"]] = pd.get_dummies(lalonde["race"], drop_first=True)
lalonde.dropna(inplace=True)
lalonde.head()
| | rownames | treat | age | educ | race | married | nodegree | re74 | re75 | re78 | hispan | white |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NSW1 | 1 | 37 | 11 | black | 1 | 1 | 0.0 | 0.0 | 9930.0460 | False | False |
| 1 | NSW2 | 1 | 22 | 9 | hispan | 0 | 1 | 0.0 | 0.0 | 3595.8940 | True | False |
| 2 | NSW3 | 1 | 30 | 12 | black | 0 | 0 | 0.0 | 0.0 | 24909.4500 | False | False |
| 3 | NSW4 | 1 | 27 | 11 | black | 0 | 1 | 0.0 | 0.0 | 7506.1460 | False | False |
| 4 | NSW5 | 1 | 33 | 8 | black | 0 | 1 | 0.0 | 0.0 | 289.7899 | False | False |
lalonde.groupby("treat")["re78"].mean().diff()
treat
0 NaN
1 -635.026212
Name: re78, dtype: float64
The naive group difference suggests a negative effect. Let’s see how our two models work when we remove predictors from the propensity model.
coords = {
"betas": ["treat", "nodegree", "married"],
"betas_trt": ["age", "educ", "hispan", "white", "married", "nodegree"],
"obs": range(lalonde.shape[0]),
}
N = lalonde.shape[0]
X_trt = (
lalonde[["age", "educ", "hispan", "white", "married", "nodegree"]]
.astype(np.int32)
.values
)
X_trt = (X_trt - X_trt.mean(axis=0)) / X_trt.std(axis=0)
X_outcome = lalonde[["treat", "nodegree", "married"]].astype(np.int32).values
# X_outcome = (X_outcome - X_outcome.mean()) / X_outcome.std()
T_data = lalonde["treat"].values
Y_data = lalonde["re78"].values
priors = {
"beta_": [0, 4000],
"beta_trt": [0, 1],
"alpha_trt": [0, 1],
"alpha_outcome": [2000, 500],
"sigma": 500,
"beta_ps": [0, 30],
}
lalonde_model = make_joint_model(
X_trt, X_outcome, T_data, Y_data, coords, priors, noncentred=False
)
with lalonde_model:
idata_lalonde = pm.sample(tune=5000)
(
idata_treatment_2s_lalonde,
idata_outcome_2s_lalonde,
treatment_model_lalonde,
outcome_model_lalonde,
) = make_2step_model(
X_trt,
X_outcome,
T_data,
Y_data,
coords,
priors=priors,
spline_component=True,
winsorize_boundary=0.1,
)
reg_model_lalonde, idata_outcome_simple_reg_lalonde = make_reg_model(
X_outcome, Y_data, coords, priors=priors
)
Show code cell output
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta_trt_, beta_, beta_ps, alpha_trt, alpha_outcome, sigma]
/Users/nathanielforde/mambaforge/envs/CausalPy/lib/python3.13/site-packages/pymc/step_methods/hmc/quadpotential.py:316: RuntimeWarning: overflow encountered in dot
return 0.5 * np.dot(x, v_out)
/Users/nathanielforde/mambaforge/envs/CausalPy/lib/python3.13/site-packages/pymc/step_methods/hmc/quadpotential.py:316: RuntimeWarning: overflow encountered in dot
return 0.5 * np.dot(x, v_out)
Sampling 4 chains for 5_000 tune and 1_000 draw iterations (20_000 + 4_000 draws total) took 6 seconds.
Sampling: [alpha_trt, beta_trt_std, t_pred]
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta_trt_std, alpha_trt]
Sampling 4 chains for 2_000 tune and 2_000 draw iterations (8_000 + 8_000 draws total) took 13 seconds.
Sampling: [alpha_outcome, beta_ps, beta_ps_spline, beta_std, like, sigma]
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta_std, beta_ps, alpha_outcome, beta_ps_spline, sigma]
Sampling 4 chains for 2_000 tune and 2_000 draw iterations (8_000 + 8_000 draws total) took 14 seconds.
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta_std, alpha_outcome, sigma]
/Users/nathanielforde/mambaforge/envs/CausalPy/lib/python3.13/site-packages/pymc/step_methods/hmc/quadpotential.py:316: RuntimeWarning: overflow encountered in dot
return 0.5 * np.dot(x, v_out)
/Users/nathanielforde/mambaforge/envs/CausalPy/lib/python3.13/site-packages/pymc/step_methods/hmc/quadpotential.py:316: RuntimeWarning: overflow encountered in dot
return 0.5 * np.dot(x, v_out)
Sampling 4 chains for 2_000 tune and 2_000 draw iterations (8_000 + 8_000 draws total) took 11 seconds.
ax = az.plot_forest(
[idata_outcome_2s_lalonde, idata_lalonde, idata_outcome_simple_reg_lalonde],
var_names=["beta_", "sigma"],
model_names=["2 Stage", "1 Stage", "Simple Regression"],
combined=True,
figsize=(10, 4),
)
# Experimental benchmark between 1600 - 1800
ax[0].axvline(1700, color="black")
ax[0].set_title("Comparing Joint and 2 Stage Propensity Score Parameter Fits");

Show code cell source
compare_estimate = pd.concat(
{
"1-stage-model": az.summary(
idata_lalonde,
var_names=["alpha_trt", "beta_", "beta_ps", "alpha_outcome", "sigma"],
),
"2-stage-model": az.summary(
idata_outcome_2s_lalonde,
var_names=["beta_", "beta_ps", "alpha_outcome", "sigma"],
),
"Simple Regression": az.summary(
idata_outcome_simple_reg_lalonde,
var_names=["beta_", "alpha_outcome", "sigma"],
),
}
)
compare_estimate[["mean", "sd", "hdi_3%", "hdi_97%", "r_hat"]]
| | | mean | sd | hdi_3% | hdi_97% | r_hat |
|---|---|---|---|---|---|---|
| 1-stage-model | alpha_trt | -1.360 | 0.136 | -1.607 | -1.096 | 1.0 |
| | beta_[treat] | 1596.367 | 592.252 | 522.004 | 2698.841 | 1.0 |
| | beta_[nodegree] | -478.468 | 494.181 | -1411.118 | 431.650 | 1.0 |
| | beta_[married] | 3941.883 | 521.193 | 2935.781 | 4905.149 | 1.0 |
| | beta_ps | 1.515 | 30.782 | -59.159 | 56.783 | 1.0 |
| | alpha_outcome | 4318.250 | 369.288 | 3624.124 | 5022.575 | 1.0 |
| | sigma | 6567.379 | 152.789 | 6262.905 | 6837.376 | 1.0 |
| 2-stage-model | beta_[treat] | 1588.023 | 560.154 | 606.842 | 2705.402 | 1.0 |
| | beta_[nodegree] | -459.243 | 493.565 | -1378.890 | 456.611 | 1.0 |
| | beta_[married] | 3938.877 | 513.101 | 2988.612 | 4920.545 | 1.0 |
| | beta_ps | -0.041 | 29.752 | -56.003 | 56.809 | 1.0 |
| | alpha_outcome | 4311.600 | 368.082 | 3627.173 | 5011.784 | 1.0 |
| | sigma | 6569.193 | 153.745 | 6279.593 | 6855.407 | 1.0 |
| Simple Regression | beta_[treat] | 1582.104 | 582.999 | 493.442 | 2672.645 | 1.0 |
| | beta_[nodegree] | -454.568 | 492.429 | -1360.717 | 461.760 | 1.0 |
| | beta_[married] | 3926.351 | 519.744 | 2930.813 | 4873.111 | 1.0 |
| | alpha_outcome | 4320.328 | 368.753 | 3628.037 | 4996.010 | 1.0 |
| | sigma | 6568.130 | 149.699 | 6293.193 | 6860.301 | 1.0 |
The model estimates are basically identical. We should expect mirrored propensity score distributions indicating that feedback wasn’t an issue. The outcome model leaned primarily on the covariate profile \(X\) to derive the treatment effect estimate.
compare_propensity_dists(idata_treatment_2s_lalonde, idata_lalonde)

Both estimates of the treatment effect accord well with values in the literature. We can be happy that both models are picking up on the treatment effects reasonably well, despite having a minimalist outcome model. This is because, in the Bayesian setting, conditioning on \(X, T\) is generally sufficient when there are no unmeasured confounders. Here the model largely ignores the propensity score coefficient beta_ps.
NHEFS#
Finally we turn to the NHEFS data as discussed by Hernan in [Hernan, 2024]. This data set is known to have a complex covariate profile for measuring aspects of smokers' health. We might suspect that there is some unmeasured confounding in this data set that would be hard to pick up with simple regression controls.
df = cp.load_data("nhefs")
df[["age", "race", "trt", "smokeintensity", "smokeyrs", "outcome"]].head()
| | age | race | trt | smokeintensity | smokeyrs | outcome |
|---|---|---|---|---|---|---|
| 0 | 42 | 1 | 0 | 30 | 29 | -10.093960 |
| 1 | 36 | 0 | 0 | 20 | 24 | 2.604970 |
| 2 | 56 | 1 | 0 | 20 | 26 | 9.414486 |
| 3 | 68 | 1 | 0 | 3 | 53 | 4.990117 |
| 4 | 40 | 0 | 0 | 20 | 19 | 4.989251 |
coords = {
"betas": ["trt", "age", "race", "sex", "smokeintensity", "smokeyrs", "wt71"],
"betas_trt": [
"age",
"race",
"sex",
"smokeintensity",
"smokeyrs",
"wt71",
"active_1",
"active_2",
"education_2",
"education_3",
"education_4",
"education_5",
"exercise_1",
"exercise_2",
"age^2",
"wt71^2",
"smokeintensity^2",
"smokeyrs^2",
],
"obs": range(df.shape[0]),
}
N = df.shape[0]
X_trt = df[
[
"age",
"race",
"sex",
"smokeintensity",
"smokeyrs",
"wt71",
"active_1",
"active_2",
"education_2",
"education_3",
"education_4",
"education_5",
"exercise_1",
"exercise_2",
"age^2",
"wt71^2",
"smokeintensity^2",
"smokeyrs^2",
]
]
X_trt = (X_trt - X_trt.mean(axis=0)) / X_trt.std(axis=0)
# Note the significantly reduced outcome model specification.
X_outcome = df[["trt", "age", "race", "sex", "smokeintensity", "smokeyrs", "wt71"]]
X_outcome = (X_outcome - X_outcome.mean(axis=0)) / X_outcome.std(axis=0)
T_data = df["trt"].values
X_outcome["trt"] = T_data
Y_data = df["outcome"].values
priors = {
"beta_": [0, 3],
"beta_trt": [0, 1],
"alpha_trt": [0, 1],
"alpha_outcome": [0, 5],
"sigma": 10,
"beta_ps": [0, 10],
}
nhefs_model = make_joint_model(X_trt, X_outcome, T_data, Y_data, coords, priors)
with nhefs_model:
idata_nhefs = pm.sample(**sampler_kwargs)
(
idata_treatment_2s_nhefs,
idata_outcome_2s_nhefs,
treatment_model_nhefs,
outcome_model_nhefs,
) = make_2step_model(
X_trt,
X_outcome,
T_data,
Y_data,
coords,
priors=priors,
spline_component=True,
)
reg_model_nhefs, idata_outcome_simple_reg_nhefs = make_reg_model(
X_outcome, Y_data, coords, priors=priors
)
Show code cell output
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta_trt_std, beta_std, beta_ps, alpha_trt, alpha_outcome, sigma]
Sampling 4 chains for 2_000 tune and 2_000 draw iterations (8_000 + 8_000 draws total) took 56 seconds.
Sampling: [alpha_trt, beta_trt_std, t_pred]
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta_trt_std, alpha_trt]
Sampling 4 chains for 2_000 tune and 2_000 draw iterations (8_000 + 8_000 draws total) took 40 seconds.
Sampling: [alpha_outcome, beta_ps, beta_ps_spline, beta_std, like, sigma]
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta_std, beta_ps, alpha_outcome, beta_ps_spline, sigma]
Sampling 4 chains for 2_000 tune and 2_000 draw iterations (8_000 + 8_000 draws total) took 29 seconds.
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta_std, alpha_outcome, sigma]
/Users/nathanielforde/mambaforge/envs/CausalPy/lib/python3.13/site-packages/pymc/step_methods/hmc/quadpotential.py:316: RuntimeWarning: overflow encountered in dot
return 0.5 * np.dot(x, v_out)
Sampling 4 chains for 2_000 tune and 2_000 draw iterations (8_000 + 8_000 draws total) took 13 seconds.
ax = az.plot_forest(
[idata_outcome_2s_nhefs, idata_nhefs, idata_outcome_simple_reg_nhefs],
var_names=["beta_"],
model_names=["2 Stage", "1 Stage", "Simple Regression"],
combined=True,
figsize=(10, 4),
)
ax[0].axvline(3.4, label="True Treatment Value", color="k")
ax[0].set_title("Comparing Joint and 2 Stage Propensity Score Parameter Fits");

Here we see the model specifications start to come apart slightly, although the effect of feedback seems to primarily influence the impact of age on the outcome rather than the treatment estimate.
Show code cell source
compare_estimate = pd.concat(
{
"1-stage-model": az.summary(
idata_nhefs, var_names=["alpha_trt", "beta_", "alpha_outcome", "beta_ps"]
),
"2-stage-model": az.summary(
idata_outcome_2s_nhefs, var_names=["beta_", "alpha_outcome", "beta_ps"]
),
"Simple Regression": az.summary(
idata_outcome_simple_reg_nhefs, var_names=["beta_", "alpha_outcome"]
),
}
)
compare_estimate[["mean", "sd", "hdi_3%", "hdi_97%", "r_hat"]]
| | | mean | sd | hdi_3% | hdi_97% | r_hat |
|---|---|---|---|---|---|---|
| 1-stage-model | alpha_trt | -1.146 | 0.061 | -1.262 | -1.034 | 1.0 |
| | beta_[trt] | 3.487 | 0.445 | 2.637 | 4.310 | 1.0 |
| | beta_[age] | -1.018 | 0.620 | -2.187 | 0.128 | 1.0 |
| | beta_[race] | -0.323 | 0.273 | -0.822 | 0.195 | 1.0 |
| | beta_[sex] | -1.145 | 0.295 | -1.706 | -0.599 | 1.0 |
| | beta_[smokeintensity] | -0.431 | 0.319 | -1.024 | 0.175 | 1.0 |
| | beta_[smokeyrs] | -0.303 | 0.540 | -1.318 | 0.702 | 1.0 |
| | beta_[wt71] | -1.221 | 0.281 | -1.742 | -0.698 | 1.0 |
| | alpha_outcome | 5.204 | 1.002 | 3.223 | 7.004 | 1.0 |
| | beta_ps | -13.476 | 3.820 | -20.775 | -6.420 | 1.0 |
| 2-stage-model | beta_[trt] | 3.319 | 0.435 | 2.517 | 4.155 | 1.0 |
| | beta_[age] | -1.937 | 0.469 | -2.824 | -1.068 | 1.0 |
| | beta_[race] | -0.101 | 0.257 | -0.575 | 0.392 | 1.0 |
| | beta_[sex] | -0.940 | 0.255 | -1.409 | -0.454 | 1.0 |
| | beta_[smokeintensity] | -0.052 | 0.255 | -0.542 | 0.420 | 1.0 |
| | beta_[smokeyrs] | 0.317 | 0.424 | -0.510 | 1.094 | 1.0 |
| | beta_[wt71] | -1.393 | 0.222 | -1.818 | -0.980 | 1.0 |
| | alpha_outcome | 1.363 | 2.904 | -4.256 | 6.557 | 1.0 |
| | beta_ps | -0.023 | 10.048 | -18.154 | 19.192 | 1.0 |
| Simple Regression | beta_[trt] | 3.270 | 0.435 | 2.450 | 4.087 | 1.0 |
| | beta_[age] | -2.427 | 0.387 | -3.142 | -1.688 | 1.0 |
| | beta_[race] | 0.204 | 0.193 | -0.141 | 0.577 | 1.0 |
| | beta_[sex] | -0.685 | 0.218 | -1.096 | -0.278 | 1.0 |
| | beta_[smokeintensity] | 0.270 | 0.199 | -0.088 | 0.666 | 1.0 |
| | beta_[smokeyrs] | 0.547 | 0.393 | -0.203 | 1.286 | 1.0 |
| | beta_[wt71] | -1.543 | 0.208 | -1.932 | -1.151 | 1.0 |
| | alpha_outcome | 1.795 | 0.216 | 1.371 | 2.184 | 1.0 |
compare_propensity_dists(idata_treatment_2s_nhefs, idata_nhefs)

Two-Stage Outcome Modelling with CausalPy#
Next we show how to achieve these steps with the simpler CausalPy experiment API.
formula = """trt ~ 1 + age + race + sex + smokeintensity + smokeyrs + wt71 + active_1 + active_2 +
education_2 + education_3 + education_4 + education_5 + exercise_1 + exercise_2"""
df_standardised = (df - df.mean(axis=0)) / df.std(axis=0)
df_standardised["trt"] = df["trt"]
df_standardised["outcome"] = df["outcome"]
result = cp.InversePropensityWeighting(
df_standardised,
formula=formula,
outcome_variable="outcome",
weighting_scheme="robust", ## Will be used by plots after estimation if no other scheme is specified.
model=cp.pymc_models.PropensityScore(
sample_kwargs={
"chains": 4,
"tune": 2000,
"draws": 10000,
"target_accept": 0.95,
"random_seed": 18,
"progressbar": False,
"mp_ctx": "spawn",
},
),
)
result
Show code cell output
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta_std]
Sampling 4 chains for 2_000 tune and 10_000 draw iterations (8_000 + 40_000 draws total) took 26 seconds.
Sampling: [beta_std, t_pred]
Sampling: [t_pred]
<causalpy.experiments.inverse_propensity_weighting.InversePropensityWeighting at 0x29d4acd70>
Comparing Inverse Propensity Score Weighting and Covariate Adjustment#
The two-step procedure doesn’t just apply to regression adjustment methods, as we’ve seen here, but can be used to apply inverse weighting techniques too.
result.plot_ate(result.idata);
/Users/nathanielforde/mambaforge/envs/CausalPy/lib/python3.13/site-packages/IPython/core/events.py:82: UserWarning: Creating legend with loc="best" can be slow with large amounts of data.
func(*args, **kwargs)
/Users/nathanielforde/mambaforge/envs/CausalPy/lib/python3.13/site-packages/IPython/core/pylabtools.py:170: UserWarning: Creating legend with loc="best" can be slow with large amounts of data.
fig.canvas.print_figure(bytes_io, **kw)

which can be compared against the two-step regression adjustment here.
idata_outcome_cp, model_outcome_cp = result.model.fit_outcome_model(
result.X_outcome,
result.y,
result.coords,
priors={"b_outcome": [0, 3], "sigma": 10, "beta_ps": [0, 10]},
noncentred=True,
normal_outcome=True,
)
Show code cell output
Sampling: [beta_ps, beta_std, like, sigma]
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta_std, beta_ps, sigma]
Sampling 4 chains for 2_000 tune and 10_000 draw iterations (8_000 + 40_000 draws total) took 47 seconds.
Yielding similar, but not identical results.
az.summary(idata_outcome_cp, var_names=["beta_"])
| | mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
|---|---|---|---|---|---|---|---|---|---|
| beta_[Intercept] | 2.093 | 1.344 | -0.434 | 4.612 | 0.014 | 0.009 | 9380.0 | 14843.0 | 1.0 |
| beta_[age] | -2.210 | 0.721 | -3.543 | -0.835 | 0.007 | 0.004 | 10668.0 | 17142.0 | 1.0 |
| beta_[race] | 0.151 | 0.312 | -0.445 | 0.727 | 0.003 | 0.002 | 12175.0 | 20768.0 | 1.0 |
| beta_[sex] | -0.797 | 0.409 | -1.562 | -0.024 | 0.004 | 0.002 | 11010.0 | 17907.0 | 1.0 |
| beta_[smokeintensity] | 0.178 | 0.337 | -0.469 | 0.797 | 0.003 | 0.002 | 12015.0 | 19490.0 | 1.0 |
| beta_[smokeyrs] | 0.467 | 0.602 | -0.682 | 1.575 | 0.006 | 0.003 | 11849.0 | 19291.0 | 1.0 |
| beta_[wt71] | -1.509 | 0.214 | -1.920 | -1.112 | 0.001 | 0.001 | 33903.0 | 28828.0 | 1.0 |
| beta_[active_1] | -0.539 | 0.214 | -0.930 | -0.122 | 0.001 | 0.001 | 29101.0 | 29625.0 | 1.0 |
| beta_[active_2] | -0.116 | 0.207 | -0.512 | 0.266 | 0.001 | 0.001 | 31383.0 | 29074.0 | 1.0 |
| beta_[education_2] | 0.416 | 0.253 | -0.071 | 0.881 | 0.002 | 0.001 | 25487.0 | 28436.0 | 1.0 |
| beta_[education_3] | 0.398 | 0.275 | -0.137 | 0.904 | 0.002 | 0.001 | 23112.0 | 26560.0 | 1.0 |
| beta_[education_4] | 0.360 | 0.249 | -0.117 | 0.821 | 0.002 | 0.001 | 20161.0 | 26766.0 | 1.0 |
| beta_[education_5] | 0.030 | 0.302 | -0.514 | 0.624 | 0.002 | 0.001 | 15996.0 | 23783.0 | 1.0 |
| beta_[exercise_1] | 0.209 | 0.279 | -0.317 | 0.737 | 0.002 | 0.001 | 20495.0 | 26199.0 | 1.0 |
| beta_[exercise_2] | 0.272 | 0.332 | -0.336 | 0.913 | 0.003 | 0.002 | 14760.0 | 21435.0 | 1.0 |
| beta_[trt] | 3.318 | 0.434 | 2.476 | 4.105 | 0.002 | 0.002 | 40864.0 | 30723.0 | 1.0 |
ax = az.plot_posterior(
idata_outcome_cp,
var_names=["beta_"],
coords={"outcome_coeffs": ["trt"]},
figsize=(10, 5),
ref_val=3.5,
)
ax.set_title(
"Treatment Effect Estimated using 2-stage \n Regression Adjustment Approach with reduced covariates"
);

Conclusion: Modularity as Causal Discipline#
When attempting to estimate treatment effects using Bayesian inference, a natural but risky strategy is to fit a joint model for both the treatment assignment and the outcome. That is, to specify a full model and infer the parameters of both components simultaneously.
However, this joint approach introduces a feedback loop: the outcome \(Y\) can influence the estimation of the treatment mechanism \(P(T | X)\). This violates the original logic of design-based inference, where treatment assignment should be modeled independently of the observed outcomes. This phenomenon is often subtle but can lead to biased treatment effect estimates.
Across several examples, we have shown that fitting a full joint model distorts the treatment effect estimate relative to a two-step (modular) approach.
In other cases, joint and modular approaches yield nearly identical estimates, usually when the outcome is well identified from covariates alone. With these observations in scope, we recommend that practitioners generally follow a two-step or modular approach: either two-stage inverse propensity score weighting, or regression adjustment with the propensity score as an additional covariate. Both methods are now available in CausalPy.
Framed this way, we can see that the joint model violates the temporal precedence of the treatment assignment over the outcome process. The 2-stage Bayesian procedures ensure that the causal ordering encoded in the actual data generating process is respected in the estimation process: the confounding adjustment achieved with the propensity score must occur without access to information about the outcome. A well-specified propensity score model can substantially improve causal estimates (as we've seen), especially when the outcome model is weak or mis-specified. This allows for a form of doubly-robust inference as discussed in [Aronow and Miller, 2019] and [Fan et al., 2023]. Propensity scores do not only serve to reduce dimensionality; they formalize the treatment mechanism and encode information that the outcome model might fail to recover. This explains their continued prominence in modern causal inference and their usefulness in the Bayesian setting.
References#
P Aronow and B Miller. Foundations of Agnostic Statistics. Cambridge University Press, 2019.
Li Fan, Ding Peng, and Mealli Fabrizia. Bayesian causal inference: a critical review. Philosophical Transactions of the Royal Society, 2023.
G. J. Griffith, T. T. Morris, M. J. Tudball, A. Herbert, G. Mancano, L. Pike, G. C. Sharp, J. Sterne, T. M. Palmer, G. Davey Smith, K. Tilling, L. Zuccolo, N. M. Davies, and G. & Hemani. Collider bias undermines our understanding of covid-19 disease risk and severity. Nature communications, 2020.
Miguel A. Hernan. Causal Inference: What If. CRC Press TBC, Boca Raton, 2024.
Watermark#
%load_ext watermark
%watermark -n -u -v -iv -w -p pytensor,xarray
Last updated: Tue Jul 29 2025
Python implementation: CPython
Python version : 3.13.5
IPython version : 9.4.0
pytensor: 2.31.7
xarray : 2025.7.0
matplotlib: 3.10.3
arviz : 0.21.0
pandas : 2.3.1
causalpy : 0.4.2
patsy : 1.0.1
pymc : 5.23.0
numpy : 2.3.1
Watermark: 2.5.0