import arviz as az
import numpy as np
import pandas as pd
import pymc as pm
from matplotlib import pyplot as plt
from patsy import dmatrix

import causalpy as cp

## Settings for macOS multiprocess spawn defaults on M1 chips
sampler_kwargs = {
    "tune": 2000,
    "draws": 2000,
    "target_accept": 0.95,
    "mp_ctx": "spawn",
    "random_seed": 1040,
}

The Paradox of Propensity Scores in Bayesian Inference#

In causal inference the propensity score often plays a central role. For instance, we’ve seen how the propensity score can be used with the cp.InversePropensityWeighting class to correct for a species of selection bias by re-weighting our outcome variable and calculating a causal contrast on the re-weighted scale. Additionally, we can use the propensity score to visualise and diagnose problems of overlap or covariate balance across treatment and control groups.

These properties give the propensity score a large role to play in design-based approaches to causal inference. The focus there is on assessing aspects of the treatment allocation to ensure we have identifiability assurances for the estimands of interest. What, then, is its role in model-based or analysis-focused Bayesian methods?

When we use cp.InversePropensityWeighting to apply various re-weighting techniques we perform a two-step manoeuvre: (1) we estimate the propensity score, and (2) we apply the inverse weighting of the score to transform our outcome variable and assess causal contrasts. But being good Bayesians, we might wonder why we go to all this trouble. Can we not simply estimate a full Bayesian model of treatment and outcome simultaneously?

In this notebook we’ll show why we should be careful when attempting to model the joint distribution of the propensity score and the outcome variable, while still making good use of the propensity score.

Brief Digression on the Mathematics#

Consider the joint distribution over three variables:

\[ P(Y, T, X) \]

where \(Y\) is our outcome variable, \(T\) is our treatment variable, and \(X\) stands in for all other control variables in scope. Now define the propensity score

\[ e(X) = P(T | X)\]

and our outcome model

\[\begin{split}\begin{align*} P(Y \mid T, X) &= \frac{P(Y, T, X)}{P(T, X)} \\ &= \frac{P(Y \mid T, X) \cdot P(T \mid X) \cdot P(X)}{P(T \mid X) \cdot P(X)} \\ &= \frac{P(Y \mid T, X) \cdot e(X) \cdot P(X)}{e(X) \cdot P(X)} \end{align*} \end{split}\]

Now it is clear how the propensity score simply cancels out: when we already condition on \(X, T\), the information in the propensity score is technically redundant in the Bayesian setting. Add the assumption of unconfoundedness, or ignorability, used in causal inference, i.e. that there is no unmeasured confounding, and conditioning on \(X, T\) is sufficient to identify the causal contrast of interest.
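To make the cancellation concrete, here is a minimal numeric check (a sketch: the discrete joint distribution over binary \(Y, T, X\) is generated at random, purely for illustration):

rng = np.random.default_rng(0)
joint = rng.dirichlet(np.ones(8)).reshape(2, 2, 2)  # arbitrary P(Y, T, X)

p_tx = joint.sum(axis=0)         # P(T, X)
p_y_given_tx = joint / p_tx      # P(Y | T, X) computed directly
p_x = joint.sum(axis=(0, 1))     # P(X)
e_x = p_tx / p_x                 # e(X) = P(T | X)

# The factorised form: e(X) appears in numerator and denominator and cancels
reconstructed = (p_y_given_tx * e_x * p_x) / (e_x * p_x)
assert np.allclose(p_y_given_tx, reconstructed)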

Structure of the Presentation#

We will first consider a simple simulated data set where we know the true values of the data generating process, and demonstrate how to fit the propensity score and the outcome simultaneously in a single joint model using PyMC. To contrast this we’ll show how to fit a two-stage version of the same model. In both models we assume independent priors between the regression for the treatment and the regression for the outcome variable. Nevertheless, we will show that the joint model exhibits a bias due to feedback when there is unmeasured confounding, i.e. when the outcome model is misspecified.

  • Model Specification

    • Specifying the Joint Model

    • Specifying the 2 Stage Model

  • Application to Simulated Outcome

  • Application to Mosquito Net Data

  • Application to Lalonde Data

  • Application to NHEFS Data

Note the presentation here owes a debt to the work of Fan Li in [Fan et al., 2023] and her presentation here. Additionally we drew on the work and data of Andrew Heiss here and Jordan Nafa and Andrew Heiss here.

Generate Some Data#

N = 4000
np.random.seed(1043)


def inv_logit(z):
    """Compute the inverse logit (sigmoid) of z."""
    return 1 / (1 + np.exp(-z))


df1 = pd.DataFrame(
    {
        "x1": np.random.normal(0, 1, N),
        "x2": np.random.normal(0, 1, N),
        "x3": np.random.normal(0, 1, N),
    }
)

TREATMENT_EFFECT = 2
df1["trt"] = np.random.binomial(1, inv_logit(df1["x1"] + 6 * df1["x2"] + 7 * df1["x3"]))
df1["outcome"] = (
    1
    + TREATMENT_EFFECT * df1["trt"]
    + df1["x1"]
    + 0.2 * df1["x2"]
    + -3 * df1["x3"]
    + np.random.normal(0, 1, N)
)


df1.head()
x1 x2 x3 trt outcome
0 0.570753 -1.372992 0.247824 0 1.114661
1 0.063973 0.904728 0.322222 1 2.952142
2 -0.356653 -0.188922 -1.369041 0 5.809954
3 -1.372285 -0.611158 -1.755242 0 5.509733
4 0.029955 1.773686 -0.455371 1 6.058045

Specifying the Joint Model#

Now we define a model context that fits our data simultaneously for treatment and outcome. The propensity score estimated in the treatment model enters the outcome regression directly as an additional covariate. Note that we allow a richer model specification for the treatment than for the outcome. This is to push the model to use the propensity score information to adjust for the confounders omitted from the outcome model.

coords = {
    "betas": ["trt", "x1"],
    "betas_trt": ["x1", "x2", "x3"],
    "obs": range(df1.shape[0]),
}

N = df1.shape[0]
X_trt = df1[["x1", "x2", "x3"]].values
X_outcome = df1[["trt", "x1"]].values
T_data = df1["trt"].values
Y_data = df1["outcome"].values


def make_joint_model(
    X_trt,
    X_outcome,
    T_data,
    Y_data,
    coords,
    priors={
        "beta_": [0, 1],
        "beta_trt": [0, 1],
        "alpha_trt": [0, 1],
        "alpha_outcome": [0, 1],
        "sigma": 1,
        "beta_ps": [0, 1],
    },
    noncentred=True,
    normal_outcome=True,
):
    with pm.Model(coords=coords) as model:
        X_data_trt = pm.Data("X", X_trt, dims=("obs", "betas_trt"))
        X_data_outcome = pm.Data("X_outcome", X_outcome, dims=("obs", "betas"))
        T_data_ = pm.Data("T", T_data, dims="obs")
        Y_data_ = pm.Data("Y", Y_data, dims="obs")

        if noncentred:
            mu_beta_trt, sigma_beta_trt = priors["beta_trt"]
            beta_trt_std = pm.Normal("beta_trt_std", 0, 1, dims="betas_trt")
            beta_trt = pm.Deterministic(
                "beta_trt_",
                mu_beta_trt + sigma_beta_trt * beta_trt_std,
                dims="betas_trt",
            )

            mu_beta, sigma_beta = priors["beta_"]
            beta_std = pm.Normal("beta_std", 0, 1, dims="betas")
            beta = pm.Deterministic(
                "beta_", mu_beta + sigma_beta * beta_std, dims="betas"
            )
        else:
            beta_trt = pm.Normal(
                "beta_trt_",
                priors["beta_trt"][0],
                priors["beta_trt"][1],
                dims="betas_trt",
            )

            beta = pm.Normal(
                "beta_", priors["beta_"][0], priors["beta_"][1], dims="betas"
            )

        beta_ps = pm.Normal("beta_ps", priors["beta_ps"][0], priors["beta_ps"][1])

        alpha_trt = pm.Normal(
            "alpha_trt", priors["alpha_trt"][0], priors["alpha_trt"][1]
        )
        mu_trt = alpha_trt + pm.math.dot(X_data_trt, beta_trt)
        p = pm.Deterministic("p", pm.math.invlogit(mu_trt), dims="obs")

        pm.Bernoulli("t_pred", p=p, observed=T_data_, dims="obs")

        alpha_outcome = pm.Normal(
            "alpha_outcome", priors["alpha_outcome"][0], priors["alpha_outcome"][1]
        )
        mu_outcome = alpha_outcome + pm.math.dot(X_data_outcome, beta) + beta_ps * p
        sigma = pm.HalfNormal("sigma", priors["sigma"])

        if normal_outcome:
            _ = pm.Normal("like", mu_outcome, sigma, observed=Y_data_, dims="obs")
        else:
            nu = pm.Exponential("nu", lam=1 / 10)
            _ = pm.StudentT(
                "like", nu=nu, mu=mu_outcome, sigma=sigma, observed=Y_data_, dims="obs"
            )

    return model


model = make_joint_model(X_trt, X_outcome, T_data, Y_data, coords)

pm.model_to_graphviz(model)
../_images/d086873eaa5cc5cfa413a14a4af222b81ead62efe520ecc4da10b9d537eb9620.svg

Note how the two likelihood terms are fit simultaneously.

Note

We are specifying the models in raw PyMC code for clarity, but we will bundle these methods into the neater CausalPy API below.

Specifying the 2 Stage Model#

Here we allow for a function that takes the same inputs but fits two separate models. First we fit the treatment model and store idata_treatment; this InferenceData object holds the posterior estimates of the propensity score. We then pass it through to a second outcome model, where we take a random draw from the posterior and feed it into the outcome regression, optionally via a spline component. This allows us to express any non-linearity in the relationship between the propensity score and the outcome. Additionally, it can be seen as a way to augment the outcome model.

While in theory the propensity score contains no extra information if we are already conditioning on \(X\), in practice the literature reports that the propensity score improves the stability of the causal estimates achievable in Bayesian causal modelling. Additionally, we might want to use separate covariate sets for predicting the outcome and the treatment. In that case, there may be extra information derived in the treatment model that can be used to inform the outcome model.

def make_treatment_model(
    X_trt,
    T_data,
    coords,
    priors={
        "beta_": [0, 1],
        "beta_trt": [0, 1],
        "alpha_trt": [0, 1],
        "alpha_outcome": [0, 1],
    },
    noncentred=True,
):
    with pm.Model(coords=coords) as model_trt:
        X_data_trt = pm.Data("X", X_trt, dims=("obs", "betas_trt"))
        T_data_ = pm.Data("T", T_data, dims="obs")

        if noncentred:
            mu_beta_trt, sigma_beta_trt = priors["beta_trt"]
            beta_trt_std = pm.Normal("beta_trt_std", 0, 1, dims="betas_trt")
            beta_trt = pm.Deterministic(
                "beta_trt_",
                mu_beta_trt + sigma_beta_trt * beta_trt_std,
                dims="betas_trt",
            )

        else:
            beta_trt = pm.Normal(
                "beta_trt_",
                priors["beta_trt"][0],
                priors["beta_trt"][1],
                dims="betas_trt",
            )

        alpha_trt = pm.Normal(
            "alpha_trt", priors["alpha_trt"][0], priors["alpha_trt"][1]
        )
        mu_trt = alpha_trt + pm.math.dot(X_data_trt, beta_trt)
        p = pm.Deterministic("p", pm.math.invlogit(mu_trt), dims="obs")

        pm.Bernoulli("t_pred", p=p, observed=T_data_, dims="obs")
    return model_trt


def make_outcome_model(
    X_outcome,
    Y_data,
    coords,
    priors={
        "beta_": [0, 1],
        "beta_trt": [0, 1],
        "alpha_trt": [0, 1],
        "alpha_outcome": [0, 1],
        "sigma": 1,
        "beta_ps": [0, 1],
    },
    noncentred=True,
    spline_component=False,
    propensity_score_idata=None,
    normal_outcome=True,
    winsorize_boundary=0.0,
):
    propensity_scores = az.extract(propensity_score_idata)["p"]
    with pm.Model(coords=coords) as model_outcome:
        X_data_outcome = pm.Data("X_outcome", X_outcome, dims=("obs", "betas"))
        Y_data_ = pm.Data("Y", Y_data, dims="obs")

        if noncentred:
            mu_beta, sigma_beta = priors["beta_"]
            beta_std = pm.Normal("beta_std", 0, 1, dims="betas")
            beta = pm.Deterministic(
                "beta_", mu_beta + sigma_beta * beta_std, dims="betas"
            )
        else:
            beta = pm.Normal(
                "beta_", priors["beta_"][0], priors["beta_"][1], dims="betas"
            )

        beta_ps = pm.Normal("beta_ps", priors["beta_ps"][0], priors["beta_ps"][1])

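        # Take a single random draw from the treatment-model posterior of the
        # propensity score and (optionally) winsorize it away from 0 and 1.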
        chosen = np.random.choice(range(propensity_scores.shape[1]))
        p = propensity_scores[:, chosen].values
        p = np.clip(p, winsorize_boundary, 1 - winsorize_boundary)

        alpha_outcome = pm.Normal(
            "alpha_outcome", priors["alpha_outcome"][0], priors["alpha_outcome"][1]
        )
        mu_outcome = alpha_outcome + pm.math.dot(X_data_outcome, beta) + beta_ps * p

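        # Optionally swap the linear propensity term for a flexible B-spline
        # basis in the propensity score: 10 interior knots + cubic degree +
        # intercept gives the 14 basis columns matching size=14 below.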
        if spline_component:
            beta_ps_spline = pm.Normal(
                "beta_ps_spline", priors["beta_ps"][0], priors["beta_ps"][1], size=14
            )
            B = dmatrix(
                "bs(ps, knots=knots, degree=3, include_intercept=True, lower_bound=0, upper_bound=1) - 1",
                {"ps": p, "knots": np.linspace(0, 1, 10)},
            )
            B_f = np.asarray(B, order="F")
            splines_summed = pm.Deterministic(
                "spline_features", pm.math.dot(B_f, beta_ps_spline.T), dims="obs"
            )
            mu_outcome = (
                alpha_outcome + pm.math.dot(X_data_outcome, beta) + splines_summed
            )

        sigma = pm.HalfNormal("sigma", priors["sigma"])

        if normal_outcome:
            _ = pm.Normal("like", mu_outcome, sigma, observed=Y_data_, dims="obs")
        else:
            nu = pm.Exponential("nu", lam=1 / 10)
            _ = pm.StudentT(
                "like", nu=nu, mu=mu_outcome, sigma=sigma, observed=Y_data_, dims="obs"
            )

    return model_outcome


def make_2step_model(
    X_trt,
    X_outcome,
    T_data,
    Y_data,
    coords,
    priors,
    spline_component=False,
    normal_outcome=True,
    winsorize_boundary=0.0,
):
    treatment_model = make_treatment_model(X_trt, T_data, coords, priors)
    with treatment_model:
        idata_treatment = pm.sample_prior_predictive()
        idata_treatment.extend(pm.sample(**sampler_kwargs))

    outcome_model = make_outcome_model(
        X_outcome,
        Y_data,
        coords,
        priors,
        spline_component=spline_component,
        propensity_score_idata=idata_treatment,
        normal_outcome=normal_outcome,
        winsorize_boundary=winsorize_boundary,
    )
    with outcome_model:
        idata_outcome = pm.sample_prior_predictive()
        idata_outcome.extend(pm.sample(**sampler_kwargs))

    return idata_treatment, idata_outcome, treatment_model, outcome_model


model_treatment = make_treatment_model(X_trt, T_data, coords)

pm.model_to_graphviz(model_treatment)
../_images/885a377b84692246a09a2a30e47663602ccbb6221af5e9016fb6cf3951effc9f.svg

Specifying a Simple Regression Model without Propensity Scores#

Now we specify a simple regression model which does not make use of the propensity score information. This model will be used to assess how much extra information is gleaned from including the propensity score as a covariate in the outcome model.

def make_reg_model(
    X_outcome,
    Y_data,
    coords,
    priors={
        "beta_": [0, 1],
        "alpha_outcome": [0, 1],
        "sigma": 1,
    },
    noncentred=True,
):
    with pm.Model(coords=coords) as reg_model:
        X_data_outcome = pm.Data("X_outcome", X_outcome, dims=("obs", "betas"))
        Y_data_ = pm.Data("Y", Y_data, dims="obs")

        if noncentred:
            mu_beta, sigma_beta = priors["beta_"]
            beta_std = pm.Normal("beta_std", 0, 1, dims="betas")
            beta = pm.Deterministic(
                "beta_", mu_beta + sigma_beta * beta_std, dims="betas"
            )
        else:
            beta = pm.Normal(
                "beta_", priors["beta_"][0], priors["beta_"][1], dims="betas"
            )
        alpha_outcome = pm.Normal(
            "alpha_outcome", priors["alpha_outcome"][0], priors["alpha_outcome"][1]
        )
        mu_outcome = alpha_outcome + pm.math.dot(X_data_outcome, beta)
        sigma = pm.HalfNormal("sigma", priors["sigma"])

        _ = pm.Normal("like", mu_outcome, sigma, observed=Y_data_, dims="obs")

        idata = pm.sample(**sampler_kwargs)

    return reg_model, idata

Putting it all Together#

We are now in a position to fit both the joint and the modular (2-stage) models to our simulated data. We are seeking to assess how the different approaches to incorporating propensity score information impact the accuracy of the treatment effect estimate.

Note

We are using this snippet of code to pass the posterior distribution of the propensity score through to the outcome model, by selecting a random draw from the treatment model’s posterior and supplying it to the outcome regression as data.

chosen = np.random.choice(range(propensity_scores.shape[1]))
p = propensity_scores[:, chosen].values

This allows us to modularise the fitting process while retaining the useful information stored in the propensity score.
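Note that this selects one posterior draw per outcome-model fit, so the propensity uncertainty enters through that one draw. A hypothetical alternative (not used in this notebook) is to plug in the posterior mean of the propensity score, which is more stable but discards the draw-to-draw uncertainty. Assuming idata_treatment holds the fitted treatment-model posterior, as in make_2step_model above:

# Sketch: plug in the posterior-mean propensity score instead of one draw
p_mean = az.extract(idata_treatment)["p"].mean(dim="sample").values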

priors = {
    "beta_": [0, 3],
    "beta_trt": [0, 1],
    "alpha_trt": [0, 1],
    "alpha_outcome": [0, 1],
    "sigma": 1,
    "beta_ps": [0, 1],
}

joint_model = make_joint_model(X_trt, X_outcome, T_data, Y_data, coords, priors=priors)

with joint_model:
    idata_joint = pm.sample(**sampler_kwargs)

(
    idata_treatment_2s_joint,
    idata_outcome_2s_joint,
    treatment_model_joint,
    outcome_model_joint,
) = make_2step_model(
    X_trt, X_outcome, T_data, Y_data, coords, priors, spline_component=False
)

(
    idata_treatment_2s_joint_spline,
    idata_outcome_2s_joint_spline,
    treatment_model_joint_spline,
    outcome_model_joint_spline,
) = make_2step_model(
    X_trt, X_outcome, T_data, Y_data, coords, priors, spline_component=True
)

reg_model, idata_outcome_simple_reg = make_reg_model(X_outcome, Y_data, coords)
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta_trt_std, beta_std, beta_ps, alpha_trt, alpha_outcome, sigma]
/Users/nathanielforde/mambaforge/envs/CausalPy/lib/python3.13/site-packages/pymc/step_methods/hmc/quadpotential.py:316: RuntimeWarning: overflow encountered in dot
  return 0.5 * np.dot(x, v_out)

Sampling 4 chains for 2_000 tune and 2_000 draw iterations (8_000 + 8_000 draws total) took 80 seconds.
Sampling: [alpha_trt, beta_trt_std, t_pred]
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta_trt_std, alpha_trt]

Sampling 4 chains for 2_000 tune and 2_000 draw iterations (8_000 + 8_000 draws total) took 20 seconds.
Sampling: [alpha_outcome, beta_ps, beta_std, like, sigma]
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta_std, beta_ps, alpha_outcome, sigma]
/Users/nathanielforde/mambaforge/envs/CausalPy/lib/python3.13/site-packages/pymc/step_methods/hmc/quadpotential.py:316: RuntimeWarning: overflow encountered in dot
  return 0.5 * np.dot(x, v_out)

Sampling 4 chains for 2_000 tune and 2_000 draw iterations (8_000 + 8_000 draws total) took 16 seconds.
Sampling: [alpha_trt, beta_trt_std, t_pred]
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta_trt_std, alpha_trt]

Sampling 4 chains for 2_000 tune and 2_000 draw iterations (8_000 + 8_000 draws total) took 18 seconds.
Sampling: [alpha_outcome, beta_ps, beta_ps_spline, beta_std, like, sigma]
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta_std, beta_ps, alpha_outcome, beta_ps_spline, sigma]

Sampling 4 chains for 2_000 tune and 2_000 draw iterations (8_000 + 8_000 draws total) took 37 seconds.
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta_std, alpha_outcome, sigma]
/Users/nathanielforde/mambaforge/envs/CausalPy/lib/python3.13/site-packages/pymc/step_methods/hmc/quadpotential.py:316: RuntimeWarning: overflow encountered in dot
  return 0.5 * np.dot(x, v_out)

Sampling 4 chains for 2_000 tune and 2_000 draw iterations (8_000 + 8_000 draws total) took 12 seconds.
ax = az.plot_forest(
    [
        idata_outcome_2s_joint,
        idata_outcome_2s_joint_spline,
        idata_joint,
        idata_outcome_simple_reg,
    ],
    var_names=["beta_"],
    model_names=["2 Stage", "2 Stage + Spline", "1 Stage", "Simple Regression"],
    combined=True,
    figsize=(10, 5),
)

ax[0].axvline(2, label="True Treatment Value", color="k")
ax[0].set_title("Comparing Joint and 2 Stage Propensity Score Parameter Fits");
../_images/338f8bae146b188dade704e9d2a4018557c33c8b37ac7e8c79ab8e21c9a6c7b4.png
compare_estimate = pd.concat(
    {
        "1-stage-model": az.summary(
            idata_joint, var_names=["alpha_trt", "beta_", "beta_ps", "alpha_outcome"]
        ),
        "2-stage-model": az.summary(
            idata_outcome_2s_joint, var_names=["beta_", "alpha_outcome", "beta_ps"]
        ),
        "2-stage-model_spline": az.summary(
            idata_outcome_2s_joint_spline,
            var_names=["beta_", "alpha_outcome", "beta_ps"],
        ),
        "Simple Regression": az.summary(
            idata_outcome_simple_reg,
            var_names=["beta_", "alpha_outcome"],
        ),
    }
)
compare_estimate[["mean", "sd", "hdi_3%", "hdi_97%", "r_hat"]]
mean sd hdi_3% hdi_97% r_hat
1-stage-model alpha_trt 0.000 0.021 -0.040 0.041 1.0
beta_[trt] 2.314 0.057 2.206 2.419 1.0
beta_[x1] 1.296 0.070 1.167 1.428 1.0
beta_ps -15.297 0.294 -15.840 -14.738 1.0
alpha_outcome 8.507 0.161 8.216 8.821 1.0
2-stage-model beta_[trt] 2.058 0.189 1.696 2.398 1.0
beta_[x1] 1.155 0.041 1.079 1.235 1.0
alpha_outcome 3.067 0.060 2.953 3.176 1.0
beta_ps -4.147 0.210 -4.541 -3.759 1.0
2-stage-model_spline beta_[trt] 1.824 0.179 1.490 2.161 1.0
beta_[x1] 1.162 0.040 1.087 1.236 1.0
alpha_outcome 1.007 0.293 0.441 1.544 1.0
beta_ps -0.009 0.983 -1.797 1.880 1.0
Simple Regression beta_[trt] -1.308 0.085 -1.464 -1.144 1.0
beta_[x1] 1.107 0.043 1.029 1.190 1.0
alpha_outcome 2.665 0.060 2.549 2.773 1.0

Here the models fail to recover substantially similar and correct results; in particular, the simple regression model is wildly off. However, the 2-stage models perform better than the joint model specification. This is interesting and demonstrates a key property of propensity scores in the Bayesian setting: propensity scores are useful correctives within a regression context, but we need to be careful about how the model is specified.

The Problem of Feedback#

The issue here is sometimes called Bayesian feedback or “collider bias via the likelihood” [Griffith et al., 2020], and it is a key issue when building joint models for causal inference in the Bayesian paradigm. Because we fit the outcome and treatment models simultaneously, the outcome can influence the posterior distribution of the parameters \(\beta\) in the treatment model, which violates the idea of design-before-analysis. We have here an apparent example of a slight bias due to this effect. The two-stage modular approach better recovers the true treatment effect and avoids the risk of collider bias: in the modular implementation we are able to use the propensity score to adjust for accuracy and compensate for the missing variables x2 and x3.

💡 Key Take-away: With an underspecified outcome model, we may use a well-specified propensity score to adjust the model and retrieve accurate treatment effect estimates. However, this tends to break down if we have estimated both propensity score and outcome in a joint Bayesian model, due to feedback effects. The solution is to use the propensity score in a two-stage fashion.

pm.model_to_graphviz(outcome_model_joint_spline)
../_images/538b8cefba3705ca7c01e6210df7ae80ac9a125c8bda35f6a8346190a5ae7497.svg

Propensity Score Quantiles - Joint Model#

We can see how the different model specifications have yielded distinct propensity score estimates: the joint specification seems to compensate for the covariates missing from the outcome model by adjusting the propensity score latent in the treatment model too. We can compare the quantiles to highlight the numeric difference in the propensity score distributions between the two models.

idata_joint["posterior"]["p"].quantile(
    [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
).round(3).values
array([0.224, 0.308, 0.38 , 0.442, 0.502, 0.563, 0.624, 0.691, 0.769])

Propensity Score Quantiles - 2 Stage Modular Model#

The range and position of these quantiles is vastly different from the joint model specification.

idata_treatment_2s_joint["posterior"]["p"].quantile(
    [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
).round(3).values
array([0.   , 0.001, 0.014, 0.11 , 0.506, 0.886, 0.986, 0.999, 1.   ])

Comparing Propensity Score Skew#

We have seen how the treatment effects reported by the two models differ and that the joint model exhibits a bias away from the true treatment effect. But we might also want to see how this bias manifests in the propensity score distribution.

def compare_propensity_dists(idata_2s, idata_1s):
    fig, ax = plt.subplots(figsize=(10, 4))
    for i in range(100):
        s2 = idata_2s["posterior"].stack(z=("chain", "draw"))["p"][:, i]
        s1 = idata_1s["posterior"].stack(z=("chain", "draw"))["p"][:, i]
        if i == 0:
            ax.hist(
                s2,
                alpha=0.1,
                color="orange",
                label="Propensity Scores 2-stage Estimation",
            )
            # Pivoted to compare shape
            ax.hist(
                1 - s1,
                alpha=0.1,
                color="blue",
                label="Propensity Scores 1-stage Estimation",
            )
        else:
            ax.hist(s2, alpha=0.01, color="orange")
            # Pivoted to compare shape
            ax.hist(1 - s1, alpha=0.01, color="blue")
    ax.legend()
    ax.set_title(
        "Comparing Propensity Distributions \n Two-stage and Joint Estimation routines"
    )


compare_propensity_dists(idata_treatment_2s_joint, idata_joint)
../_images/5d85c9d83bd0687c0274b4b4626e9a353c6f81573a2efe4d249594e5bc9253d9.png

These are radically different models due to the feedback mechanism. How does this phenomenon play out in some real-world examples?

spline_features = az.extract(idata_outcome_2s_joint_spline)["spline_features"]

propensity_scores = az.extract(idata_treatment_2s_joint_spline)["p"]

fig, ax = plt.subplots(figsize=(10, 5))
for i in range(100):
    temp = pd.DataFrame(
        {"prop_score": propensity_scores[:, i], "spline": spline_features[:, i]}
    )
    temp.sort_values("prop_score", inplace=True)
    ax.plot(temp["prop_score"], temp["spline"], alpha=0.01, color="blue")

temp = pd.DataFrame(
    {
        "prop_score": propensity_scores.mean(axis=1),
        "spline": spline_features.mean(axis=1),
    }
)
temp.sort_values("prop_score", inplace=True)
ax.plot(temp["prop_score"], temp["spline"], color="k", label="Expected Value")
ax.set_title("Additive Spline Effect on the Propensity Score")
ax.legend()
ax.set_ylabel("Spline Feature Contribution")
ax.set_xlabel("Propensity Score");
../_images/dd6ba1fd971b277108abf544051687ffbaa5391c2630d7aeca557e1960f13294.png

Note

The idea of allowing a spline component to contribute to mu_outcome, the expectation of our model, is that in some cases there might be a non-linear relationship between the propensity for treatment and the outcome of interest.

\[ \mu = \alpha + X\beta + f(p) \]

Incorporating the spline function of p grants our model this kind of flexibility, allowing the response to differ across levels of the propensity score.
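As a quick check of the basis construction assumed above, we can evaluate the same dmatrix call on a few made-up propensity scores (a sketch; example_ps is hypothetical):

example_ps = np.linspace(0.01, 0.99, 5)  # hypothetical propensity scores
B_example = dmatrix(
    "bs(ps, knots=knots, degree=3, include_intercept=True, lower_bound=0, upper_bound=1) - 1",
    {"ps": example_ps, "knots": np.linspace(0, 1, 10)},
)
np.asarray(B_example).shape  # (5, 14): 10 interior knots + cubic degree + intercept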

Causal Estimate with Do-Operator#

We can also confirm the model implications through counterfactual imputation. This ties the Bayesian setting back to the potential outcomes framework. The fundamental problem of causal inference, seen as a missing data problem, allows us to derive causal estimands through imputation of the potential outcomes. Here we “push forward” the posterior predictive distribution for \(Y\) under different treatment settings.

X_outcome_trt = X_outcome.copy()
X_outcome_trt[:, 0] = 1

X_outcome_ntrt = X_outcome.copy()
X_outcome_ntrt[:, 0] = 0

First we specify our counterfactual input data. Then we push it through the joint model using the do-operator in PyMC to sample from the posterior predictive distribution, giving us samples of the potential outcomes \(Y(1), Y(0)\).

with pm.do(
    joint_model,
    {"T": np.ones(len(df1), dtype=np.int32), "X_outcome": X_outcome_trt},
    prune_vars=True,
) as treatment_model:
    idata_trt = pm.sample_posterior_predictive(idata_joint, var_names=["like", "p"])

with pm.do(
    joint_model,
    {"T": np.zeros(len(df1), dtype=np.int32), "X_outcome": X_outcome_ntrt},
    prune_vars=True,
) as ntreatment_model:
    idata_ntrt = pm.sample_posterior_predictive(idata_joint, var_names=["like", "p"])
Sampling: [like]

Sampling: [like]

For the sake of illustration we calculate the mean value of \(Y(1)\)

idata_trt["posterior_predictive"]["like"].mean().item()
3.1565468847153166

The mean value for \(Y(0)\)

idata_ntrt["posterior_predictive"]["like"].mean().item()
0.8421930047703092

and their difference, which is the causal estimand of interest.

(
    idata_trt["posterior_predictive"]["like"].mean().item()
    - idata_ntrt["posterior_predictive"]["like"].mean().item()
)
2.314353879945007

In this way we can impute potential outcomes and gain insight into any variety of causal estimands we care to calculate. In this case, we have re-derived the insight that the joint model yields a biased treatment effect estimate.
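Rather than differencing posterior means, we can also form the full posterior distribution of the average treatment effect from the same draws (a sketch reusing the objects above):

ate_draws = (
    idata_trt["posterior_predictive"]["like"].mean(dim="obs")
    - idata_ntrt["posterior_predictive"]["like"].mean(dim="obs")
)
az.plot_posterior(ate_draws, ref_val=TREATMENT_EFFECT);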

Nets Example#

Next we’ll assess a data set used by Andrew Heiss to demonstrate propensity score methods with brms.

nets_df = cp.load_data("nets")
nets_df["trt"] = nets_df["net_num"]
nets_df["outcome"] = nets_df["malaria_risk"]
nets_df.head()
id net net_num malaria_risk income health household eligible temperature resistance trt outcome
0 1 True 1 33 781 56 2 False 21.1 59 1 33
1 2 False 0 42 974 57 4 False 26.5 73 0 42
2 3 False 0 80 502 15 3 False 25.6 65 0 80
3 4 True 1 34 671 20 5 True 21.3 46 1 34
4 5 False 0 44 728 17 5 False 19.2 54 0 44
coords = {
    "betas": ["trt", "income"],
    "betas_trt": ["income", "temperature", "health"],
    "obs": range(nets_df.shape[0]),
}

# Process and Standardise Inputs
N = nets_df.shape[0]
X_trt = nets_df[["income", "temperature", "health"]].values
X_trt = (X_trt - X_trt.mean(axis=0)) / X_trt.std(axis=0)
X_outcome = nets_df[["trt", "income"]].values
X_outcome = (X_outcome - X_outcome.mean(axis=0)) / X_outcome.std(axis=0)
T_data = nets_df["trt"].values
X_outcome[:, 0] = T_data
Y_data = nets_df["outcome"].values

priors = {
    "beta_": [0, 1],
    "beta_trt": [0, 1],
    "alpha_trt": [0, 1],
    "alpha_outcome": [40, 30],
    "sigma": 15,
    "beta_ps": [0, 1],
}
net_model = make_joint_model(
    X_trt, X_outcome, T_data, Y_data, coords, priors=priors, normal_outcome=False
)

with net_model:
    idata_net = pm.sample(tune=2000, target_accept=0.98)

idata_treatment_2s_net, idata_outcome_2s_net, treatment_model_net, outcome_model_net = (
    make_2step_model(
        X_trt, X_outcome, T_data, Y_data, coords, priors=priors, normal_outcome=False
    )
)

reg_model_nets, idata_outcome_simple_reg_nets = make_reg_model(
    X_outcome, Y_data, coords
)
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta_trt_std, beta_std, beta_ps, alpha_trt, alpha_outcome, sigma, nu]

Sampling 4 chains for 2_000 tune and 1_000 draw iterations (8_000 + 4_000 draws total) took 17 seconds.
Sampling: [alpha_trt, beta_trt_std, t_pred]
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta_trt_std, alpha_trt]

Sampling 4 chains for 2_000 tune and 2_000 draw iterations (8_000 + 8_000 draws total) took 14 seconds.
Sampling: [alpha_outcome, beta_ps, beta_std, like, nu, sigma]
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta_std, beta_ps, alpha_outcome, sigma, nu]

Sampling 4 chains for 2_000 tune and 2_000 draw iterations (8_000 + 8_000 draws total) took 15 seconds.
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta_std, alpha_outcome, sigma]
/Users/nathanielforde/mambaforge/envs/CausalPy/lib/python3.13/site-packages/pymc/step_methods/hmc/quadpotential.py:316: RuntimeWarning: overflow encountered in dot
  return 0.5 * np.dot(x, v_out)

Sampling 4 chains for 2_000 tune and 2_000 draw iterations (8_000 + 8_000 draws total) took 11 seconds.
ax = az.plot_forest(
    [idata_outcome_2s_net, idata_net, idata_outcome_simple_reg_nets],
    var_names=["beta_"],
    model_names=["2 Stage", "1 Stage", "Simple Regression"],
    combined=True,
    figsize=(10, 4),
)

ax[0].axvline(-10, label="True Treatment Value", color="k")
ax[0].set_title("Comparing Joint and 2 Stage Propensity Score Parameter Fits");
../_images/37d6cf0d99defcc3440d123d2e1367fd2969c10617a5d5c4ffc82eff496b44ce.png

As before, we have used both model specifications to derive estimates of the treatment effect. In this case we have allowed the outcome model access to only a single income predictor. While both models approximately recover the reported -10 treatment estimate, with a large degree of uncertainty, the modular 2-stage estimate pulls away from the joint model estimate.

compare_estimate = pd.concat(
    {
        "1-stage-model": az.summary(
            idata_net,
            var_names=["alpha_trt", "beta_", "beta_ps", "alpha_outcome"],
        ),
        "2-stage-model": az.summary(
            idata_outcome_2s_net, var_names=["beta_", "alpha_outcome", "beta_ps"]
        ),
        "Simple Regression": az.summary(
            idata_outcome_simple_reg_nets, var_names=["beta_", "alpha_outcome"]
        ),
    }
)
compare_estimate[["mean", "sd", "hdi_3%", "hdi_97%", "r_hat"]]
mean sd hdi_3% hdi_97% r_hat
1-stage-model alpha_trt -0.261 0.052 -0.358 -0.165 1.0
beta_[trt] -9.933 0.262 -10.410 -9.440 1.0
beta_[income] -10.097 0.221 -10.503 -9.680 1.0
beta_ps -16.194 0.654 -17.454 -15.004 1.0
alpha_outcome 46.580 0.397 45.841 47.317 1.0
2-stage-model beta_[trt] -10.053 0.316 -10.640 -9.475 1.0
beta_[income] -10.593 0.192 -10.947 -10.232 1.0
alpha_outcome 42.202 0.415 41.418 42.976 1.0
beta_ps -7.146 0.938 -8.913 -5.427 1.0
Simple Regression beta_[trt] -8.534 0.338 -9.198 -7.920 1.0
beta_[income] -11.515 0.169 -11.831 -11.193 1.0
alpha_outcome 37.832 0.216 37.439 38.255 1.0

This kind of difference need not be very concerning, but we should check whether the bias stems from a difference in the latent propensity scores, as before.

idata_net["posterior"]["p"].quantile(
    [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
).round(3).values
array([0.181, 0.249, 0.31 , 0.369, 0.43 , 0.5  , 0.571, 0.643, 0.727])

The quantiles of the propensity distributions across both models seem different.

idata_treatment_2s_net["posterior"]["p"].quantile(
    [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
).round(3).values
array([0.239, 0.283, 0.317, 0.349, 0.379, 0.412, 0.449, 0.492, 0.556])
compare_propensity_dists(idata_treatment_2s_net, idata_net)
../_images/9e5dab90bb835d6b4e28a33f28895ac7ddcc3a0b4d3a73e276bb55f457fd3a17.png

Here we see some divergence between the propensity score estimates, indicating that the regression adjustment approach in the 2-stage outcome model is correcting for bias in the joint distribution. However, both credible intervals contain the reported treatment effect of -10, so the skew is perhaps less concerning. This is despite the fact that we removed the useful predictors temperature and health from the outcome model specification. The model leans on the information contained in the propensity score and weights the beta_ps coefficient appropriately.

LaLonde Example#

The LaLonde data set is famous because it highlights a problem with naive causal contrasts. It is discussed by Angrist and Pischke in Mostly Harmless Econometrics as an example of how regression controls can tolerably address selection effects in a way similar to propensity score weighting. So we should hope that a well-specified outcome model can plausibly identify the treatment effect here too.

lalonde = cp.load_data("lalonde")
lalonde[["hispan", "white"]] = pd.get_dummies(lalonde["race"], drop_first=True)
lalonde.dropna(inplace=True)
lalonde.head()
rownames treat age educ race married nodegree re74 re75 re78 hispan white
0 NSW1 1 37 11 black 1 1 0.0 0.0 9930.0460 False False
1 NSW2 1 22 9 hispan 0 1 0.0 0.0 3595.8940 True False
2 NSW3 1 30 12 black 0 0 0.0 0.0 24909.4500 False False
3 NSW4 1 27 11 black 0 1 0.0 0.0 7506.1460 False False
4 NSW5 1 33 8 black 0 1 0.0 0.0 289.7899 False False
lalonde.groupby("treat")["re78"].mean().diff()
treat
0           NaN
1   -635.026212
Name: re78, dtype: float64

The naive group difference suggests a negative effect. Let’s see how our two models fare when we remove predictors from the propensity model.

coords = {
    "betas": ["treat", "nodegree", "married"],
    "betas_trt": ["age", "educ", "hispan", "white", "married", "nodegree"],
    "obs": range(lalonde.shape[0]),
}

N = lalonde.shape[0]
X_trt = (
    lalonde[["age", "educ", "hispan", "white", "married", "nodegree"]]
    .astype(np.int32)
    .values
)
X_trt = (X_trt - X_trt.mean(axis=0)) / X_trt.std(axis=0)

X_outcome = lalonde[["treat", "nodegree", "married"]].astype(np.int32).values
# X_outcome = (X_outcome - X_outcome.mean()) / X_outcome.std()
T_data = lalonde["treat"].values
Y_data = lalonde["re78"].values


priors = {
    "beta_": [0, 4000],
    "beta_trt": [0, 1],
    "alpha_trt": [0, 1],
    "alpha_outcome": [2000, 500],
    "sigma": 500,
    "beta_ps": [0, 30],
}

lalonde_model = make_joint_model(
    X_trt, X_outcome, T_data, Y_data, coords, priors, noncentred=False
)

with lalonde_model:
    idata_lalonde = pm.sample(tune=5000)

(
    idata_treatment_2s_lalonde,
    idata_outcome_2s_lalonde,
    treatment_model_lalonde,
    outcome_model_lalonde,
) = make_2step_model(
    X_trt,
    X_outcome,
    T_data,
    Y_data,
    coords,
    priors=priors,
    spline_component=True,
    winsorize_boundary=0.1,
)

reg_model_lalonde, idata_outcome_simple_reg_lalonde = make_reg_model(
    X_outcome, Y_data, coords, priors=priors
)
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta_trt_, beta_, beta_ps, alpha_trt, alpha_outcome, sigma]
/Users/nathanielforde/mambaforge/envs/CausalPy/lib/python3.13/site-packages/pymc/step_methods/hmc/quadpotential.py:316: RuntimeWarning: overflow encountered in dot
  return 0.5 * np.dot(x, v_out)
/Users/nathanielforde/mambaforge/envs/CausalPy/lib/python3.13/site-packages/pymc/step_methods/hmc/quadpotential.py:316: RuntimeWarning: overflow encountered in dot
  return 0.5 * np.dot(x, v_out)

Sampling 4 chains for 5_000 tune and 1_000 draw iterations (20_000 + 4_000 draws total) took 6 seconds.
Sampling: [alpha_trt, beta_trt_std, t_pred]
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta_trt_std, alpha_trt]

Sampling 4 chains for 2_000 tune and 2_000 draw iterations (8_000 + 8_000 draws total) took 13 seconds.
Sampling: [alpha_outcome, beta_ps, beta_ps_spline, beta_std, like, sigma]
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta_std, beta_ps, alpha_outcome, beta_ps_spline, sigma]

Sampling 4 chains for 2_000 tune and 2_000 draw iterations (8_000 + 8_000 draws total) took 14 seconds.
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta_std, alpha_outcome, sigma]
/Users/nathanielforde/mambaforge/envs/CausalPy/lib/python3.13/site-packages/pymc/step_methods/hmc/quadpotential.py:316: RuntimeWarning: overflow encountered in dot
  return 0.5 * np.dot(x, v_out)
/Users/nathanielforde/mambaforge/envs/CausalPy/lib/python3.13/site-packages/pymc/step_methods/hmc/quadpotential.py:316: RuntimeWarning: overflow encountered in dot
  return 0.5 * np.dot(x, v_out)

Sampling 4 chains for 2_000 tune and 2_000 draw iterations (8_000 + 8_000 draws total) took 11 seconds.
ax = az.plot_forest(
    [idata_outcome_2s_lalonde, idata_lalonde, idata_outcome_simple_reg_lalonde],
    var_names=["beta_", "sigma"],
    model_names=["2 Stage", "1 Stage", "Simple Regression"],
    combined=True,
    figsize=(10, 4),
)
# Experimental benchmark between 1600 - 1800
ax[0].axvline(1700, color="black")
ax[0].set_title("Comparing Joint and 2 Stage Propensity Score Parameter Fits");
../_images/2435e87c13f5a29b3b1c17c2b909ff40632afdda754c567e6b3f452d8124c4b5.png
compare_estimate = pd.concat(
    {
        "1-stage-model": az.summary(
            idata_lalonde,
            var_names=["alpha_trt", "beta_", "beta_ps", "alpha_outcome", "sigma"],
        ),
        "2-stage-model": az.summary(
            idata_outcome_2s_lalonde,
            var_names=["beta_", "beta_ps", "alpha_outcome", "sigma"],
        ),
        "Simple Regression": az.summary(
            idata_outcome_simple_reg_lalonde,
            var_names=["beta_", "alpha_outcome", "sigma"],
        ),
    }
)
compare_estimate[["mean", "sd", "hdi_3%", "hdi_97%", "r_hat"]]
mean sd hdi_3% hdi_97% r_hat
1-stage-model alpha_trt -1.360 0.136 -1.607 -1.096 1.0
beta_[treat] 1596.367 592.252 522.004 2698.841 1.0
beta_[nodegree] -478.468 494.181 -1411.118 431.650 1.0
beta_[married] 3941.883 521.193 2935.781 4905.149 1.0
beta_ps 1.515 30.782 -59.159 56.783 1.0
alpha_outcome 4318.250 369.288 3624.124 5022.575 1.0
sigma 6567.379 152.789 6262.905 6837.376 1.0
2-stage-model beta_[treat] 1588.023 560.154 606.842 2705.402 1.0
beta_[nodegree] -459.243 493.565 -1378.890 456.611 1.0
beta_[married] 3938.877 513.101 2988.612 4920.545 1.0
beta_ps -0.041 29.752 -56.003 56.809 1.0
alpha_outcome 4311.600 368.082 3627.173 5011.784 1.0
sigma 6569.193 153.745 6279.593 6855.407 1.0
Simple Regression beta_[treat] 1582.104 582.999 493.442 2672.645 1.0
beta_[nodegree] -454.568 492.429 -1360.717 461.760 1.0
beta_[married] 3926.351 519.744 2930.813 4873.111 1.0
alpha_outcome 4320.328 368.753 3628.037 4996.010 1.0
sigma 6568.130 149.699 6293.193 6860.301 1.0

The model estimates are basically identical. We should expect mirrored propensity score distributions, indicating that feedback wasn’t an issue. The outcome model leaned primarily on the covariate profile \(X\) to derive the treatment effect estimate.

compare_propensity_dists(idata_treatment_2s_lalonde, idata_lalonde)
../_images/1ef53abd8cddb16b74a68e77e7e471d166249810ac747290a23223e5f50e61a4.png

Both estimates of the treatment effect accord well with values in the literature. We can be happy that both models pick up the treatment effect reasonably well, despite having a minimalist outcome model. This is because, in the Bayesian setting, conditioning on \(X, T\) is generally sufficient when there are no unmeasured confounders. Here the model largely ignores the propensity score coefficient beta_ps.

NHEFS#

Finally we turn to the NHEFS data as discussed by Hernan in [Hernan, 2024]. This data set is known to have a complex covariate profile for measuring aspects of smokers’ health. We might suspect that there is some unmeasured confounding in this data set that would be hard to pick up with simple regression controls.

df = cp.load_data("nhefs")
df[["age", "race", "trt", "smokeintensity", "smokeyrs", "outcome"]].head()
age race trt smokeintensity smokeyrs outcome
0 42 1 0 30 29 -10.093960
1 36 0 0 20 24 2.604970
2 56 1 0 20 26 9.414486
3 68 1 0 3 53 4.990117
4 40 0 0 20 19 4.989251
coords = {
    "betas": ["trt", "age", "race", "sex", "smokeintensity", "smokeyrs", "wt71"],
    "betas_trt": [
        "age",
        "race",
        "sex",
        "smokeintensity",
        "smokeyrs",
        "wt71",
        "active_1",
        "active_2",
        "education_2",
        "education_3",
        "education_4",
        "education_5",
        "exercise_1",
        "exercise_2",
        "age^2",
        "wt71^2",
        "smokeintensity^2",
        "smokeyrs^2",
    ],
    "obs": range(df.shape[0]),
}

N = df.shape[0]
X_trt = df[
    [
        "age",
        "race",
        "sex",
        "smokeintensity",
        "smokeyrs",
        "wt71",
        "active_1",
        "active_2",
        "education_2",
        "education_3",
        "education_4",
        "education_5",
        "exercise_1",
        "exercise_2",
        "age^2",
        "wt71^2",
        "smokeintensity^2",
        "smokeyrs^2",
    ]
]
X_trt = (X_trt - X_trt.mean(axis=0)) / X_trt.std(axis=0)
# Note the significantly reduced outcome model specification.
X_outcome = df[["trt", "age", "race", "sex", "smokeintensity", "smokeyrs", "wt71"]]
X_outcome = (X_outcome - X_outcome.mean(axis=0)) / X_outcome.std(axis=0)
T_data = df["trt"].values
X_outcome["trt"] = T_data
Y_data = df["outcome"].values

priors = {
    "beta_": [0, 3],
    "beta_trt": [0, 1],
    "alpha_trt": [0, 1],
    "alpha_outcome": [0, 5],
    "sigma": 10,
    "beta_ps": [0, 10],
}

nhefs_model = make_joint_model(X_trt, X_outcome, T_data, Y_data, coords, priors)
with nhefs_model:
    idata_nhefs = pm.sample(**sampler_kwargs)


(
    idata_treatment_2s_nhefs,
    idata_outcome_2s_nhefs,
    treatment_model_nhefs,
    outcome_model_nhefs,
) = make_2step_model(
    X_trt,
    X_outcome,
    T_data,
    Y_data,
    coords,
    priors=priors,
    spline_component=True,
)


reg_model_nhefs, idata_outcome_simple_reg_nhefs = make_reg_model(
    X_outcome, Y_data, coords, priors=priors
)
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta_trt_std, beta_std, beta_ps, alpha_trt, alpha_outcome, sigma]

Sampling 4 chains for 2_000 tune and 2_000 draw iterations (8_000 + 8_000 draws total) took 56 seconds.
Sampling: [alpha_trt, beta_trt_std, t_pred]
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta_trt_std, alpha_trt]

Sampling 4 chains for 2_000 tune and 2_000 draw iterations (8_000 + 8_000 draws total) took 40 seconds.
Sampling: [alpha_outcome, beta_ps, beta_ps_spline, beta_std, like, sigma]
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta_std, beta_ps, alpha_outcome, beta_ps_spline, sigma]

Sampling 4 chains for 2_000 tune and 2_000 draw iterations (8_000 + 8_000 draws total) took 29 seconds.
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta_std, alpha_outcome, sigma]
/Users/nathanielforde/mambaforge/envs/CausalPy/lib/python3.13/site-packages/pymc/step_methods/hmc/quadpotential.py:316: RuntimeWarning: overflow encountered in dot
  return 0.5 * np.dot(x, v_out)

Sampling 4 chains for 2_000 tune and 2_000 draw iterations (8_000 + 8_000 draws total) took 13 seconds.
ax = az.plot_forest(
    [idata_outcome_2s_nhefs, idata_nhefs, idata_outcome_simple_reg_nhefs],
    var_names=["beta_"],
    model_names=["2 Stage", "1 Stage", "Simple Regression"],
    combined=True,
    figsize=(10, 4),
)

ax[0].axvline(3.4, label="True Treatment Value", color="k")
ax[0].set_title("Comparing Joint and 2 Stage Propensity Score Parameter Fits");
../_images/1bbce149e62c5d718d201c969669dc293bc7042adfc4788fa4c6a12a43506597.png

Here we see the model specifications start to come apart slightly, although the feedback seems to influence primarily the estimated impact of age on the outcome rather than the treatment estimate.

compare_estimate = pd.concat(
    {
        "1-stage-model": az.summary(
            idata_nhefs, var_names=["alpha_trt", "beta_", "alpha_outcome", "beta_ps"]
        ),
        "2-stage-model": az.summary(
            idata_outcome_2s_nhefs, var_names=["beta_", "alpha_outcome", "beta_ps"]
        ),
        "Simple Regression": az.summary(
            idata_outcome_simple_reg_nhefs, var_names=["beta_", "alpha_outcome"]
        ),
    }
)
compare_estimate[["mean", "sd", "hdi_3%", "hdi_97%", "r_hat"]]
mean sd hdi_3% hdi_97% r_hat
1-stage-model alpha_trt -1.146 0.061 -1.262 -1.034 1.0
beta_[trt] 3.487 0.445 2.637 4.310 1.0
beta_[age] -1.018 0.620 -2.187 0.128 1.0
beta_[race] -0.323 0.273 -0.822 0.195 1.0
beta_[sex] -1.145 0.295 -1.706 -0.599 1.0
beta_[smokeintensity] -0.431 0.319 -1.024 0.175 1.0
beta_[smokeyrs] -0.303 0.540 -1.318 0.702 1.0
beta_[wt71] -1.221 0.281 -1.742 -0.698 1.0
alpha_outcome 5.204 1.002 3.223 7.004 1.0
beta_ps -13.476 3.820 -20.775 -6.420 1.0
2-stage-model beta_[trt] 3.319 0.435 2.517 4.155 1.0
beta_[age] -1.937 0.469 -2.824 -1.068 1.0
beta_[race] -0.101 0.257 -0.575 0.392 1.0
beta_[sex] -0.940 0.255 -1.409 -0.454 1.0
beta_[smokeintensity] -0.052 0.255 -0.542 0.420 1.0
beta_[smokeyrs] 0.317 0.424 -0.510 1.094 1.0
beta_[wt71] -1.393 0.222 -1.818 -0.980 1.0
alpha_outcome 1.363 2.904 -4.256 6.557 1.0
beta_ps -0.023 10.048 -18.154 19.192 1.0
Simple Regression beta_[trt] 3.270 0.435 2.450 4.087 1.0
beta_[age] -2.427 0.387 -3.142 -1.688 1.0
beta_[race] 0.204 0.193 -0.141 0.577 1.0
beta_[sex] -0.685 0.218 -1.096 -0.278 1.0
beta_[smokeintensity] 0.270 0.199 -0.088 0.666 1.0
beta_[smokeyrs] 0.547 0.393 -0.203 1.286 1.0
beta_[wt71] -1.543 0.208 -1.932 -1.151 1.0
alpha_outcome 1.795 0.216 1.371 2.184 1.0
compare_propensity_dists(idata_treatment_2s_nhefs, idata_nhefs)
../_images/ce3f52f80a96a0559f2d07637d66c2f680dfabc62c23b5f1ae6f86e0d77e1afd.png

Two-Stage Outcome Modelling with CausalPy#

Next we show how to achieve these steps with the simpler CausalPy experiment API.

formula = """trt ~ 1 + age + race + sex + smokeintensity + smokeyrs + wt71 + active_1 + active_2 + 
         education_2 + education_3 + education_4 + education_5 + exercise_1 + exercise_2"""

df_standardised = (df - df.mean(axis=0)) / df.std(axis=0)
df_standardised["trt"] = df["trt"]
df_standardised["outcome"] = df["outcome"]
result = cp.InversePropensityWeighting(
    df_standardised,
    formula=formula,
    outcome_variable="outcome",
    weighting_scheme="robust",  ## Will be used by plots after estimation if no other scheme is specified.
    model=cp.pymc_models.PropensityScore(
        sample_kwargs={
            "chains": 4,
            "tune": 2000,
            "draws": 10000,
            "target_accept": 0.95,
            "random_seed": 18,
            "progressbar": False,
            "mp_ctx": "spawn",
        },
    ),
)

result
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta_std]
Sampling 4 chains for 2_000 tune and 10_000 draw iterations (8_000 + 40_000 draws total) took 26 seconds.
Sampling: [beta_std, t_pred]
Sampling: [t_pred]
<causalpy.experiments.inverse_propensity_weighting.InversePropensityWeighting at 0x29d4acd70>

Comparing Inverse Propensity Score Weighting and Covariate Adjustment#

The two-step procedure doesn’t just apply to the regression adjustment methods we’ve seen here; it can be used to apply inverse weighting techniques too.
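For intuition, a raw (Horvitz-Thompson style) inverse-weighted contrast can be sketched by hand from posterior-mean propensity scores. This is only a rough illustration of what such weighting computes, reusing the 2-stage NHEFS treatment model from above; CausalPy’s weighting schemes are more refined:

p_hat = az.extract(idata_treatment_2s_nhefs)["p"].mean(dim="sample").values
t, y = T_data, Y_data  # NHEFS treatment and outcome arrays from above
ate_ipw_raw = np.mean(t * y / p_hat) - np.mean((1 - t) * y / (1 - p_hat))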

result.plot_ate(result.idata);
/Users/nathanielforde/mambaforge/envs/CausalPy/lib/python3.13/site-packages/IPython/core/events.py:82: UserWarning: Creating legend with loc="best" can be slow with large amounts of data.
  func(*args, **kwargs)
/Users/nathanielforde/mambaforge/envs/CausalPy/lib/python3.13/site-packages/IPython/core/pylabtools.py:170: UserWarning: Creating legend with loc="best" can be slow with large amounts of data.
  fig.canvas.print_figure(bytes_io, **kw)
../_images/4593bbe2b3fb16f182124ca82164cb787af9201c3bf359fc07a15b15b0dd6386.png

which can be compared against the two-step regression adjustment here.

idata_outcome_cp, model_outcome_cp = result.model.fit_outcome_model(
    result.X_outcome,
    result.y,
    result.coords,
    priors={"b_outcome": [0, 3], "sigma": 10, "beta_ps": [0, 10]},
    noncentred=True,
    normal_outcome=True,
)
Sampling: [beta_ps, beta_std, like, sigma]
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta_std, beta_ps, sigma]
Sampling 4 chains for 2_000 tune and 10_000 draw iterations (8_000 + 40_000 draws total) took 47 seconds.

Yielding similar, but not identical results.

az.summary(idata_outcome_cp, var_names=["beta_"])
mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_bulk ess_tail r_hat
beta_[Intercept] 2.093 1.344 -0.434 4.612 0.014 0.009 9380.0 14843.0 1.0
beta_[age] -2.210 0.721 -3.543 -0.835 0.007 0.004 10668.0 17142.0 1.0
beta_[race] 0.151 0.312 -0.445 0.727 0.003 0.002 12175.0 20768.0 1.0
beta_[sex] -0.797 0.409 -1.562 -0.024 0.004 0.002 11010.0 17907.0 1.0
beta_[smokeintensity] 0.178 0.337 -0.469 0.797 0.003 0.002 12015.0 19490.0 1.0
beta_[smokeyrs] 0.467 0.602 -0.682 1.575 0.006 0.003 11849.0 19291.0 1.0
beta_[wt71] -1.509 0.214 -1.920 -1.112 0.001 0.001 33903.0 28828.0 1.0
beta_[active_1] -0.539 0.214 -0.930 -0.122 0.001 0.001 29101.0 29625.0 1.0
beta_[active_2] -0.116 0.207 -0.512 0.266 0.001 0.001 31383.0 29074.0 1.0
beta_[education_2] 0.416 0.253 -0.071 0.881 0.002 0.001 25487.0 28436.0 1.0
beta_[education_3] 0.398 0.275 -0.137 0.904 0.002 0.001 23112.0 26560.0 1.0
beta_[education_4] 0.360 0.249 -0.117 0.821 0.002 0.001 20161.0 26766.0 1.0
beta_[education_5] 0.030 0.302 -0.514 0.624 0.002 0.001 15996.0 23783.0 1.0
beta_[exercise_1] 0.209 0.279 -0.317 0.737 0.002 0.001 20495.0 26199.0 1.0
beta_[exercise_2] 0.272 0.332 -0.336 0.913 0.003 0.002 14760.0 21435.0 1.0
beta_[trt] 3.318 0.434 2.476 4.105 0.002 0.002 40864.0 30723.0 1.0
ax = az.plot_posterior(
    idata_outcome_cp,
    var_names=["beta_"],
    coords={"outcome_coeffs": ["trt"]},
    figsize=(10, 5),
    ref_val=3.5,
)
ax.set_title(
    "Treatment Effect Estimated using 2-stage \n Regression Adjustment Approach with reduced covariates"
);
../_images/56face19d3071af3497787ba61f465679b2736c155add462265adb781156c71b.png

Conclusion: Modularity as Causal Discipline#

When attempting to estimate treatment effects using Bayesian inference, a natural but risky strategy is to fit a joint model for both the treatment assignment and the outcome. That is, to specify a full model and infer the parameters of both components simultaneously.

However, this joint approach introduces a feedback loop: the outcome \(Y\) can influence the estimation of the treatment mechanism \(P(T | X)\). This violates the original logic of design-based inference, where treatment assignment should be modeled independently of the observed outcomes. This phenomenon is often subtle but can lead to biased treatment effect estimates.

Across several examples, we have shown that fitting a full joint model distorts the treatment effect estimate relative to a two-step (modular) approach. In other cases, joint and modular approaches yield nearly identical estimates, usually when the outcome is well identified from covariates alone. With these observations in scope, we recommend that practitioners generally follow a two-step or modular approach: either two-stage inverse propensity score weighting, or regression adjustment with the propensity score as an additional covariate. Both methods are now available in CausalPy.

Framed this way, we can see that the joint model violates the temporal precedence of the treatment assignment over the outcome process. The two-stage Bayesian procedure ensures that the causal ordering encoded in the actual data generating process is respected in the estimation process. The confounding adjustment achieved with the propensity score must occur without access to information about the outcome. A well-specified propensity score model can substantially improve causal estimates (as we’ve seen), especially when the outcome model is weak or misspecified. This allows for a form of doubly-robust inference as discussed in [Aronow and Miller, 2019] and [Fan et al., 2023]. Propensity scores do not serve only to reduce dimensionality; they formalize the treatment mechanism and encode information that the outcome model might fail to recover. This explains their continued prominence in modern causal inference and their usefulness in the Bayesian setting.
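For reference, the doubly-robust idea can be sketched with the augmented IPW estimator: it combines an outcome model and a propensity model and remains consistent if either is correctly specified. Here mu1 and mu0 (per-observation predicted outcomes under treatment and control) and p_hat are hypothetical arrays, not objects defined in this notebook:

def aipw_ate(y, t, p_hat, mu1, mu0):
    """Augmented IPW (doubly robust) sketch of the ATE."""
    return np.mean(
        (mu1 - mu0)                          # outcome-model contrast
        + t * (y - mu1) / p_hat              # IPW correction, treated
        - (1 - t) * (y - mu0) / (1 - p_hat)  # IPW correction, control
    )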

References#

[1]

P. Aronow and B. Miller. Foundations of Agnostic Statistics. Cambridge University Press, 2019.

[2]

Fan Li, Peng Ding, and Fabrizia Mealli. Bayesian causal inference: a critical review. Philosophical Transactions of the Royal Society A, 2023.

[3]

G. J. Griffith, T. T. Morris, M. J. Tudball, A. Herbert, G. Mancano, L. Pike, G. C. Sharp, J. Sterne, T. M. Palmer, G. Davey Smith, K. Tilling, L. Zuccolo, N. M. Davies, and G. Hemani. Collider bias undermines our understanding of COVID-19 disease risk and severity. Nature Communications, 2020.

[4]

Miguel A. Hernán and James M. Robins. Causal Inference: What If. Chapman & Hall/CRC, Boca Raton, 2024.

Watermark#

%load_ext watermark
%watermark -n -u -v -iv -w -p pytensor,xarray
Last updated: Tue Jul 29 2025

Python implementation: CPython
Python version       : 3.13.5
IPython version      : 9.4.0

pytensor: 2.31.7
xarray  : 2025.7.0

matplotlib: 3.10.3
arviz     : 0.21.0
pandas    : 2.3.1
causalpy  : 0.4.2
patsy     : 1.0.1
pymc      : 5.23.0
numpy     : 2.3.1

Watermark: 2.5.0