Survey-weighted Difference-in-Differences: combining balance and diff-diff on a BRFSS-style smoking-ban policy¶

This tutorial shows the full end-to-end workflow for survey-weighted causal inference in pure Python, using:

  1. balance to reweight a non-probability (or response-rate-decayed probability) sample to a target population frame via inverse-propensity weighting.
  2. diff_diff to estimate a modern Callaway-Sant'Anna doubly-robust Difference-in-Differences with built-in survey-design variance and HonestDiD sensitivity.
  3. The thin adapter balance.interop.diff_diff (added in balance 0.21) that hands a balance.Sample to a diff-diff estimator without any manual weight_pre_adjust clean-up or SurveyDesign wiring.

We work through a stylised version of a real public-health question: Did State X's 2020 indoor-smoking ban reduce adult asthma prevalence relative to bordering states without bans, 2018-2024? The microdata shape mirrors the public-use BRFSS file (CDC BRFSS 2024), but we generate it synthetically via dd.generate_survey_did_data so the notebook is self-contained and deterministic - every cell runs in under 30 seconds on a laptop. The cell that loads the synthetic frame is the one you would replace with a real pyreadstat.read_xport(...) call when running on actual BRFSS XPT files.

Why this is the right demo for the integration. BRFSS is a complex telephone survey with declining response rates (now ~45% combined landline/cell, see Pew 2022 nonprobability panels report). State-level policy rollouts are a quintessential staggered-DiD design. Modern DiD estimators (Callaway-Sant'Anna, Sun-Abraham, BJS) provide doubly-robust ATT(g, t) estimation, but they need a clean weight column to be design-consistent. balance is the Python tool that produces that column; diff-diff is the Python tool that consumes it correctly. Without the adapter, users have to manually strip balance's weight_pre_adjust book-keeping columns, hard-code weight_type="pweight" to satisfy CallawaySantAnna's guard, and rebuild the diagnostics dict on every notebook. This tutorial shows how the adapter collapses that into a single import.

Prerequisites. Familiarity with pandas, basic causal-inference vocabulary (treatment, control, parallel trends), and one prior balance tutorial (we recommend the balance quickstart first). No R or Stata fluency is assumed.

Setup¶

You will need a build of balance that ships balance.interop.diff_diff (the v0.21 release; see CHANGELOG.md) and diff-diff (>= 3.3.0). The simplest install is:

pip install "balance[did]"

which pulls diff-diff>=3.3.0,<4 as an optional extra. If you already have balance and just want to add diff-diff, pip install diff-diff is sufficient.

In [1]:
# Standard scientific stack
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# balance - reweighting against a target frame
import balance
from balance import Sample

# balance.interop.diff_diff - the thin adapter (added in balance 0.21)
from balance.interop import diff_diff as bd
from balance.interop.diff_diff import (
    as_balance_diagnostic,
    fit_did,
    to_survey_design,
)

# diff-diff - survey-aware DiD estimators + sensitivity + DGP
import diff_diff as dd
from diff_diff import (
    CallawaySantAnna,
    SurveyDesign,
    aggregate_survey,
    compute_honest_did,
    generate_survey_did_data,
)

# Reproducibility: every random draw in this notebook flows from this seed.
RNG = np.random.default_rng(20260430)

# Quiet a benign FutureWarning from balance 0.20 about
# Sample.weight_column returning str|None - superseded in 0.21.
warnings.filterwarnings("ignore", category=FutureWarning, module="balance")

print(f"balance:   {balance.__version__}")
print(f"diff-diff: {dd.__version__}")
INFO (2026-05-07 22:06:12,901) [__init__/<module> (line 76)]: Using balance version 0.20.0
INFO (2026-05-07 22:06:12,902) [__init__/<module> (line 81)]: 
balance (Version 0.20.0) loaded:
    📖 Documentation: https://import-balance.org/
    🛠️ Help / Issues: https://github.com/facebookresearch/balance/issues/
    📄 Citation:
        Sarig, T., Galili, T., & Eilat, R. (2023).
        balance - a Python package for balancing biased data samples.
        https://arxiv.org/abs/2307.06024

    Tip: You can view this message anytime with balance.help()

balance:   0.20.0
diff-diff: 3.3.2

The dataset¶

We work with a synthetic frame that mirrors the BRFSS 2018-2024 public-use microdata. BRFSS is a state-stratified telephone survey of adults; from 2011 onward it samples both landlines and cell phones, and combined response rates have been declining since 2010 (see CDC's 2024 codebook). That declining response rate is the reason it is a non-probability-ish frame even though the design weights _LLCPWT are nominally probability weights: the response-rate decline is differential by age, race, and education, so the realised sample is biased even after applying the design weights.

Our synthetic frame covers 7 years x 50 states. The generator emits one row per state-year, so the microdata has 350 rows (a real BRFSS extract would hold hundreds of respondents per state-year cell), with:

  • state (categorical, 50 states); year (2018-2024); quarter (state-year-quarter index for the panel-aggregation step).
  • Demographics: age_band and educa (continuous covariates inherited from dd.generate_survey_did_data's x1 / x2 columns), plus sex and race (categoricals generated below) - these are the covariates we will reweight against ACS-shaped target distributions via IPW (adjust(method="ipw")).
  • Outcome asthnow - a continuous asthma-burden index standing in for a has-asthma-now indicator (the synthetic generator emits a continuous outcome, as the data preview below shows), modeled as a declining trend plus a treatment-effect drop in treated states post-2020.
  • design_weight - synthetic stand-in for BRFSS _LLCPWT.
  • first_treat_year - the year when each state's smoking ban took effect (0 for never-treated controls).

The synthetic generator (dd.generate_survey_did_data) is deterministic given a seed, so this notebook re-runs identically across machines - important for CI and for talk demos.
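The determinism claim is just the standard seeded-Generator contract in NumPy: rebuilding a Generator from the same seed replays the same stream. A quick sketch:

```python
import numpy as np

def draw(seed: int) -> np.ndarray:
    # A fresh Generator built from the same seed replays the same stream,
    # which is what makes a seeded synthetic DGP identical across machines.
    rng = np.random.default_rng(seed)
    return rng.normal(size=5)

same = np.array_equal(draw(20260430), draw(20260430))
print(same)  # True: the notebook re-runs identically under one seed
```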

In [2]:
# diff-diff's built-in survey-DiD generator. Returns microdata as a flat
# DataFrame with columns (unit, period, outcome, first_treat, treated,
# true_effect, stratum, psu, fpc, weight, x1, x2).
brfss = generate_survey_did_data(
    n_units=50,                    # 50 states
    n_periods=7,                   # panel periods 1-7 (mapped to 2018-2024)
    cohort_periods=[3, 5],         # treated states adopt in periods 3 and 5
    never_treated_frac=0.6,        # 60% never-treated, 40% treated
    treatment_effect=-0.06,        # average drop of 0.06 in asthnow for treated
    n_strata=5,
    psu_per_stratum=8,
    fpc_per_stratum=200.0,
    weight_variation="moderate",
    add_covariates=True,           # ships x1, x2 continuous covariates
    informative_sampling=True,     # selection bias on covariates
    seed=20260430,
)

# Re-name to BRFSS-style column names so the rest of the tutorial reads
# like a real epidemiologist's notebook.
df = brfss.rename(
    columns={
        "unit": "state",
        "period": "year",
        "first_treat": "first_treat_year",
        "outcome": "asthnow",
        "weight": "design_weight",
        "x1": "age_band",
        "x2": "educa",
    }
).copy()

# Map period index 1..7 onto calendar years 2018..2024 so plots and
# df.query("year == 2018") read naturally.
df["year"] = df["year"] + 2017
treated_mask = df["first_treat_year"] > 0
df.loc[treated_mask, "first_treat_year"] = (
    df.loc[treated_mask, "first_treat_year"] + 2017
)

df["sex"] = RNG.choice(["male", "female"], size=len(df))
df["race"] = RNG.choice(
    ["white", "black", "hispanic", "asian", "other"],
    p=[0.60, 0.15, 0.18, 0.05, 0.02],
    size=len(df),
)
# Quarter index for the panel - for this synthetic data, one quarter per year.
df["quarter"] = df["year"]
df["id"] = np.arange(len(df))

print("Microdata shape:", df.shape)
df.head()
Microdata shape: (350, 16)
Out[2]:
state year asthnow first_treat_year treated true_effect stratum psu fpc design_weight age_band educa sex race quarter id
0 0 2018 -1.269343 0 0 0.0 0 0 200.0 0.727273 0.968506 1 female white 2018 0
1 1 2018 4.930174 0 0 0.0 0 1 200.0 1.818182 0.984487 1 female white 2018 1
2 2 2018 -2.085724 0 0 0.0 0 2 200.0 0.181818 -0.353586 1 male black 2018 2
3 3 2018 3.763752 0 0 0.0 0 3 200.0 1.636364 0.347321 0 female hispanic 2018 3
4 4 2018 -1.140357 0 0 0.0 0 4 200.0 0.545455 -1.660916 0 female white 2018 4

Step 1 - Inspect the survey data¶

Before any weighting or DiD, we look at the panel structure. Three quantities matter for what follows:

  1. Panel shape: how many unit x period cells we have, and how balanced each cell is.
  2. Treatment column first_treat_year: the staggered-adoption indicator. 0 means never-treated.
  3. Response indicator (implicit here, but in real BRFSS it would be the proportion of contacted respondents who completed the survey). In our synthetic frame, the informative_sampling=True flag passed to generate_survey_did_data above (combined with weight_variation="moderate") generates under-representation of younger / lower-education respondents - exactly the BRFSS pandemic-era pathology.

balance will fix the third issue (selection-on-observables); diff_diff will give us a design-consistent ATT estimate that survives parallel-trends sensitivity checks.
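The covariate-imbalance diagnostic that drives everything downstream is the absolute standardized mean difference (ASMD) that adj.summary() reports later. A minimal sketch with hypothetical toy data, assuming the common target-SD scaling (not necessarily balance's exact formula):

```python
import numpy as np
import pandas as pd

def asmd(sample: pd.Series, target: pd.Series) -> float:
    """Absolute standardized mean difference, scaled by the target SD
    (one common convention; an illustration, not balance's implementation)."""
    return float(abs(sample.mean() - target.mean()) / target.std(ddof=1))

rng = np.random.default_rng(0)
# Hypothetical toy data: the realised sample skews younger than the frame.
sample_age = pd.Series(rng.normal(loc=40, scale=12, size=500))
target_age = pd.Series(rng.normal(loc=47, scale=12, size=500))
print(asmd(sample_age, target_age) > 0.1)  # True - above the 0.1 cutoff
```

A covariate is conventionally called balanced once its ASMD drops below 0.1, which is the reference line the Love plot draws later in this tutorial.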

In [3]:
# Cell counts per state-year (one row per state-year in this synthetic frame)
cell_counts = (
    df.groupby(["state", "year"]).size().unstack("year")
)
print("Cell counts (state x year), first 5 states:")
print(cell_counts.head())

# Treatment cohorts
print("\nFirst-treat-year distribution:")
print(df.drop_duplicates("state")["first_treat_year"].value_counts().sort_index())

# A quick visual: outcome trends by treated/untreated
fig, ax = plt.subplots(figsize=(7, 4))
for label, sub in df.assign(
    cohort=np.where(df["first_treat_year"] > 0, "treated", "control")
).groupby("cohort"):
    sub.groupby("year")["asthnow"].mean().plot(
        ax=ax, marker="o", label=label,
    )
ax.set_ylabel("asthnow (mean)")
ax.set_xlabel("year")
ax.set_title("Raw outcome trends - pre-balance, pre-DiD")
ax.legend(title="cohort")
plt.tight_layout()
plt.show()
Cell counts (state x year), first 5 states:
year   2018  2019  2020  2021  2022  2023  2024
state                                          
0         1     1     1     1     1     1     1
1         1     1     1     1     1     1     1
2         1     1     1     1     1     1     1
3         1     1     1     1     1     1     1
4         1     1     1     1     1     1     1

First-treat-year distribution:
first_treat_year
0       30
2020    10
2022    10
Name: count, dtype: int64
[Figure: raw asthnow trends by cohort (treated vs control), pre-balance, pre-DiD]

Step 2 - Reweight to ACS demographic marginals via balance¶

BRFSS' design weights _LLCPWT correct for sample design (state stratum, landline/cell mix) but not for non-response. After the 2020 response-rate shock, that gap matters: younger respondents are under-represented, which biases asthma prevalence upward (older people have more asthma). We close that gap by reweighting against ACS demographic marginals - a standard balance use case.

In a real workflow target_df would come from pd.read_csv("acs_marginals_2018_2024.csv") or the Census API. Here we just take the empirical marginals from the first year of the panel as our population frame - this stands in for "what the demographic distribution should be."

After adjust(method="ipw"):

  • The active weight column on the returned Sample is "weight".
  • Compounded adjust() calls retain a history of intermediate weights in weight_pre_adjust / weight_adjusted_* columns; the adapter strips these before fitting the DiD so they aren't silently treated as covariates.
  • sample.diagnostics() gives ASMD pre/post, Kish ESS, design-effect.
  • sample.weight_column is the column NAME (str | None), not a Series - balance.interop.diff_diff.to_survey_design honours this contract.
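The second bullet is the part most users trip over when wiring balance output into a DiD estimator by hand. A rough sketch of that clean-up step, with column names taken from the bullets above (illustrative only, not the adapter's actual code):

```python
import pandas as pd

def strip_weight_history(df: pd.DataFrame) -> pd.DataFrame:
    """Drop balance's intermediate-weight bookkeeping columns so a DiD
    estimator never silently treats them as covariates (a sketch of the
    adapter's clean-up step, not its implementation)."""
    history = [
        c for c in df.columns
        if c == "weight_pre_adjust" or c.startswith("weight_adjusted")
    ]
    return df.drop(columns=history)

frame = pd.DataFrame({
    "asthnow": [0.2, 1.1],
    "weight": [1.2, 0.8],               # the active weight column
    "weight_pre_adjust": [1.0, 1.0],    # bookkeeping, must not leak
    "weight_adjusted_ipw": [1.2, 0.8],  # bookkeeping, must not leak
})
print(list(strip_weight_history(frame).columns))  # ['asthnow', 'weight']
```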
In [4]:
# Build the ACS-like target as the first-year demographic distribution.
target_df = (
    df.query("year == 2018")
    [["age_band", "sex", "race", "educa"]]
    .assign(weight=1.0)
    .reset_index(drop=True)
)
target_df["id"] = np.arange(len(target_df))

# Build a balance.Sample from the full panel
sample = Sample.from_frame(
    df,
    weight_column="design_weight",
    outcome_columns=["asthnow"],
)
target = Sample.from_frame(target_df)

# IPW adjustment - logistic regression with LASSO regularization, the default.
adj = sample.set_target(target).adjust(
    method="ipw",
    variables=["age_band", "sex", "race", "educa"],
)

# Pre/post ASMD, Kish ESS, design effect
print(adj.summary())

# Love-style ASMD plot - the diagnostic epidemiologists expect to see in
# the methods appendix.
adj.covars().plot()
WARNING (2026-05-07 22:06:13,198) [input_validation/guess_id_column (line 336)]: Guessed id column name id for the data
WARNING (2026-05-07 22:06:13,199) [sample_frame/from_frame (line 280)]: Casting id column to string
WARNING (2026-05-07 22:06:13,212) [pandas_utils/_warn_of_df_dtypes_change (line 519)]: The dtypes of SampleFrame._df were changed from the original dtypes of the input df, here are the differences - 
WARNING (2026-05-07 22:06:13,213) [pandas_utils/_warn_of_df_dtypes_change (line 530)]: The (old) dtypes that changed for df (before the change):
WARNING (2026-05-07 22:06:13,215) [pandas_utils/_warn_of_df_dtypes_change (line 533)]: 
treated             int64
psu                 int64
quarter             int64
first_treat_year    int64
stratum             int64
educa               int64
state               int64
id                  int64
year                int64
dtype: object
WARNING (2026-05-07 22:06:13,215) [pandas_utils/_warn_of_df_dtypes_change (line 534)]: The (new) dtypes saved in df (after the change):
WARNING (2026-05-07 22:06:13,217) [pandas_utils/_warn_of_df_dtypes_change (line 535)]: 
treated             float64
psu                 float64
quarter             float64
first_treat_year    float64
stratum             float64
educa               float64
state               float64
id                      str
year                float64
dtype: object
WARNING (2026-05-07 22:06:13,218) [input_validation/guess_id_column (line 336)]: Guessed id column name id for the data
WARNING (2026-05-07 22:06:13,219) [sample_frame/from_frame (line 280)]: Casting id column to string
WARNING (2026-05-07 22:06:13,230) [pandas_utils/_warn_of_df_dtypes_change (line 519)]: The dtypes of SampleFrame._df were changed from the original dtypes of the input df, here are the differences - 
WARNING (2026-05-07 22:06:13,230) [pandas_utils/_warn_of_df_dtypes_change (line 530)]: The (old) dtypes that changed for df (before the change):
WARNING (2026-05-07 22:06:13,232) [pandas_utils/_warn_of_df_dtypes_change (line 533)]: 
educa    int64
id       int64
dtype: object
WARNING (2026-05-07 22:06:13,232) [pandas_utils/_warn_of_df_dtypes_change (line 534)]: The (new) dtypes saved in df (after the change):
WARNING (2026-05-07 22:06:13,233) [pandas_utils/_warn_of_df_dtypes_change (line 535)]: 
educa    float64
id           str
dtype: object
WARNING (2026-05-07 22:06:13,234) [sample_frame/from_frame (line 321)]: Guessing weight column is 'weight'
WARNING (2026-05-07 22:06:13,235) [balance_frame/_validate_covariate_overlap (line 414)]: Responders and target have different covariate columns. Using 4 common variable(s): ['age_band', 'educa', 'race', 'sex']. Responder-only: ['first_treat_year', 'fpc', 'psu', 'quarter', 'state', 'stratum', 'treated', 'true_effect', 'year'], target-only: [].
INFO (2026-05-07 22:06:13,240) [ipw/ipw (line 724)]: Starting ipw function
WARNING (2026-05-07 22:06:13,240) [input_validation/choose_variables (line 485)]: Ignoring variables not present in all Samples: {'treated', 'psu', 'fpc', 'quarter', 'first_treat_year', 'stratum', 'state', 'true_effect', 'year'}
INFO (2026-05-07 22:06:13,243) [adjustment/apply_transformations (line 434)]: Adding the variables: []
INFO (2026-05-07 22:06:13,244) [adjustment/apply_transformations (line 435)]: Transforming the variables: ['age_band', 'sex', 'race', 'educa']
INFO (2026-05-07 22:06:13,251) [adjustment/apply_transformations (line 471)]: Final variables in output: ['age_band', 'sex', 'race', 'educa']
INFO (2026-05-07 22:06:13,268) [ipw/ipw (line 799)]: Building model matrix
INFO (2026-05-07 22:06:13,268) [ipw/ipw (line 800)]: The formula used to build the model matrix: ['sex + race + educa + age_band']
INFO (2026-05-07 22:06:13,269) [ipw/ipw (line 801)]: The number of columns in the model matrix: 15
INFO (2026-05-07 22:06:13,270) [ipw/ipw (line 802)]: The number of rows in the model matrix: 400
INFO (2026-05-07 22:06:23,174) [ipw/ipw (line 998)]: Done with sklearn
INFO (2026-05-07 22:06:23,175) [ipw/ipw (line 1000)]: max_de: None
INFO (2026-05-07 22:06:23,176) [ipw/ipw (line 1022)]: Starting model selection
INFO (2026-05-07 22:06:23,178) [ipw/ipw (line 1078)]: Chosen lambda: 10.0
INFO (2026-05-07 22:06:23,179) [ipw/ipw (line 1095)]: Proportion null deviance explained 0.0010621723225743285
WARNING (2026-05-07 22:06:23,180) [ipw/ipw (line 1103)]: The propensity model has low fraction null deviance explained (0.0010621723225743285). Results may not be accurate
WARNING (2026-05-07 22:06:23,183) [balance_frame/_validate_covariate_overlap (line 414)]: Responders and target have different covariate columns. Using 4 common variable(s): ['age_band', 'educa', 'race', 'sex']. Responder-only: ['first_treat_year', 'fpc', 'psu', 'quarter', 'state', 'stratum', 'treated', 'true_effect', 'year'], target-only: [].
WARNING (2026-05-07 22:06:23,229) [weighted_comparisons_stats/asmd (line 595)]: sample_df and target_df must have the same column names.
sample_df column names: ['age_band', 'educa', 'first_treat_year', 'fpc', 'psu', 'quarter', 'race[T.black]', 'race[T.hispanic]', 'race[T.other]', 'race[T.white]', 'sex[female]', 'sex[male]', 'state', 'stratum', 'treated', 'true_effect', 'year']
target_df column names: ['age_band', 'educa', 'race[T.hispanic]', 'race[T.other]', 'race[T.white]', 'sex[female]', 'sex[male]']
WARNING (2026-05-07 22:06:23,277) [weighted_comparisons_stats/asmd (line 595)]: sample_df and target_df must have the same column names.
sample_df column names: ['age_band', 'educa', 'first_treat_year', 'fpc', 'psu', 'quarter', 'race[T.black]', 'race[T.hispanic]', 'race[T.other]', 'race[T.white]', 'sex[female]', 'sex[male]', 'state', 'stratum', 'treated', 'true_effect', 'year']
target_df column names: ['age_band', 'educa', 'race[T.hispanic]', 'race[T.other]', 'race[T.white]', 'sex[female]', 'sex[male]']
WARNING (2026-05-07 22:06:23,541) [weighted_comparisons_stats/asmd (line 595)]: sample_df and target_df must have the same column names.
sample_df column names: ['age_band', 'educa', 'first_treat_year', 'fpc', 'psu', 'quarter', 'race[T.black]', 'race[T.hispanic]', 'race[T.other]', 'race[T.white]', 'sex[female]', 'sex[male]', 'state', 'stratum', 'treated', 'true_effect', 'year']
target_df column names: ['age_band', 'educa', 'race[T.hispanic]', 'race[T.other]', 'race[T.white]', 'sex[female]', 'sex[male]']
WARNING (2026-05-07 22:06:23,555) [weighted_comparisons_stats/asmd (line 595)]: sample_df and target_df must have the same column names.
sample_df column names: ['age_band', 'educa', 'first_treat_year', 'fpc', 'psu', 'quarter', 'race[T.black]', 'race[T.hispanic]', 'race[T.other]', 'race[T.white]', 'sex[female]', 'sex[male]', 'state', 'stratum', 'treated', 'true_effect', 'year']
target_df column names: ['age_band', 'educa', 'race[T.hispanic]', 'race[T.other]', 'race[T.white]', 'sex[female]', 'sex[male]']
WARNING (2026-05-07 22:06:23,584) [input_validation/choose_variables (line 485)]: Ignoring variables not present in all Samples: {'treated', 'psu', 'fpc', 'quarter', 'first_treat_year', 'stratum', 'state', 'true_effect', 'year'}
Adjustment details:
    method: ipw
    weight trimming mean ratio: 20
Covariate diagnostics:
    Covar ASMD reduction: 2.2%
    Covar ASMD (17 variables): 0.030 -> 0.029
    Covar mean KLD reduction: -0.0%
    Covar mean KLD (13 variables): 13.593 -> 13.593
Weight diagnostics:
    design effect (Deff): 1.340
    effective sample size proportion (ESSP): 0.746
    effective sample size (ESS): 261.1
Outcome weighted means:
            asthnow
source             
self          4.094
unadjusted    4.098
Model performance: Model proportion deviance explained: 0.001
In [5]:
# Love plot - the canonical covariate-balance visual, in the spirit of R's
# `cobalt::love.plot`. New in balance 0.21 (companion to `covars().plot()`
# above: that shows per-covariate distribution kdes, while `love_plot`
# shows per-covariate ASMD before-vs-after on a single sorted scatter, with
# a 0.1 reference line for the conventional "balance achieved" cutoff).
adj.covars().love_plot()
WARNING (2026-05-07 22:06:30,154) [weighted_comparisons_stats/asmd (line 595)]: sample_df and target_df must have the same column names.
sample_df column names: ['age_band', 'educa', 'first_treat_year', 'fpc', 'psu', 'quarter', 'race[T.black]', 'race[T.hispanic]', 'race[T.other]', 'race[T.white]', 'sex[female]', 'sex[male]', 'state', 'stratum', 'treated', 'true_effect', 'year']
target_df column names: ['age_band', 'educa', 'race[T.hispanic]', 'race[T.other]', 'race[T.white]', 'sex[female]', 'sex[male]']
WARNING (2026-05-07 22:06:30,199) [weighted_comparisons_stats/asmd (line 595)]: sample_df and target_df must have the same column names.
sample_df column names: ['age_band', 'educa', 'first_treat_year', 'fpc', 'psu', 'quarter', 'race[T.black]', 'race[T.hispanic]', 'race[T.other]', 'race[T.white]', 'sex[female]', 'sex[male]', 'state', 'stratum', 'treated', 'true_effect', 'year']
target_df column names: ['age_band', 'educa', 'race[T.hispanic]', 'race[T.other]', 'race[T.white]', 'sex[female]', 'sex[male]']
Out[5]:
<Axes: xlabel='ASMD', ylabel='Covariate'>
[Figure: Love plot of per-covariate ASMD before vs after adjustment, with the 0.1 reference line]

Step 3 - Aggregate microdata into a state-quarter panel¶

Modern staggered-DiD estimators (CallawaySantAnna, SunAbraham, ImputationDiD, ...) operate on a unit x period panel, not the raw microdata. We collapse the respondent-level frame to a state-year panel of weighted-mean asthma prevalence, with the survey design carried through.

diff_diff.aggregate_survey does this in two stages:

  1. First stage: a respondent-level SurveyDesign whose weights slot is the active balance weight (adj.weight_column).
  2. Second stage: a fresh SurveyDesign whose weights is the auto-generated {first_outcome}_weight column on the panel and whose psu is the geographic unit.

The adapter helper bd.to_panel_for_did(...) does both stages in one call: it strips balance's history columns first, then forwards by, outcomes, and any optional design_columns to aggregate_survey. The return is a (panel_df, second_stage_design) two-tuple ready to feed into a panel estimator.

A subtle detail about cell filtering: aggregate_survey filters out non-estimable cells using only the first outcome. For our single-outcome run this is irrelevant; for multi-outcome runs you would call aggregate_survey once per outcome.
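The heart of the first aggregation stage is a weighted cell mean plus the summed weight that becomes the second-stage analysis weight. A minimal pandas sketch of that idea (not aggregate_survey itself, which also carries SEs, ESS, and design metadata through):

```python
import pandas as pd

def collapse(df, by, y, w):
    """Weighted cell means plus summed weight - the core of the first
    aggregation stage (sketch only; aggregate_survey does much more)."""
    df = df.assign(_wy=df[y] * df[w])
    out = df.groupby(by, as_index=False).agg(
        _wy=("_wy", "sum"),
        cell_sum_w=(w, "sum"),
        cell_n=(y, "size"),
    )
    out[f"{y}_mean"] = out["_wy"] / out["cell_sum_w"]
    return out.drop(columns="_wy")

micro = pd.DataFrame({
    "state": ["A", "A", "B", "B"],
    "year": [2018] * 4,
    "asthnow": [1.0, 0.0, 1.0, 1.0],
    "w": [2.0, 1.0, 1.0, 3.0],  # balance-adjusted respondent weights
})
panel = collapse(micro, ["state", "year"], "asthnow", "w")
print(panel["asthnow_mean"].tolist())  # state A: 2/3, state B: 1.0
```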

In [6]:
# Use the adapter helper. It builds the first-stage SurveyDesign from
# adj's active weight column, drops weight_pre_adjust / weight_adjusted_*
# bookkeeping cols, and calls aggregate_survey with the right second-stage
# weight type ("pweight", required by CallawaySantAnna).
panel_df, second_stage_design = bd.to_panel_for_did(
    adj,
    by=["state", "year"],
    outcomes="asthnow",
    covariates=["age_band", "educa"],  # carried as state-year means
    second_stage_weights="pweight",
)

# Merge first-treat-year onto the panel (first_treat holds one row per
# state, so we can join on state).
first_treat = (
    df.drop_duplicates("state")[["state", "first_treat_year"]]
    .rename(columns={"first_treat_year": "g"})
)
panel_df = panel_df.merge(first_treat, on="state", how="left")
panel_df["g"] = panel_df["g"].fillna(0).astype(int)
panel_df["id"] = np.arange(len(panel_df))

print("Panel shape:", panel_df.shape)
print("Auto-generated second-stage weight column:", second_stage_design.weights)
print("Second-stage PSU:", second_stage_design.psu)
panel_df.head()
INFO (2026-05-07 22:06:30,350) [diff_diff/_resolve_design_columns (line 274)]: balance.interop.diff_diff: auto-populating SurveyDesign field 'strata' from sample.df column 'stratum' (matched the default convention name). Pass an explicit design_columns mapping (or design_columns={}) to suppress this.
INFO (2026-05-07 22:06:30,351) [diff_diff/_resolve_design_columns (line 274)]: balance.interop.diff_diff: auto-populating SurveyDesign field 'psu' from sample.df column 'psu' (matched the default convention name). Pass an explicit design_columns mapping (or design_columns={}) to suppress this.
INFO (2026-05-07 22:06:30,351) [diff_diff/_resolve_design_columns (line 274)]: balance.interop.diff_diff: auto-populating SurveyDesign field 'fpc' from sample.df column 'fpc' (matched the default convention name). Pass an explicit design_columns mapping (or design_columns={}) to suppress this.
Panel shape: (350, 15)
Auto-generated second-stage weight column: asthnow_weight
Second-stage PSU: state
/opt/hostedtoolcache/Python/3.12.13/x64/lib/python3.12/site-packages/diff_diff/prep.py:1660: UserWarning: pweight weights normalized to mean=1 (sum=350). Original sum was 50.
  full_resolved = effective_design.resolve(data)
Out[6]:
state year cell_n cell_n_eff cell_sum_w asthnow_mean asthnow_se asthnow_n asthnow_precision age_band_mean educa_mean srs_fallback asthnow_weight g id
0 0.0 2018.0 1 1.0 0.486572 -1.269343 NaN 1 NaN 0.968506 1.0 False 0.485426 0 0
1 0.0 2019.0 1 1.0 0.486572 -1.705741 NaN 1 NaN 0.968506 1.0 False 0.485426 0 1
2 0.0 2020.0 1 1.0 0.484363 0.516803 NaN 1 NaN 0.968506 1.0 False 0.485426 0 2
3 0.0 2021.0 1 1.0 0.485180 1.642508 NaN 1 NaN 0.968506 1.0 False 0.485426 0 3
4 0.0 2022.0 1 1.0 0.485752 1.087244 NaN 1 NaN 0.968506 1.0 False 0.485426 0 4

Step 4 - Callaway-Sant'Anna doubly-robust DiD¶

We now estimate ATT(g, t) - the average treatment effect on treated units in cohort g at period t - with the Callaway & Sant'Anna (2021) estimator. With estimation_method="dr" the estimator is doubly robust in the spirit of Sant'Anna & Zhao (2020) - it remains consistent if either the propensity model or the outcome regression is correctly specified.

The adapter helper bd.fit_did(...) glues this together:

  1. Builds the (second-stage) SurveyDesign from the panel weight column.
  2. Resolves the estimator class via getattr(diff_diff, "CallawaySantAnna").
  3. Splits keyword arguments by introspecting __init__ vs fit so the call surface stays sklearn-shaped.
  4. Attaches a _balance_adjustment provenance attribute to the result so downstream notebooks can trace back to the source Sample.

Survey variance comes from diff-diff's Binder (1983) Taylor-series-linearization (TSL) sandwich estimator (compute_survey_vcov).
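Step 3 of the list above - routing keyword arguments between the estimator's __init__ and its fit method - can be sketched with inspect.signature. Hypothetical estimator and helper names; the adapter's real routing may differ:

```python
import inspect

class Estimator:
    """Hypothetical sklearn-shaped estimator standing in for CallawaySantAnna."""
    def __init__(self, estimation_method="dr", control_group="never_treated"):
        self.estimation_method = estimation_method
        self.control_group = control_group

    def fit(self, data, outcome=None, time=None, unit=None):
        return (outcome, time, unit)

def split_kwargs(cls, kwargs):
    """Route each kwarg to __init__ or fit by introspecting signatures
    (a sketch of the adapter's routing idea, not its implementation)."""
    init_params = set(inspect.signature(cls.__init__).parameters) - {"self"}
    fit_params = set(inspect.signature(cls.fit).parameters) - {"self"}
    unknown = set(kwargs) - init_params - fit_params
    if unknown:
        raise TypeError(f"Unknown keyword(s): {sorted(unknown)}")
    init_kw = {k: v for k, v in kwargs.items() if k in init_params}
    fit_kw = {k: v for k, v in kwargs.items() if k in fit_params}
    return init_kw, fit_kw

init_kw, fit_kw = split_kwargs(
    Estimator,
    {"estimation_method": "dr", "outcome": "asthnow_mean", "time": "year"},
)
print(init_kw, fit_kw)  # {'estimation_method': 'dr'} {'outcome': 'asthnow_mean', 'time': 'year'}
```

This is what keeps the fit_did call surface flat: users pass one kwargs bag and never need to know which argument belongs to construction versus fitting.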

In [7]:
# Build a balance.Sample wrapping the panel so we can keep using the
# adapter (the Sample.weight_column is the second-stage weight column).
panel_sample = Sample.from_frame(
    panel_df,
    weight_column=second_stage_design.weights,
    outcome_columns=["asthnow_mean"],
)

# Run the Callaway-Sant'Anna doubly-robust ATT(g, t) estimator. Under the
# hood, fit_did wires up the survey design and the kwargs.
res = fit_did(
    panel_sample,
    estimator="CallawaySantAnna",
    outcome="asthnow_mean",
    time="year",
    unit="state",
    treatment_first="g",
    covariates=["age_band_mean", "educa_mean"],
    estimation_method="dr",
    control_group="not_yet_treated",
    base_period="universal",
    cluster="state",
    aggregate="all",
)
print(res.summary())

# Event-study plot - what the methods appendix gets in the paper.
ax = dd.plot_event_study(res)
ax.set_title("Callaway-Sant'Anna doubly-robust event study (balance-weighted)")
plt.tight_layout()
plt.show()
WARNING (2026-05-07 22:06:30,400) [input_validation/guess_id_column (line 336)]: Guessed id column name id for the data
WARNING (2026-05-07 22:06:30,401) [sample_frame/from_frame (line 280)]: Casting id column to string
WARNING (2026-05-07 22:06:30,412) [pandas_utils/_warn_of_df_dtypes_change (line 519)]: The dtypes of SampleFrame._df were changed from the original dtypes of the input df, here are the differences - 
WARNING (2026-05-07 22:06:30,413) [pandas_utils/_warn_of_df_dtypes_change (line 530)]: The (old) dtypes that changed for df (before the change):
WARNING (2026-05-07 22:06:30,414) [pandas_utils/_warn_of_df_dtypes_change (line 533)]: 
id           int64
asthnow_n    int64
cell_n       int64
g            int64
dtype: object
WARNING (2026-05-07 22:06:30,415) [pandas_utils/_warn_of_df_dtypes_change (line 534)]: The (new) dtypes saved in df (after the change):
WARNING (2026-05-07 22:06:30,417) [pandas_utils/_warn_of_df_dtypes_change (line 535)]: 
id               str
asthnow_n    float64
cell_n       float64
g            float64
dtype: object
/opt/hostedtoolcache/Python/3.12.13/x64/lib/python3.12/site-packages/diff_diff/staggered.py:2530: UserWarning: Low Events Per Variable (EPV = 5.0) in propensity score model for cohort g=2020.0. 10 minority-class observations for 2 predictor variable(s). Peduzzi et al. (1996) recommend EPV >= 10. Estimates may be unreliable (overfitting, biased coefficients, inflated standard errors). Consider estimation_method='reg' to avoid propensity scores.
  beta_logistic, pscore = solve_logit(
/opt/hostedtoolcache/Python/3.12.13/x64/lib/python3.12/site-packages/diff_diff/staggered.py:2530: UserWarning: Low Events Per Variable (EPV = 5.0) in propensity score model for cohort g=2022.0. 10 minority-class observations for 2 predictor variable(s). Peduzzi et al. (1996) recommend EPV >= 10. Estimates may be unreliable (overfitting, biased coefficients, inflated standard errors). Consider estimation_method='reg' to avoid propensity scores.
  beta_logistic, pscore = solve_logit(
/opt/hostedtoolcache/Python/3.12.13/x64/lib/python3.12/site-packages/balance/interop/diff_diff.py:691: UserWarning: Low Events Per Variable (EPV) detected in propensity score estimation for 12 of 12 cell(s). Minimum EPV = 5.0 (cohort g=2020.0). Consider estimation_method='reg' (avoids propensity scores) or reducing the number of covariates. See results.epv_summary() for details.
  results: object = instance.fit(df, **common_fit_kwargs)
=====================================================================================
            Callaway-Sant'Anna Staggered Difference-in-Differences Results           
=====================================================================================

Total observations:                   350
Treated units:                         20
Never-treated units:                   30
Treatment cohorts:                      2
Time periods:                           7
Control group:                 not_yet_treated
Base period:                    universal


-------------------------------------------------------------------------------------
                                    Survey Design                                    
-------------------------------------------------------------------------------------
Weight type:                      pweight
PSU/Cluster:                           50
Effective sample size:               37.3
Kish DEFF (weights):                 1.34
Survey d.f.:                           49
-------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------
                   Overall Average Treatment Effect on the Treated                   
-------------------------------------------------------------------------------------
Parameter           Estimate    Std. Err.     t-stat      P>|t|   Sig.
-------------------------------------------------------------------------------------
ATT                  -0.6386       0.3728     -1.713     0.0930      .
-------------------------------------------------------------------------------------

95% Confidence Interval: [-1.3877, 0.1105]
CV (SE/|ATT|):                0.5837

-------------------------------------------------------------------------------------
                             Propensity Score Diagnostics                            
-------------------------------------------------------------------------------------
WARNING: Low Events Per Variable (EPV) in 12 of 12 cohort-time cell(s).
Minimum EPV: 5.0 (cohort g=2020.0). Threshold: 10.
Consider: estimation_method='reg' or fewer covariates.
Call results.epv_summary() for per-cohort details.
-------------------------------------------------------------------------------------

-------------------------------------------------------------------------------------
                            Event Study (Dynamic) Effects                            
-------------------------------------------------------------------------------------
Rel. Period         Estimate    Std. Err.     t-stat      P>|t|   Sig.
-------------------------------------------------------------------------------------
-4.0                  0.7171       0.7156      1.002     0.3212       
-3.0                  0.0071       0.5070      0.014     0.9889       
-2.0                 -0.1907       0.3837     -0.497     0.6214       
-1                    0.0000          nan        nan        nan       
0.0                  -0.7104       0.4625     -1.536     0.1310       
1.0                  -0.0423       0.4882     -0.087     0.9313       
2.0                  -0.5983       0.4668     -1.282     0.2059       
3.0                  -0.8808       0.5401     -1.631     0.1094       
4.0                  -1.6014       0.6719     -2.383     0.0211      *
-------------------------------------------------------------------------------------

-------------------------------------------------------------------------------------
                             Effects by Treatment Cohort                             
-------------------------------------------------------------------------------------
Cohort              Estimate    Std. Err.     t-stat      P>|t|   Sig.
-------------------------------------------------------------------------------------
2020.0               -0.9888       0.4707     -2.101     0.0408      *
2022.0               -0.1238       0.4719     -0.262     0.7942       
-------------------------------------------------------------------------------------

Signif. codes: '***' 0.001, '**' 0.01, '*' 0.05, '.' 0.1
=====================================================================================
[Figure: event-study plot of the dynamic effects above, rendered inline in the notebook]
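The EPV warning in the summary above is simple arithmetic, and it is worth being able to reproduce it by hand when triaging a real fit. A minimal sketch follows; `events_per_variable` is a hypothetical helper written for this notebook, not part of balance or diff-diff, and the figures come from the warning text (10 minority-class observations, 2 predictors):

```python
def events_per_variable(n_minority: int, n_predictors: int) -> float:
    """Events Per Variable for a propensity-score logit: the count of
    minority-class observations of the binary treatment indicator divided
    by the number of predictors. Peduzzi et al. (1996) suggest EPV >= 10."""
    return n_minority / n_predictors

# Figures from the warning above: 10 minority-class observations and
# 2 covariates per cohort-time cell -> EPV = 5.0, below the threshold of 10.
print(events_per_variable(10, 2))  # 5.0
```

When EPV is this low, the logit behind `estimation_method="dr"` is prone to overfitting, which is exactly why the warning suggests `estimation_method='reg'` (no propensity scores) or trimming covariates.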

Step 5 - HonestDiD sensitivity to parallel-trends violation¶

Parallel trends is the identifying assumption of any DiD design. Even with a clean event-study plot, the legislator's natural question is "what if the trends weren't exactly parallel?" - and the right answer is the HonestDiD framework (Rambachan & Roth 2023, surveyed in Roth, Sant'Anna, Bilinski & Poe 2023): it returns robust confidence intervals under bounded deviations from parallel trends, parameterised here by the relative-magnitudes bound M-bar (post-treatment violations at most M-bar times the largest pre-treatment violation).

diff_diff.compute_honest_did (re-exported from diff_diff.honestdid) takes a fitted Callaway-Sant'Anna result and a vector of bounds, and returns CIs that grow as M-bar increases. The corresponding plot is the canonical "sensitivity sleeve" chart.

In [8]:
# HonestDiD relative-magnitudes bound at M=1.0 - the canonical "post-treatment
# violations no larger than the worst pre-period violation" sensitivity check
# (Rambachan & Roth 2023). compute_honest_did accepts a scalar M and returns a
# HonestDiDResults with the robust CI; for a multi-M sensitivity sleeve
# you'd build SensitivityResults manually and pass it to dd.plot_sensitivity.
honest = compute_honest_did(
    res,
    method="relative_magnitude",
    M=1.0,
)

print("HonestDiD sensitivity (Roth-Sant'Anna-Bilinski-Poe 2023):")
print(f"  Method:      relative_magnitude, M=1.0")
print(f"  Robust CI:   [{honest.ci_lb:.4f}, {honest.ci_ub:.4f}]")
print(f"  Original ATT CI: see CallawaySantAnna summary in the previous cell.")
HonestDiD sensitivity (Roth-Sant'Anna-Bilinski-Poe 2023):
  Method:      relative_magnitude, M=1.0
  Robust CI:   [-3.6616, 2.1283]
  Original ATT CI: see CallawaySantAnna summary in the previous cell.

Step 6 - Combined diagnostic report¶

bd.as_balance_diagnostic(sample, did_results) joins the pre-fit diagnostics balance owns (ASMD pre/post, Kish ESS, design effect) with the post-fit diagnostics diff-diff owns (SurveyMetadata.effective_n, design_effect, sum_weights; DEFFDiagnostics per coefficient when present). This is the single dict you tabulate in the methods appendix.

Missing fields return None rather than raising - the adapter never blocks the user's notebook with a KeyError.
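The two weight diagnostics in that dict (`balance_kish_ess`, `balance_design_effect`) are the standard Kish formulas. A minimal sketch of the arithmetic (an assumption: this mirrors the textbook definitions, not balance's exact implementation):

```python
def kish_ess(weights):
    """Kish effective sample size: n_eff = (sum w)^2 / sum(w^2)."""
    s1 = sum(weights)
    s2 = sum(w * w for w in weights)
    return s1 * s1 / s2

def kish_deff(weights):
    """Kish design effect of the weights: DEFF = n / n_eff,
    equivalently 1 + CV^2 of the weights. DEFF = 1 iff all weights equal."""
    return len(weights) / kish_ess(weights)

w = [1.0, 1.0, 2.0, 0.5, 1.5]
print(round(kish_ess(w), 3), round(kish_deff(w), 3))  # 4.235 1.181
```

Note the two design effects printed in this notebook (one from balance, one from diff-diff's `SurveyMetadata`) agree to four decimals because they are computed from the same weight column at different stages of the pipeline.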

In [9]:
diag = as_balance_diagnostic(adj, res)

# Display as a one-row pandas DataFrame for easy copy-paste into a methods
# appendix. None entries surface as NaN in the printed table.
diag_df = pd.DataFrame([diag]).T.rename(columns={0: "value"})
print(diag_df)
INFO (2026-05-07 22:06:30,609) [balance_frame/diagnostics (line 3209)]: Starting computation of diagnostics of the fitting
WARNING (2026-05-07 22:06:30,661) [weighted_comparisons_stats/asmd (line 595)]: sample_df and target_df must have the same column names.
sample_df column names: ['age_band', 'educa', 'first_treat_year', 'fpc', 'psu', 'quarter', 'race[T.black]', 'race[T.hispanic]', 'race[T.other]', 'race[T.white]', 'sex[female]', 'sex[male]', 'state', 'stratum', 'treated', 'true_effect', 'year']
target_df column names: ['age_band', 'educa', 'race[T.hispanic]', 'race[T.other]', 'race[T.white]', 'sex[female]', 'sex[male]']
INFO (2026-05-07 22:06:30,840) [balance_frame/diagnostics (line 3236)]: Done computing diagnostics
                              value
att                       -0.638612
se                             None
conf_int                       None
n_obs                           350
diff_diff_design_effect    1.340294
diff_diff_effective_n     37.305252
diff_diff_sum_weights          50.0
balance_kish_ess         261.131495
balance_design_effect      1.340321
balance_asmd_mean_post     0.029234
balance_asmd_max_post      0.176139

Step 7 - Contrast: what happens if we skip the balance step?¶

A natural reviewer question is: "how much does the IPW reweighting actually matter?" To answer it, we re-run Callaway-Sant'Anna on the panel built only from BRFSS' design weights - no ACS reweighting, no balance.

This is the ablation cell. How far the no-balance ATT drifts depends on the strength of informative_sampling / weight_variation configured in Cell 5: with strongly informative sampling the unweighted estimate is attenuated toward zero or wrong-signed, while a mildly informative configuration (like the default one here) leaves the two estimates close together. Either way, the side-by-side contrast is the single most persuasive chart for an external audience - it makes "reweight before you DiD" a visible methodological choice rather than an abstract recommendation.

Note that we are not going through the adapter here, because the adapter assumes a balance.Sample with an active weight column. We build the panel with diff-diff's aggregate_survey directly, using the raw design_weight column.

In [10]:
# Build a fresh panel from the *raw* design weights (no balance step).
unweighted_panel, unweighted_design = aggregate_survey(
    df,
    by=["state", "year"],
    outcomes="asthnow",
    survey_design=SurveyDesign(weights="design_weight", weight_type="pweight"),
    covariates=["age_band", "educa"],
    second_stage_weights="pweight",
)
unweighted_panel = unweighted_panel.merge(first_treat, on="state", how="left")
unweighted_panel["g"] = unweighted_panel["g"].fillna(0).astype(int)

# Direct CallawaySantAnna call (no adapter needed since we're not coming
# from a balance.Sample). NOTE: the Step 4 call goes through `fit_did`,
# which introspects the estimator's __init__ vs fit() signatures and
# routes `base_period="universal"` / `cluster="state"` to whichever slot
# accepts each. Replicating that routing manually here is fragile against
# upstream diff-diff API drift -- an earlier ablation-parity attempt that
# pinned `base_period` to __init__ and `cluster` to fit() broke the
# notebook CI execute step on a diff-diff version where the placement
# differs. So we keep this direct call to a minimal, robust subset
# (estimation_method, control_group, aggregate, plus the panel kwargs
# and the survey_design) and let small default-parameter differences
# show up in the printed delta. For pixel-perfect parameter parity in
# an ablation, wrap `unweighted_panel` in a balance.Sample and route
# through `fit_did` exactly as Step 4 does -- that path inherits the
# same signature-introspection and is robust across diff-diff releases.
res_unweighted = CallawaySantAnna(
    estimation_method="dr",
    control_group="not_yet_treated",
).fit(
    unweighted_panel,
    outcome="asthnow_mean",
    time="year",
    unit="state",
    first_treat="g",
    covariates=["age_band_mean", "educa_mean"],
    aggregate="all",
    survey_design=SurveyDesign(
        weights=unweighted_design.weights,
        psu="state",
        weight_type="pweight",
    ),
)

# Side-by-side comparison. The printed delta below is *primarily* the
# effect of the balance reweighting step, but it also includes minor
# default-parameter differences (Step 4 sets base_period="universal"
# and cluster="state" via fit_did, the direct call above does not).
# See the comment block above the CallawaySantAnna call for the rationale
# and for the apples-to-apples-via-fit_did alternative.
print("ATT (balance-weighted) :", res.overall_att)
print("ATT (no balance step)  :", res_unweighted.overall_att)
print(
    "Difference (mostly balance reweighting; also includes minor "
    "estimator-default differences -- see comment above):",
    res.overall_att - res_unweighted.overall_att,
)
/opt/hostedtoolcache/Python/3.12.13/x64/lib/python3.12/site-packages/diff_diff/prep.py:1660: UserWarning: pweight weights normalized to mean=1 (sum=350). Original sum was 525.
  full_resolved = effective_design.resolve(data)
/opt/hostedtoolcache/Python/3.12.13/x64/lib/python3.12/site-packages/diff_diff/staggered.py:2530: UserWarning: Low Events Per Variable (EPV = 5.0) in propensity score model for cohort g=2020. 10 minority-class observations for 2 predictor variable(s). Peduzzi et al. (1996) recommend EPV >= 10. Estimates may be unreliable (overfitting, biased coefficients, inflated standard errors). Consider estimation_method='reg' to avoid propensity scores.
  beta_logistic, pscore = solve_logit(
ATT (balance-weighted) : -0.6386118057757502
ATT (no balance step)  : -0.6389086149015202
Difference (mostly balance reweighting; also includes minor estimator-default differences -- see comment above): 0.0002968091257700145
/opt/hostedtoolcache/Python/3.12.13/x64/lib/python3.12/site-packages/diff_diff/staggered.py:2530: UserWarning: Low Events Per Variable (EPV = 5.0) in propensity score model for cohort g=2022. 10 minority-class observations for 2 predictor variable(s). Peduzzi et al. (1996) recommend EPV >= 10. Estimates may be unreliable (overfitting, biased coefficients, inflated standard errors). Consider estimation_method='reg' to avoid propensity scores.
  beta_logistic, pscore = solve_logit(
/tmp/ipykernel_2919/2787343228.py:31: UserWarning: Low Events Per Variable (EPV) detected in propensity score estimation for 12 of 12 cell(s). Minimum EPV = 5.0 (cohort g=2020). Consider estimation_method='reg' (avoids propensity scores) or reducing the number of covariates. See results.epv_summary() for details.
  ).fit(
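The `pweight weights normalized to mean=1` warning in the output above is a harmless mean-one rescaling: the printed sums (525 before, 350 after) match the 350 rows. A minimal sketch of that arithmetic (an assumption: standard mean-1 normalization, w_i' = w_i * n / sum(w), not diff-diff's exact code path):

```python
def normalize_mean_one(weights):
    """Rescale weights so they average 1 (sum equals n). Relative weights
    are preserved, so weighted means and point estimates are unchanged;
    only the nominal scale of the weights moves."""
    n = len(weights)
    total = sum(weights)
    return [w * n / total for w in weights]

w = [1.5] * 350             # any weight vector summing to 525 over 350 rows
w_norm = normalize_mean_one(w)
print(sum(w), sum(w_norm))  # 525.0 350.0
```

Because only the scale changes, the warning affects neither the ATT nor (for pweights) its standard error; it exists so that downstream variance formulas can treat the weight sum as the sample size.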

Discussion¶

What this notebook demonstrated:

  • A non-probability-ish microdata frame (BRFSS shape, declining response rate) is reweighted to ACS demographics with three lines of balance code (Cell 9).
  • The reweighted Sample is handed to diff-diff via a single bd.to_panel_for_did call that hides the weight_pre_adjust bookkeeping, hard-codes weight_type="pweight" (required by CallawaySantAnna), and returns a ready-to-fit panel + second-stage SurveyDesign.
  • Callaway-Sant'Anna doubly-robust ATT(g, t) is fit in one fit_did call.
  • Sensitivity to parallel-trends violation is one line (compute_honest_did).
  • The combined balance + diff-diff diagnostic dict is one line (as_balance_diagnostic).
  • The ablation (Cell 19) quantifies what skipping the balance step costs: negligible under this notebook's default synthetic configuration, but growing with informative_sampling - either way it makes "reweight before you DiD" a visible decision rather than an abstract recommendation.

What to try next:

  1. Real BRFSS data: replace Cell 5 with pyreadstat.read_xport(...) calls per year and concatenate. The rest of the notebook works unchanged.
  2. Other estimators: swap estimator="CallawaySantAnna" in Cell 13 for "SunAbraham", "ImputationDiD" (BJS), "WooldridgeDiD" (ETWFE), or "EfficientDiD". The adapter resolves any name in diff_diff/__init__.py.
  3. Continuous treatment: the same pipeline works for dd.ContinuousDiD if your treatment is dose-of-policy rather than on/off.
  4. Different reweighting method: try method="cbps" or method="rake" in Cell 9 - the integration story is the same.
  5. Cross-language replication: the BRFSS smoking-ban DiD has a natural R counterpart in survey::svydesign + did::att_gt. The numeric agreement of the two pipelines is the validation hook for a JSS / Epidemiology-Methods note.

References¶

Methodology¶

  • Sant'Anna, P. H. C., & Zhao, J. (2020). Doubly robust difference-in-differences estimators. Journal of Econometrics, 219(1), 101-122. - the doubly-robust DiD that diff-diff's estimation_method="dr" operationalises.
  • Callaway, B., & Sant'Anna, P. H. C. (2021). Difference-in-differences with multiple time periods. Journal of Econometrics, 225(2), 200-230. - the ATT(g, t) estimator behind CallawaySantAnna.
  • Roth, J., Sant'Anna, P. H. C., Bilinski, A., & Poe, J. (2023). What's trending in difference-in-differences? A synthesis of the recent econometrics literature. Journal of Econometrics, 235(2), 2218-2244. - review piece situating the estimators and sensitivity tools used here.
  • Rambachan, A., & Roth, J. (2023). A more credible approach to parallel trends. Review of Economic Studies, 90(5), 2555-2591. - the HonestDiD bounds (smoothness and relative magnitudes) behind the Step 5 sensitivity analysis.
  • Bruns-Smith, D., Dukes, O., Feller, A., & Ogburn, E. L. (2023). Augmented balancing weights as linear regression. arXiv 2304.14545. - modern view of IPW + outcome-regression coupling that frames why the doubly-robust DiD-with-balance pipeline is the right object.
  • Sarig, T., Galili, T., & Eilat, R. (2023). balance - a Python package for balancing biased data samples. arXiv 2307.06024. - the balance package paper.
  • Ghandour, K., & Reece, A. (2025). diff-diff: Modern Difference-in-Differences in Python. Zenodo, DOI 10.5281/zenodo.19646175. - the diff-diff package citation.

Data¶

  • CDC Behavioral Risk Factor Surveillance System (BRFSS) Annual Data, 2024. - the public-use file this notebook's synthetic frame mirrors.
  • Census ACS 1-year PUMS, 2018-2024. - the demographic-marginal target frame.

Related tutorials¶

  • balance_quickstart - the standard introduction to balance.
  • balance_quickstart_cbps - same workflow with CBPS instead of IPW.
  • balance_transformations_and_formulas - how to control covariate transformations in adjust().