balance Quickstart (post-stratify): Matching known cell totals

This notebook demonstrates how to apply post-stratification with the balance package. We start by matching a single marginal distribution and then show how the same function can match the joint distribution of two variables without falling back to raking.

1. Load simulated data

The helper function balance.load_data returns a pair of simulated datasets: the target population and a biased sample. This tutorial uses unit design weights (every row gets weight 1) so that weighted sums can be read as counts.

In [1]:
from balance import load_data
from balance.weighting_methods.poststratify import poststratify
import pandas as pd

target_df, sample_df = load_data()
target_df.head()
INFO (2026-04-18 20:45:18,634) [__init__/<module> (line 75)]: Using balance version 0.19.0
INFO (2026-04-18 20:45:18,635) [__init__/<module> (line 80)]: 
balance (Version 0.19.0) loaded:
    📖 Documentation: https://import-balance.org/
    🛠️ Help / Issues: https://github.com/facebookresearch/balance/issues/
    📄 Citation:
        Sarig, T., Galili, T., & Eilat, R. (2023).
        balance - a Python package for balancing biased data samples.
        https://arxiv.org/abs/2307.06024

    Tip: You can view this message anytime with balance.help()

Out[1]:
       id gender age_group     income  happiness
0  100000   Male       45+  10.183951  61.706333
1  100001   Male       45+   6.036858  79.123670
2  100002   Male     35-44   5.226629  44.206949
3  100003    NaN       45+   5.752147  83.985716
4  100004    NaN     25-34   4.837484  49.339713

2. Post-stratify on a single variable

We first adjust the sample so that its gender distribution matches the target population. Rows with missing gender are dropped to satisfy the default strict_matching=True requirement that every sample cell be present in the target.

In [2]:
sample_gender = sample_df.dropna(subset=["gender"])
target_gender = target_df.dropna(subset=["gender"])

gender_result = poststratify(
    sample_df=sample_gender[["gender"]],
    sample_weights=pd.Series(1, index=sample_gender.index),
    target_df=target_gender[["gender"]],
    target_weights=pd.Series(1, index=target_gender.index),
)

gender_weights = sample_gender.assign(weight=gender_result["weight"])
gender_summary = pd.concat(
    [
        gender_weights.groupby("gender")["weight"].sum().rename("weighted_sample"),
        target_gender.groupby("gender").size().rename("target_population"),
    ],
    axis=1,
)
gender_summary
INFO (2026-04-18 20:45:18,671) [adjustment/apply_transformations (line 433)]: Adding the variables: []
INFO (2026-04-18 20:45:18,671) [adjustment/apply_transformations (line 434)]: Transforming the variables: ['gender']
INFO (2026-04-18 20:45:18,674) [adjustment/apply_transformations (line 469)]: Final variables in output: ['gender']
Out[2]:
        weighted_sample  target_population
gender
Female           4551.0               4551
Male             4551.0               4551

The weighted sample counts now reproduce the target population counts. Dividing by the column totals would show that the sample proportions also match the target proportions.
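The arithmetic behind this result can be sketched with plain pandas. The toy frames and variable names below are hypothetical and are not the simulated balance datasets; the sketch only mirrors the idea of scaling each cell by target total over sample total and is not a description of how poststratify is implemented internally.

```python
import pandas as pd

# Toy data (hypothetical, not the balance simulated datasets).
toy_sample = pd.DataFrame({"gender": ["Female", "Male", "Male", "Male"]})
toy_target = pd.DataFrame({"gender": ["Female"] * 6 + ["Male"] * 4})

# With unit design weights, cell totals are plain row counts.
sample_totals = toy_sample.groupby("gender").size()
target_totals = toy_target.groupby("gender").size()

# Post-stratification factor per cell: target total / sample total.
cell_factor = target_totals / sample_totals

# Each sample row receives the factor of the cell it belongs to.
toy_sample["weight"] = toy_sample["gender"].map(cell_factor)

# Weighted sample totals per cell, and the corresponding proportions.
weighted_totals = toy_sample.groupby("gender")["weight"].sum()
weighted_share = weighted_totals / weighted_totals.sum()
target_share = target_totals / target_totals.sum()

print(pd.concat([weighted_share.rename("sample"), target_share.rename("target")], axis=1))
```

Because every cell is rescaled to its target total, the weighted totals and the weighted proportions agree with the target by construction.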

3. Post-stratify on the joint distribution of two variables

Post-stratification can use multiple variables simultaneously. In that case the function computes a weight per cell defined by the unique combination of those variables. This is different from raking, which iteratively matches the marginals of each variable.
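To make the contrast with raking concrete, here is a minimal NumPy sketch on a hypothetical 2x2 table of cell counts. The raking loop is a bare-bones iterative proportional fitting written for illustration only; it is not the raking routine shipped with balance, and all numbers are made up.

```python
import numpy as np

# Hypothetical cell counts: rows = gender, columns = age group.
sample_counts = np.array([[10.0, 10.0], [10.0, 10.0]])
target_counts = np.array([[40.0, 10.0], [10.0, 40.0]])

# Post-stratification: one factor per joint cell.
ps_factor = target_counts / sample_counts
ps_weighted = sample_counts * ps_factor

# Raking: alternately rescale rows and columns so that the
# marginal totals (not the individual cells) match the target.
raked = sample_counts.copy()
row_target = target_counts.sum(axis=1)
col_target = target_counts.sum(axis=0)
for _ in range(50):
    raked *= (row_target / raked.sum(axis=1))[:, None]
    raked *= (col_target / raked.sum(axis=0))[None, :]

print("post-stratified cells:\n", ps_weighted)
print("raked cells:\n", raked)
```

In this toy table both methods reproduce the row and column totals, but only the per-cell factors reproduce the individual joint cells. Raking leaves the association between the two variables as it was in the sample.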

In [3]:
covariates = ["gender", "age_group"]
sample_cells = sample_df.dropna(subset=covariates)
target_cells = target_df.dropna(subset=covariates)

joint_result = poststratify(
    sample_df=sample_cells[covariates],
    sample_weights=pd.Series(1, index=sample_cells.index),
    target_df=target_cells[covariates],
    target_weights=pd.Series(1, index=target_cells.index),
)

joint_weights = sample_cells.assign(weight=joint_result["weight"])
joint_summary = pd.concat(
    [
        joint_weights.groupby(covariates)["weight"].sum().rename("weighted_sample"),
        target_cells.groupby(covariates).size().rename("target_population"),
    ],
    axis=1,
)
joint_summary
INFO (2026-04-18 20:45:18,699) [adjustment/apply_transformations (line 433)]: Adding the variables: []
INFO (2026-04-18 20:45:18,700) [adjustment/apply_transformations (line 434)]: Transforming the variables: ['gender', 'age_group']
INFO (2026-04-18 20:45:18,705) [adjustment/apply_transformations (line 469)]: Final variables in output: ['gender', 'age_group']
Out[3]:
                  weighted_sample  target_population
gender age_group
Female 18-24                876.0                876
       25-34               1360.0               1360
       35-44               1370.0               1370
       45+                  945.0                945
Male   18-24                905.0                905
       25-34               1355.0               1355
       35-44               1347.0               1347
       45+                  944.0                944

Each row corresponds to a unique combination of gender and age group. The equality between the two columns confirms that poststratify matched the full joint distribution without reverting to raking.
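Since strict matching requires every sample cell to exist in the target, it can help to check cell coverage before calling poststratify. The helper below, cells_missing_from_target, is a hypothetical name introduced for this sketch and is not part of balance; it relies only on pandas and toy data.

```python
import pandas as pd


def cells_missing_from_target(sample, target, variables):
    """Return sample cells (unique variable combinations) absent from the target.

    Hypothetical helper for illustration; not part of the balance API.
    """
    sample_cells = sample[variables].drop_duplicates()
    target_cells = target[variables].drop_duplicates()
    # A left merge with indicator=True flags sample cells without a target match.
    merged = sample_cells.merge(
        target_cells, on=variables, how="left", indicator=True
    )
    missing = merged.loc[merged["_merge"] == "left_only", variables]
    return missing.reset_index(drop=True)


# Toy frames: the ("Male", "45+") cell exists only in the sample.
toy_sample = pd.DataFrame(
    {"gender": ["Female", "Male", "Male"], "age_group": ["18-24", "18-24", "45+"]}
)
toy_target = pd.DataFrame(
    {"gender": ["Female", "Male"], "age_group": ["18-24", "18-24"]}
)

missing = cells_missing_from_target(toy_sample, toy_target, ["gender", "age_group"])
print(missing)
```

An empty result means every sample cell has a counterpart in the target for the chosen variables.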

Future extensions (TODO)

  • Post-stratification (PS) via Sample.adjust(): Show the high-level API (sample.adjust(target, method="poststratify")) instead of calling poststratify() directly.
  • Chained adjustments: Demonstrate IPW → poststratify (two-stage adjustment) with ASMD comparison at each stage. Example from balance notebook v03:
    sample_with_target = sample.set_target(target)
    adjust_stage_1 = sample_with_target.adjust(method="ipw")
    adjust_stage_2 = adjust_stage_1.adjust(method="poststratify")
    adjust_stage_2.covars().asmd().T
    adjust_stage_2.outcomes().plot()
    
  • Controlling PS variables via transformations: Show how transformations={"age_group": lambda x: x} limits PS cells to only the named column. Example from balance notebook v03:
    transformations = {"age_group": lambda x: x}
    adjusted = ipw_adjusted.adjust(method="poststratify", transformations=transformations)
    adjusted.covars().asmd().T
    
  • formula= does NOT work with poststratify yet: The following is silently ignored (formula goes into **kwargs but is never read). Show the working alternative:
    # Does NOT work as intended:
    adjusted = ipw_adjusted.adjust(method="poststratify", formula=["gender"])
    # Use instead:
    adjusted = ipw_adjusted.adjust(method="poststratify", variables=["gender"])
    
  • ASMD diagnostics: Show adjusted.covars().asmd().T and asmd_improvement() for poststratified samples.
  • Plotting: Add adjusted.covars().plot() and adjusted.outcomes().plot() examples.
  • PS on binned numeric variables: Demonstrate using transformations (e.g., pd.cut) to bin continuous variables before PS.
  • Strict vs non-strict matching: Show what happens when sample has cells not in target (strict_matching=False).
  • Weight trimming: Demonstrate weight_trimming_mean_ratio with poststratify.
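As a starting point for the binned-numeric item in the list above, the following sketch bins a continuous variable with pd.cut and then computes per-bin post-stratification factors by hand. The bin edges, labels, and toy values are assumptions chosen for illustration, and the sketch deliberately uses plain pandas instead of the transformations argument of balance.

```python
import pandas as pd

# Toy income values (hypothetical).
toy_sample = pd.DataFrame({"income": [1.0, 2.5, 7.0, 12.0]})
toy_target = pd.DataFrame({"income": [0.5, 3.0, 4.0, 6.0, 8.0, 15.0]})

# Shared bin edges so that sample and target use identical cells.
edges = [0, 5, 10, float("inf")]
labels = ["low", "mid", "high"]
for frame in (toy_sample, toy_target):
    # Convert to plain strings to keep the later groupby/map simple.
    frame["income_bin"] = pd.cut(frame["income"], bins=edges, labels=labels).astype(str)

# Cell totals under unit design weights.
sample_totals = toy_sample.groupby("income_bin").size()
target_totals = toy_target.groupby("income_bin").size()

# Per-bin factor: target total / sample total, mapped back onto sample rows.
toy_sample["weight"] = toy_sample["income_bin"].map(target_totals / sample_totals)

print(toy_sample)
```

Wider bins give fewer, better-populated cells at the cost of a coarser match, so the choice of edges is a modelling decision.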