balance Quickstart (post-stratify): Matching known cell totals
This notebook demonstrates how to apply post-stratification with the balance package. We start by matching a single marginal distribution and then show how the same function can match the joint distribution of two variables without falling back to raking.
1. Load simulated data
The helper function balance.load_data returns a pair of simulated datasets: the target population and a biased sample. We'll use unit design weights in this tutorial so the weighted sums can be interpreted as counts.
from balance import load_data
from balance.weighting_methods.poststratify import poststratify
import pandas as pd
target_df, sample_df = load_data()
target_df.head()
INFO (2026-04-18 20:45:18,634) [__init__/<module> (line 75)]: Using balance version 0.19.0
INFO (2026-04-18 20:45:18,635) [__init__/<module> (line 80)]:
balance (Version 0.19.0) loaded:
📖 Documentation: https://import-balance.org/
🛠️ Help / Issues: https://github.com/facebookresearch/balance/issues/
📄 Citation:
Sarig, T., Galili, T., & Eilat, R. (2023).
balance - a Python package for balancing biased data samples.
https://arxiv.org/abs/2307.06024
Tip: You can view this message anytime with balance.help()
| | id | gender | age_group | income | happiness |
|---|---|---|---|---|---|
| 0 | 100000 | Male | 45+ | 10.183951 | 61.706333 |
| 1 | 100001 | Male | 45+ | 6.036858 | 79.123670 |
| 2 | 100002 | Male | 35-44 | 5.226629 | 44.206949 |
| 3 | 100003 | NaN | 45+ | 5.752147 | 83.985716 |
| 4 | 100004 | NaN | 25-34 | 4.837484 | 49.339713 |
2. Post-stratify on a single variable
We first adjust the sample so that its gender distribution matches the target population. Rows with missing gender are dropped to satisfy the default strict_matching=True requirement that every sample cell be present in the target.
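The condition described above (every sample cell must also exist in the target) can be checked by hand before calling poststratify. The following is a minimal pandas-only sketch on hypothetical toy frames; it is not balance's internal validation, and the "Other" category is invented purely to trigger a mismatch.

```python
import pandas as pd

# Hypothetical toy frames; the sample contains an "Other" cell the target lacks.
sample = pd.DataFrame({"gender": ["Female", "Male", "Other", None]})
target = pd.DataFrame({"gender": ["Female", "Male", "Male", None]})

# Drop rows with a missing cell variable, mirroring the tutorial's dropna step.
sample_clean = sample.dropna(subset=["gender"])
target_clean = target.dropna(subset=["gender"])

# Cells that appear in the sample but not in the target.
sample_cells = set(sample_clean["gender"].unique())
target_cells = set(target_clean["gender"].unique())
missing_from_target = sample_cells - target_cells
print(missing_from_target)
```

An empty set means every sample cell has a counterpart in the target.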
sample_gender = sample_df.dropna(subset=["gender"])
target_gender = target_df.dropna(subset=["gender"])
gender_result = poststratify(
    sample_df=sample_gender[["gender"]],
    sample_weights=pd.Series(1, index=sample_gender.index),
    target_df=target_gender[["gender"]],
    target_weights=pd.Series(1, index=target_gender.index),
)
gender_weights = sample_gender.assign(weight=gender_result["weight"])
gender_summary = pd.concat(
    [
        gender_weights.groupby("gender")["weight"].sum().rename("weighted_sample"),
        target_gender.groupby("gender").size().rename("target_population"),
    ],
    axis=1,
)
gender_summary
INFO (2026-04-18 20:45:18,671) [adjustment/apply_transformations (line 433)]: Adding the variables: []
INFO (2026-04-18 20:45:18,671) [adjustment/apply_transformations (line 434)]: Transforming the variables: ['gender']
INFO (2026-04-18 20:45:18,674) [adjustment/apply_transformations (line 469)]: Final variables in output: ['gender']
| gender | weighted_sample | target_population |
|---|---|---|
| Female | 4551.0 | 4551 |
| Male | 4551.0 | 4551 |
The weighted sample counts now reproduce the target population counts. Dividing by the column totals would show that the sample proportions also match the target proportions.
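The division by column totals mentioned above can be sketched as follows. To keep the snippet runnable on its own, it rebuilds the summary table from the counts printed in the output above instead of reusing the notebook's gender_summary object.

```python
import pandas as pd

# Counts copied from the gender summary output shown above.
gender_summary = pd.DataFrame(
    {"weighted_sample": [4551.0, 4551.0], "target_population": [4551, 4551]},
    index=pd.Index(["Female", "Male"], name="gender"),
)

# Divide each column by its own total to turn counts into proportions.
gender_proportions = gender_summary / gender_summary.sum()
print(gender_proportions)
```

In the notebook itself, the same expression can be applied directly to gender_summary.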
3. Post-stratify on the joint distribution of two variables
Post-stratification can use multiple variables simultaneously. In that case the function computes a weight per cell defined by the unique combination of those variables. This is different from raking, which iteratively matches the marginals of each variable.
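To make the per-cell weight concrete, here is a minimal hand computation in pandas. It assumes unit design weights and uses hypothetical toy data; it illustrates the idea (target cell total divided by sample cell total) and is not the code balance runs internally.

```python
import pandas as pd

cells = ["gender", "age_group"]

# Hypothetical toy data, not the simulated datasets from load_data().
sample = pd.DataFrame(
    {
        "gender": ["Female", "Female", "Male", "Male", "Male"],
        "age_group": ["18-24", "25-34", "18-24", "18-24", "25-34"],
    }
)
target = pd.DataFrame(
    {
        "gender": ["Female"] * 4 + ["Male"] * 6,
        "age_group": ["18-24", "18-24", "25-34", "25-34"]
        + ["18-24", "18-24", "18-24", "25-34", "25-34", "25-34"],
    }
)

# With unit design weights, each cell total is simply a row count.
sample_totals = sample.groupby(cells).size()
target_totals = target.groupby(cells).size()

# Every sample row in a cell receives the same weight:
# target cell total / sample cell total.
cell_weight = (target_totals / sample_totals).rename("weight")
weighted = sample.join(cell_weight, on=cells)

# The weighted sample reproduces the target count in every cell.
check = weighted.groupby(cells)["weight"].sum()
print(check)
```

Because the weight is defined per cell, the match holds for the joint distribution and therefore for each marginal as well.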
covariates = ["gender", "age_group"]
sample_cells = sample_df.dropna(subset=covariates)
target_cells = target_df.dropna(subset=covariates)
joint_result = poststratify(
    sample_df=sample_cells[covariates],
    sample_weights=pd.Series(1, index=sample_cells.index),
    target_df=target_cells[covariates],
    target_weights=pd.Series(1, index=target_cells.index),
)
joint_weights = sample_cells.assign(weight=joint_result["weight"])
joint_summary = pd.concat(
    [
        joint_weights.groupby(covariates)["weight"].sum().rename("weighted_sample"),
        target_cells.groupby(covariates).size().rename("target_population"),
    ],
    axis=1,
)
joint_summary
INFO (2026-04-18 20:45:18,699) [adjustment/apply_transformations (line 433)]: Adding the variables: []
INFO (2026-04-18 20:45:18,700) [adjustment/apply_transformations (line 434)]: Transforming the variables: ['gender', 'age_group']
INFO (2026-04-18 20:45:18,705) [adjustment/apply_transformations (line 469)]: Final variables in output: ['gender', 'age_group']
| gender | age_group | weighted_sample | target_population |
|---|---|---|---|
| Female | 18-24 | 876.0 | 876 |
| Female | 25-34 | 1360.0 | 1360 |
| Female | 35-44 | 1370.0 | 1370 |
| Female | 45+ | 945.0 | 945 |
| Male | 18-24 | 905.0 | 905 |
| Male | 25-34 | 1355.0 | 1355 |
| Male | 35-44 | 1347.0 | 1347 |
| Male | 45+ | 944.0 | 944 |
Each row corresponds to a unique combination of gender and age group. The equality between the two columns confirms that poststratify matched the full joint distribution without reverting to raking.
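For contrast, the sketch below rakes a toy sample to the two marginals only. It is a hand-rolled iterative proportional fitting loop written for illustration, not balance's raking implementation, and all numbers are hypothetical. The target is built with a strong gender-by-age association so that the difference from post-stratification is visible.

```python
import pandas as pd

cells = ["gender", "age_group"]

# Hypothetical toy sample: one respondent per cell, unit design weights.
sample = pd.DataFrame(
    {
        "gender": ["Female", "Female", "Male", "Male"],
        "age_group": ["18-24", "45+", "18-24", "45+"],
    }
)

# Hypothetical target cell counts with a strong gender-by-age association.
target_cells = pd.Series(
    {
        ("Female", "18-24"): 4.0,
        ("Female", "45+"): 1.0,
        ("Male", "18-24"): 1.0,
        ("Male", "45+"): 4.0,
    }
).rename_axis(cells)

# Hand-rolled raking: rescale the weights to one marginal at a time.
weights = pd.Series(1.0, index=sample.index)
for _ in range(10):
    for var in cells:
        target_margin = target_cells.groupby(level=var).sum()
        current_margin = weights.groupby(sample[var]).sum()
        weights = weights * sample[var].map(target_margin / current_margin)

# Compare the raked cell totals with the target cell totals.
raked_cells = weights.groupby([sample[c] for c in cells]).sum()
comparison = pd.concat(
    [raked_cells.rename("raked_sample"), target_cells.rename("target_population")],
    axis=1,
)
print(comparison)
```

In this toy case both marginals are matched (5 per gender and 5 per age group), yet every raked cell totals 2.5 while the target cells are 4, 1, 1 and 4. Matching the cells directly, as poststratify does, is what guarantees the joint distribution.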
Future extensions (TODO)
- PS via Sample.adjust(): Show the high-level API (sample.adjust(target, method="poststratify")) instead of calling poststratify() directly.
- Chained adjustments: Demonstrate IPW → poststratify (two-stage adjustment) with ASMD comparison at each stage. Example from balance notebook v03:

      sample_with_target = sample.set_target(target)
      adjust_stage_1 = sample_with_target.adjust(method="ipw")
      adjust_stage_2 = adjust_stage_1.adjust(method="poststratify")
      adjust_stage_2.covars().asmd().T
      adjust_stage_2.outcomes().plot()

- Controlling PS variables via transformations: Show how transformations={"age_group": lambda x: x} limits PS cells to only the named column. Example from balance notebook v03:

      transformations = {"age_group": lambda x: x}
      adjusted = ipw_adjusted.adjust(method="poststratify", transformations=transformations)
      adjusted.covars().asmd().T

- formula= does NOT work with poststratify yet: The following is silently ignored (formula goes into **kwargs but is never read). Show the working alternative:

      # Does NOT work as intended:
      adjusted = ipw_adjusted.adjust(method="poststratify", formula=["gender"])
      # Use instead:
      adjusted = ipw_adjusted.adjust(method="poststratify", variables=["gender"])

- ASMD diagnostics: Show adjusted.covars().asmd().T and asmd_improvement() for poststratified samples.
- Plotting: Add adjusted.covars().plot() and adjusted.outcomes().plot() examples.
- PS on binned numeric variables: Demonstrate using transformations (e.g., pd.cut) to bin continuous variables before PS.
- Strict vs non-strict matching: Show what happens when sample has cells not in target (strict_matching=False).
- Weight trimming: Demonstrate weight_trimming_mean_ratio with poststratify.