balance Quickstart (post-stratify): Matching known cell totals
This notebook demonstrates how to apply post-stratification with the balance package. We start by matching a single marginal distribution and then show how the same function can match the joint distribution of two variables without falling back to raking.
1. Load simulated data
The helper function balance.load_data returns a pair of simulated datasets: the target population and a biased sample. We use unit design weights (every row has weight 1) in this tutorial so that the weighted sums can be read as counts.
from balance import load_data
from balance.weighting_methods.poststratify import poststratify
import pandas as pd
target_df, sample_df = load_data()
target_df.head()
INFO (2025-11-25 02:34:06,412) [__init__/<module> (line 70)]: Using balance version 0.12.x
Welcome to balance (Version 0.12.x)!
An open-source Python package for balancing biased data samples.
📖 Documentation: https://import-balance.org/
🛠️ Get Help / Report Issues: https://github.com/facebookresearch/balance/issues/
📄 Citation:
Sarig, T., Galili, T., & Eilat, R. (2023).
balance - a Python package for balancing biased data samples.
https://arxiv.org/abs/2307.06024
Tip: You can access this information at any time with balance.help()
| | id | gender | age_group | income | happiness |
|---|---|---|---|---|---|
| 0 | 100000 | Male | 45+ | 10.183951 | 61.706333 |
| 1 | 100001 | Male | 45+ | 6.036858 | 79.123670 |
| 2 | 100002 | Male | 35-44 | 5.226629 | 44.206949 |
| 3 | 100003 | NaN | 45+ | 5.752147 | 83.985716 |
| 4 | 100004 | NaN | 25-34 | 4.837484 | 49.339713 |
2. Post-stratify on a single variable
We first adjust the sample so that its gender distribution matches the target population. Rows with missing gender are dropped to satisfy the default strict_matching=True requirement that every sample cell be present in the target.
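The requirement that every sample cell be present in the target can be pictured with a small check in plain Python. This is only a sketch: the helper `cells_missing_from_target` and the toy rows are hypothetical and are not part of balance, and the package's own check may work differently.

```python
# Hypothetical helper (not part of balance): find sample cells absent from the target.
def cells_missing_from_target(sample_rows, target_rows, variables):
    """Return the sorted sample cells that have no counterpart in the target."""
    def cell(row):
        # A cell is the tuple of values taken by the stratification variables.
        return tuple(row[v] for v in variables)

    target_cells = {cell(r) for r in target_rows}
    sample_cells = {cell(r) for r in sample_rows}
    return sorted(sample_cells - target_cells)


# Toy rows, made up for illustration.
sample_rows = [{"gender": "Female"}, {"gender": "Male"}, {"gender": "Other"}]
target_rows = [{"gender": "Female"}, {"gender": "Male"}]

missing = cells_missing_from_target(sample_rows, target_rows, ["gender"])
print(missing)  # [('Other',)]
```

An empty result means every sample cell has a target total to be scaled to; a non-empty one names the cells that would need to be dropped or handled before weighting.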
# Keep only rows with an observed gender in both datasets.
sample_gender = sample_df.dropna(subset=["gender"])
target_gender = target_df.dropna(subset=["gender"])
# Post-stratify on gender, using unit design weights for sample and target.
gender_result = poststratify(
    sample_df=sample_gender[["gender"]],
    sample_weights=pd.Series(1, index=sample_gender.index),
    target_df=target_gender[["gender"]],
    target_weights=pd.Series(1, index=target_gender.index),
)
# Attach the fitted weights and compare weighted sample totals with target counts.
gender_weights = sample_gender.assign(weight=gender_result["weight"])
gender_summary = pd.concat(
    [
        gender_weights.groupby("gender")["weight"].sum().rename("weighted_sample"),
        target_gender.groupby("gender").size().rename("target_population"),
    ],
    axis=1,
)
gender_summary
INFO (2025-11-25 02:34:06,448) [adjustment/apply_transformations (line 418)]: Adding the variables: []
INFO (2025-11-25 02:34:06,449) [adjustment/apply_transformations (line 419)]: Transforming the variables: ['gender']
INFO (2025-11-25 02:34:06,453) [adjustment/apply_transformations (line 456)]: Final variables in output: ['gender']
| gender | weighted_sample | target_population |
|---|---|---|
| Female | 4551.0 | 4551 |
| Male | 4551.0 | 4551 |
The weighted sample counts now reproduce the target population counts. Dividing by the column totals would show that the sample proportions also match the target proportions.
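The arithmetic behind this result can be sketched in plain Python. The example assumes the standard post-stratification rule, in which each sample row receives the target total of its cell divided by the sample total of that cell. The toy data are made up, and the snippet illustrates the idea rather than the internals of poststratify.

```python
from collections import Counter

# Toy data (made up for illustration): one cell label per row, unit design weights.
sample_cells = ["Female"] * 2 + ["Male"] * 5
target_cells = ["Female"] * 50 + ["Male"] * 50

sample_totals = Counter(sample_cells)  # sample total per cell
target_totals = Counter(target_cells)  # target total per cell

# Each sample row gets target_total / sample_total of its own cell.
weights = [target_totals[c] / sample_totals[c] for c in sample_cells]

# Weighted sample totals per cell.
weighted_totals = Counter()
for c, w in zip(sample_cells, weights):
    weighted_totals[c] += w

# Dividing by the overall totals turns counts into proportions.
total_weight = sum(weighted_totals.values())
weighted_props = {c: weighted_totals[c] / total_weight for c in weighted_totals}
target_props = {c: n / len(target_cells) for c, n in target_totals.items()}

print(dict(weighted_totals))
print(weighted_props, target_props)
```

In this toy sample the two Female rows each carry weight 25 and the five Male rows each carry weight 10, so both cells sum to the target count of 50 and both proportions equal 0.5.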
3. Post-stratify on the joint distribution of two variables
Post-stratification can use multiple variables simultaneously. In that case the function computes a weight per cell defined by the unique combination of those variables. This is different from raking, which iteratively matches the marginals of each variable.
covariates = ["gender", "age_group"]
sample_cells = sample_df.dropna(subset=covariates)
target_cells = target_df.dropna(subset=covariates)
joint_result = poststratify(
sample_df=sample_cells[covariates],
sample_weights=pd.Series(1, index=sample_cells.index),
target_df=target_cells[covariates],
target_weights=pd.Series(1, index=target_cells.index),
)
joint_weights = sample_cells.assign(weight=joint_result["weight"])
joint_summary = pd.concat(
[
joint_weights.groupby(covariates)["weight"].sum().rename("weighted_sample"),
target_cells.groupby(covariates).size().rename("target_population"),
],
axis=1,
)
joint_summary
INFO (2025-11-25 02:34:06,479) [adjustment/apply_transformations (line 418)]: Adding the variables: []
INFO (2025-11-25 02:34:06,480) [adjustment/apply_transformations (line 419)]: Transforming the variables: ['gender', 'age_group']
INFO (2025-11-25 02:34:06,487) [adjustment/apply_transformations (line 456)]: Final variables in output: ['gender', 'age_group']
| gender | age_group | weighted_sample | target_population |
|---|---|---|---|
| Female | 18-24 | 876.0 | 876 |
| Female | 25-34 | 1360.0 | 1360 |
| Female | 35-44 | 1370.0 | 1370 |
| Female | 45+ | 945.0 | 945 |
| Male | 18-24 | 905.0 | 905 |
| Male | 25-34 | 1355.0 | 1355 |
| Male | 35-44 | 1347.0 | 1347 |
| Male | 45+ | 944.0 | 944 |
Each row corresponds to a unique combination of gender and age group. The equality between the two columns confirms that poststratify matched the full joint distribution without reverting to raking.
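To make the contrast with raking concrete, here is a minimal sketch on a made-up 2x2 table. It assumes raking means iterative proportional fitting to the target marginals, and that post-stratification scales each cell by its target total divided by its sample total. The functions `marginal` and `rake` are illustrative helpers, not part of balance.

```python
# Toy 2x2 cell totals (made up for illustration), keyed by (gender, age_group).
sample = {("F", "young"): 10.0, ("F", "old"): 30.0,
          ("M", "young"): 40.0, ("M", "old"): 20.0}
target = {("F", "young"): 40.0, ("F", "old"): 10.0,
          ("M", "young"): 10.0, ("M", "old"): 40.0}


def marginal(table, axis):
    """Sum cell totals over the other variable (axis 0 = gender, 1 = age_group)."""
    out = {}
    for cell, value in table.items():
        out[cell[axis]] = out.get(cell[axis], 0.0) + value
    return out


def rake(table, target_table, n_iter=50):
    """Iterative proportional fitting: rescale to each target marginal in turn."""
    fitted = dict(table)
    for _ in range(n_iter):
        for axis in (0, 1):
            current = marginal(fitted, axis)
            wanted = marginal(target_table, axis)
            fitted = {c: v * wanted[c[axis]] / current[c[axis]]
                      for c, v in fitted.items()}
    return fitted


raked = rake(sample, target)

# Post-stratification: one factor per joint cell, so weighted totals equal the target.
poststratified = {c: sample[c] * (target[c] / sample[c]) for c in sample}

print(marginal(raked, 0), marginal(raked, 1))  # both marginals close to 50/50
print(raked[("F", "young")])                   # about 14.5, not the target's 40
print(poststratified[("F", "young")])          # 40.0
```

Raking reproduces both target marginals but keeps the sample's association between the two variables, so its joint cells can differ from the target's. Post-stratification matches every joint cell, which requires the target's joint totals to be known.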