balance Quickstart (post-stratify): Matching known cell totals

This notebook demonstrates how to apply post-stratification with the balance package. We start by matching a single marginal distribution and then show how the same function can match the joint distribution of two variables without falling back to raking.

1. Load simulated data

The helper function balance.load_data returns a pair of simulated datasets: the target population and a biased sample. We use unit design weights (every row has weight 1) throughout this tutorial so that weighted sums can be read directly as counts.

In [1]:
from balance import load_data
from balance.weighting_methods.poststratify import poststratify
import pandas as pd

target_df, sample_df = load_data()
target_df.head()
INFO (2025-11-25 02:34:06,412) [__init__/<module> (line 70)]: Using balance version 0.12.x
Welcome to balance (Version 0.12.x)!
An open-source Python package for balancing biased data samples.

📖 Documentation: https://import-balance.org/
🛠️ Get Help / Report Issues: https://github.com/facebookresearch/balance/issues/
📄 Citation:
    Sarig, T., Galili, T., & Eilat, R. (2023).
    balance - a Python package for balancing biased data samples.
    https://arxiv.org/abs/2307.06024

Tip: You can access this information at any time with balance.help()

Out[1]:
       id gender age_group     income  happiness
0  100000   Male       45+  10.183951  61.706333
1  100001   Male       45+   6.036858  79.123670
2  100002   Male     35-44   5.226629  44.206949
3  100003    NaN       45+   5.752147  83.985716
4  100004    NaN     25-34   4.837484  49.339713

2. Post-stratify on a single variable

We first adjust the sample so that its gender distribution matches the target population. Rows with missing gender are dropped to satisfy the default strict_matching=True requirement that every sample cell be present in the target.
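Before calling poststratify, it can help to see how many rows have a missing value and whether every remaining sample cell also appears in the target. The following is a minimal pandas-only sketch on made-up toy data (the frames and variable names are hypothetical); it illustrates the kind of check described above and does not claim to reproduce how balance validates its inputs.

```python
import pandas as pd

# Hypothetical toy data, not the simulated data from load_data()
sample = pd.DataFrame({"gender": ["Female", "Male", None, "Male"]})
target = pd.DataFrame({"gender": ["Female", "Male", "Female"]})

# Count rows whose cell is undefined because gender is missing
n_missing = sample["gender"].isna().sum()

# Drop those rows, as the tutorial does before post-stratifying
sample_clean = sample.dropna(subset=["gender"])

# Cells present in the sample but absent from the target
sample_cells = set(sample_clean["gender"].unique())
target_cells = set(target["gender"].unique())
uncovered = sample_cells - target_cells

print(f"rows with missing gender: {n_missing}")
print(f"sample cells not found in target: {uncovered}")
```

An empty set of uncovered cells means every sample cell has a matching target cell.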

In [2]:
sample_gender = sample_df.dropna(subset=["gender"])
target_gender = target_df.dropna(subset=["gender"])

gender_result = poststratify(
    sample_df=sample_gender[["gender"]],
    sample_weights=pd.Series(1, index=sample_gender.index),
    target_df=target_gender[["gender"]],
    target_weights=pd.Series(1, index=target_gender.index),
)

gender_weights = sample_gender.assign(weight=gender_result["weight"])
gender_summary = pd.concat(
    [
        gender_weights.groupby("gender")["weight"].sum().rename("weighted_sample"),
        target_gender.groupby("gender").size().rename("target_population"),
    ],
    axis=1,
)
gender_summary
INFO (2025-11-25 02:34:06,448) [adjustment/apply_transformations (line 418)]: Adding the variables: []
INFO (2025-11-25 02:34:06,449) [adjustment/apply_transformations (line 419)]: Transforming the variables: ['gender']
INFO (2025-11-25 02:34:06,453) [adjustment/apply_transformations (line 456)]: Final variables in output: ['gender']
Out[2]:
        weighted_sample  target_population
gender
Female           4551.0               4551
Male             4551.0               4551

The weighted sample counts now reproduce the target population counts. Dividing by the column totals would show that the sample proportions also match the target proportions.
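To make the mechanics concrete, here is a minimal sketch of the textbook post-stratification calculation in plain pandas: each sample row's design weight is multiplied by the ratio of the target total to the sample total in its cell. The toy data and column names are hypothetical, and this is an illustration of the idea rather than the internals of balance's poststratify.

```python
import pandas as pd

# Hypothetical toy data: 3 women and 1 man sampled from a 6/4 population
sample = pd.DataFrame({"gender": ["Female", "Female", "Female", "Male"]})
target = pd.DataFrame({"gender": ["Female"] * 6 + ["Male"] * 4})

# Unit design weights, as in the tutorial
sample["design_weight"] = 1.0
target["design_weight"] = 1.0

# Weighted totals per cell in each dataset
sample_totals = sample.groupby("gender")["design_weight"].sum()
target_totals = target.groupby("gender")["design_weight"].sum()

# One adjustment factor per cell: target total / sample total
factor = target_totals / sample_totals

# Final weight = design weight * factor of the row's cell
sample["weight"] = sample["design_weight"] * sample["gender"].map(factor)

# Weighted sample totals per cell now equal the target totals
weighted = sample.groupby("gender")["weight"].sum()
print(weighted)
```

In this toy case each Female row receives weight 2 and the single Male row receives weight 4, so the weighted totals are 6 and 4, the same as the target counts.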

3. Post-stratify on the joint distribution of two variables

Post-stratification can use multiple variables simultaneously. In that case the function computes a weight per cell defined by the unique combination of those variables. This is different from raking, which iteratively matches the marginals of each variable.

In [3]:
covariates = ["gender", "age_group"]
sample_cells = sample_df.dropna(subset=covariates)
target_cells = target_df.dropna(subset=covariates)

joint_result = poststratify(
    sample_df=sample_cells[covariates],
    sample_weights=pd.Series(1, index=sample_cells.index),
    target_df=target_cells[covariates],
    target_weights=pd.Series(1, index=target_cells.index),
)

joint_weights = sample_cells.assign(weight=joint_result["weight"])
joint_summary = pd.concat(
    [
        joint_weights.groupby(covariates)["weight"].sum().rename("weighted_sample"),
        target_cells.groupby(covariates).size().rename("target_population"),
    ],
    axis=1,
)
joint_summary
INFO (2025-11-25 02:34:06,479) [adjustment/apply_transformations (line 418)]: Adding the variables: []
INFO (2025-11-25 02:34:06,480) [adjustment/apply_transformations (line 419)]: Transforming the variables: ['gender', 'age_group']
INFO (2025-11-25 02:34:06,487) [adjustment/apply_transformations (line 456)]: Final variables in output: ['gender', 'age_group']
Out[3]:
                  weighted_sample  target_population
gender age_group
Female 18-24                876.0                876
       25-34               1360.0               1360
       35-44               1370.0               1370
       45+                  945.0                945
Male   18-24                905.0                905
       25-34               1355.0               1355
       35-44               1347.0               1347
       45+                  944.0                944

Each row corresponds to a unique combination of gender and age group. The equality between the two columns confirms that poststratify matched the full joint distribution without reverting to raking.
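The difference between joint post-stratification and raking can be seen on a small 2x2 table. The sketch below uses made-up counts and a hand-written iterative proportional fitting loop in NumPy (it does not call balance's raking implementation). Post-stratification applies one factor per cell and reproduces the target table exactly, whereas raking only sees the row and column totals and can therefore land on a different joint table with the same marginals.

```python
import numpy as np

# Hypothetical 2x2 tables of counts (rows: variable A, columns: variable B)
sample_counts = np.array([[10.0, 10.0], [10.0, 10.0]])
target_counts = np.array([[40.0, 10.0], [10.0, 40.0]])

# Joint post-stratification: one adjustment factor per cell
cell_factor = target_counts / sample_counts
poststratified = sample_counts * cell_factor

# Raking (iterative proportional fitting): uses only the marginal totals
row_targets = target_counts.sum(axis=1)
col_targets = target_counts.sum(axis=0)
raked = sample_counts.copy()
for _ in range(50):
    # Scale each row so its total matches the target row total
    raked *= (row_targets / raked.sum(axis=1))[:, None]
    # Scale each column so its total matches the target column total
    raked *= (col_targets / raked.sum(axis=0))[None, :]

print("post-stratified table:\n", poststratified)
print("raked table:\n", raked)
```

Both results have row and column totals of 50, but only the post-stratified table recovers the 40/10/10/40 pattern of the target; the raked table keeps the uniform association of the sample.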