balance Quickstart (raking): Analyzing and adjusting the bias on a simulated toy dataset
The raking method is an advanced technique that extends post-stratification. It is well suited for situations where we have the marginal distributions of multiple covariates but do not know their joint distribution. Raking works by post-stratifying the data on the first covariate, using the resulting weights as the input for adjustment on the second covariate, and so on. Once all covariates have been used for adjustment, the process is repeated until a specified level of convergence is reached.
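To make the procedure concrete, here is a minimal sketch of the iterative proportional fitting idea behind raking, written in plain pandas/numpy. It is not balance's implementation (which also handles variable transformations, missing values, and weight trimming); the function name and structure are illustrative only.
import numpy as np
import pandas as pd

def rake_sketch(df, target_margins, max_iter=50, tol=1e-6):
    # df: categorical covariates; target_margins: {column: {level: target proportion}}.
    w = pd.Series(1.0, index=df.index)
    for _ in range(max_iter):
        w_prev = w.copy()
        for col, margins in target_margins.items():
            # Current weighted share of each level of this covariate.
            shares = w.groupby(df[col]).sum() / w.sum()
            # Multiply each row's weight by (target share / current share) of its level.
            factors = {lvl: p / shares[lvl] for lvl, p in margins.items() if shares.get(lvl, 0) > 0}
            w = w * df[col].map(factors).fillna(1.0)
        if (w - w_prev).abs().max() < tol:  # weights stopped moving: converged
            break
    return w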
One of the main advantages of raking is that it can work with user-level data while also making use of marginal distributions that lack user-level granularity. Another benefit is that it can fit these marginal distributions very closely, depending on the convergence achieved. This is in contrast to techniques such as inverse probability weighting (IPW) and covariate balancing propensity score (CBPS), which may only approximate the target distributions and can fail to fit them even at the marginal level.
This notebook demonstrates how to use the raking method and showcases the high degree of fit it can provide.
Load the data
%matplotlib inline
import plotly.offline as offline
offline.init_notebook_mode()
from balance import load_data
INFO (2025-12-07 10:45:08,707) [__init__/<module> (line 68)]: Using balance version 0.13.x
balance (Version 0.13.x) loaded:
📖 Documentation: https://import-balance.org/
🛠️ Help / Issues: https://github.com/facebookresearch/balance/issues/
📄 Citation:
Sarig, T., Galili, T., & Eilat, R. (2023).
balance - a Python package for balancing biased data samples.
https://arxiv.org/abs/2307.06024
Tip: You can view this message anytime with balance.help()
target_df, sample_df = load_data()
print("target_df: \n", target_df.head())
print("sample_df: \n", sample_df.head())
target_df:
id gender age_group income happiness
0 100000 Male 45+ 10.183951 61.706333
1 100001 Male 45+ 6.036858 79.123670
2 100002 Male 35-44 5.226629 44.206949
3 100003 NaN 45+ 5.752147 83.985716
4 100004 NaN 25-34 4.837484 49.339713
sample_df:
id gender age_group income happiness
0 0 Male 25-34 6.428659 26.043029
1 1 Female 18-24 9.940280 66.885485
2 2 Male 18-24 2.673623 37.091922
3 3 NaN 18-24 10.550308 49.394050
4 4 NaN 18-24 2.689994 72.304208
from balance import Sample
Raking can also work with numerical variables, since they are automatically bucketed. But to keep the discussion simple, we'll focus only on age group and gender.
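For intuition only, here is what bucketing a numeric covariate such as income could look like, using pandas quartile bins; balance's automatic bucketing for raking may use a different scheme.
import pandas as pd

# Hypothetical illustration: cut the numeric income column into quartile buckets.
pd.qcut(sample_df["income"], q=4).value_counts()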
sample = Sample.from_frame(sample_df[['id', 'gender', 'age_group', 'happiness']], outcome_columns=["happiness"])
target = Sample.from_frame(target_df[['id', 'gender', 'age_group', 'happiness']], outcome_columns=["happiness"])
sample_with_target = sample.set_target(target)
WARNING (2025-12-07 10:45:08,924) [util/guess_id_column (line 304)]: Guessed id column name id for the data
WARNING (2025-12-07 10:45:08,933) [sample_class/from_frame (line 393)]: No weights passed. Adding a 'weight' column and setting all values to 1
WARNING (2025-12-07 10:45:08,942) [util/guess_id_column (line 304)]: Guessed id column name id for the data
WARNING (2025-12-07 10:45:08,956) [sample_class/from_frame (line 393)]: No weights passed. Adding a 'weight' column and setting all values to 1
Fit models using ipw and rake
Fit an ipw model:
adjusted_ipw = sample_with_target.adjust(method = "ipw")
INFO (2025-12-07 10:45:08,969) [ipw/ipw (line 622)]: Starting ipw function
INFO (2025-12-07 10:45:08,971) [adjustment/apply_transformations (line 449)]: Adding the variables: []
INFO (2025-12-07 10:45:08,972) [adjustment/apply_transformations (line 450)]: Transforming the variables: ['gender', 'age_group']
INFO (2025-12-07 10:45:08,979) [adjustment/apply_transformations (line 487)]: Final variables in output: ['gender', 'age_group']
INFO (2025-12-07 10:45:08,984) [ipw/ipw (line 656)]: Building model matrix
INFO (2025-12-07 10:45:09,048) [ipw/ipw (line 678)]: The formula used to build the model matrix: ['gender + age_group + _is_na_gender']
INFO (2025-12-07 10:45:09,049) [ipw/ipw (line 681)]: The number of columns in the model matrix: 7
INFO (2025-12-07 10:45:09,049) [ipw/ipw (line 682)]: The number of rows in the model matrix: 11000
INFO (2025-12-07 10:45:23,633) [ipw/ipw (line 841)]: Done with sklearn
INFO (2025-12-07 10:45:23,634) [ipw/ipw (line 843)]: max_de: None
INFO (2025-12-07 10:45:23,634) [ipw/ipw (line 865)]: Starting model selection
INFO (2025-12-07 10:45:23,637) [ipw/ipw (line 898)]: Chosen lambda: 0.041158338186664825
INFO (2025-12-07 10:45:23,638) [ipw/ipw (line 916)]: Proportion null deviance explained 0.11579573851363634
Fit a raking model (on the user-level data as input):
adjusted_rake = sample_with_target.adjust(method = "rake")
INFO (2025-12-07 10:45:23,652) [adjustment/apply_transformations (line 449)]: Adding the variables: []
INFO (2025-12-07 10:45:23,653) [adjustment/apply_transformations (line 450)]: Transforming the variables: ['gender', 'age_group']
INFO (2025-12-07 10:45:23,659) [adjustment/apply_transformations (line 487)]: Final variables in output: ['gender', 'age_group']
INFO (2025-12-07 10:45:23,668) [rake/rake (line 265)]: Final covariates and levels that will be used in raking: {'age_group': ['18-24', '25-34', '35-44', '45+'], 'gender': ['Female', 'Male', '__NaN__']}.
Comparing the results of ipw and rake, we can see that rake has a larger design effect, but it achieves a perfect fit to the target, whereas ipw achieves only a partial fit.
This is visible in the ASMD values as well as in the bar plots below.
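The design effect reported in these summaries follows Kish's formula (an assumption about balance's reporting, though this is the standard definition), and can be computed directly from a vector of weights. The helper below is a small illustrative sketch, not part of the balance API.
import numpy as np

def kish_design_effect(w):
    # Kish's design effect: n * sum(w**2) / (sum(w))**2.
    # The effective sample size is then n / deff.
    w = np.asarray(w, dtype=float)
    return len(w) * np.sum(w**2) / np.sum(w) ** 2

kish_design_effect([1.0, 1.0, 1.0, 1.0])  # equal weights -> deff of 1.0
A larger design effect means a smaller effective sample size, which is the price rake pays here for its exact fit.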
print(adjusted_ipw.summary())
Adjustment details:
method: ipw
weight trimming mean ratio: 20
design effect (Deff): 1.527, eff. sample size: 654.8
Covar ASMD reduction: 77.6%, design effect: 1.527
Covar ASMD (6 variables): 0.243 -> 0.055
Covar mean KLD reduction: 94.1%
Covar mean KLD (2 variables): 0.049 -> 0.003
Model performance: Model proportion deviance explained: 0.116
print(adjusted_rake.summary())
Adjustment details:
method: rake
design effect (Deff): 2.103, eff. sample size: 475.6
Covar ASMD reduction: 100.0%, design effect: 2.103
Covar ASMD (6 variables): 0.243 -> 0.000
Covar mean KLD reduction: 100.0%
Covar mean KLD (2 variables): 0.049 -> 0.000
adjusted_ipw.covars().plot()
adjusted_rake.covars().plot()
Using a marginal distribution with rake
The benefit of rake is that we can define a target population from a marginal distribution and fit towards it.
The function to use for this purpose is prepare_marginal_dist_for_raking.
To demonstrate this, let us assume we have another target population in mind, with different proportions. Since these proportions are known, we can create a target DataFrame representing that population from a dict of marginal distributions, using the prepare_marginal_dist_for_raking function shown below.
from balance.weighting_methods.rake import prepare_marginal_dist_for_raking
import numpy as np
a_dict_with_marginal_distributions = {
    "gender": {"Female": 0.1, "Male": 0.85, np.nan: 0.05},
    "age_group": {"18-24": 0.25, "25-34": 0.25, "35-44": 0.25, "45+": 0.25},
}
target_df_from_marginals = prepare_marginal_dist_for_raking(a_dict_with_marginal_distributions)
target_df_from_marginals
|  | gender | age_group | id |
|---|---|---|---|
| 0 | Female | 18-24 | 0 |
| 1 | Female | 25-34 | 1 |
| 2 | Male | 35-44 | 2 |
| 3 | Male | 45+ | 3 |
| 4 | Male | 18-24 | 4 |
| 5 | Male | 25-34 | 5 |
| 6 | Male | 35-44 | 6 |
| 7 | Male | 45+ | 7 |
| 8 | Male | 18-24 | 8 |
| 9 | Male | 25-34 | 9 |
| 10 | Male | 35-44 | 10 |
| 11 | Male | 45+ | 11 |
| 12 | Male | 18-24 | 12 |
| 13 | Male | 25-34 | 13 |
| 14 | Male | 35-44 | 14 |
| 15 | Male | 45+ | 15 |
| 16 | Male | 18-24 | 16 |
| 17 | Male | 25-34 | 17 |
| 18 | Male | 35-44 | 18 |
| 19 | NaN | 45+ | 19 |
target_df_from_marginals.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   gender     19 non-null     object
 1   age_group  20 non-null     object
 2   id         20 non-null     int64
dtypes: int64(1), object(2)
memory usage: 608.0+ bytes
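As a quick sanity check (not part of the original notebook code), we can confirm that the realized frame reproduces the requested marginals:
# Each column's proportions should match a_dict_with_marginal_distributions.
print(target_df_from_marginals["gender"].value_counts(normalize=True, dropna=False))
print(target_df_from_marginals["age_group"].value_counts(normalize=True, dropna=False))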
With the new target_df_from_marginals object ready, we can use it as a target. Notice that this makes sense ONLY for the raking method. This should NOT be used for any other method.
target_from_marginals = Sample.from_frame(target_df_from_marginals)
sample_with_target_2 = sample.set_target(target_from_marginals)
WARNING (2025-12-07 10:45:24,671) [util/guess_id_column (line 304)]: Guessed id column name id for the data
WARNING (2025-12-07 10:45:24,672) [sample_class/from_frame (line 319)]: Casting id column to string
WARNING (2025-12-07 10:45:24,679) [util/_warn_of_df_dtypes_change (line 2200)]: The dtypes of sample._df were changed from the original dtypes of the input df, here are the differences -
WARNING (2025-12-07 10:45:24,679) [util/_warn_of_df_dtypes_change (line 2209)]: The (old) dtypes that changed for df (before the change):
WARNING (2025-12-07 10:45:24,681) [util/_warn_of_df_dtypes_change (line 2212)]: id int64 dtype: object
WARNING (2025-12-07 10:45:24,681) [util/_warn_of_df_dtypes_change (line 2213)]: The (new) dtypes saved in df (after the change):
WARNING (2025-12-07 10:45:24,682) [util/_warn_of_df_dtypes_change (line 2214)]: id object dtype: object
WARNING (2025-12-07 10:45:24,683) [sample_class/from_frame (line 393)]: No weights passed. Adding a 'weight' column and setting all values to 1
And fit a raking model:
adjusted_rake_2 = sample_with_target_2.adjust(method = "rake")
INFO (2025-12-07 10:45:24,693) [adjustment/apply_transformations (line 449)]: Adding the variables: []
INFO (2025-12-07 10:45:24,694) [adjustment/apply_transformations (line 450)]: Transforming the variables: ['gender', 'age_group']
INFO (2025-12-07 10:45:24,697) [adjustment/apply_transformations (line 487)]: Final variables in output: ['gender', 'age_group']
INFO (2025-12-07 10:45:24,702) [rake/rake (line 265)]: Final covariates and levels that will be used in raking: {'age_group': ['18-24', '25-34', '35-44', '45+'], 'gender': ['Female', 'Male', '__NaN__']}.
As the following output shows, the weighted data now has a perfect fit to the marginal distributions defined for age_group and gender.
print(adjusted_rake_2.summary())
Adjustment details:
method: rake
design effect (Deff): 2.176, eff. sample size: 459.6
Covar ASMD reduction: 100.0%, design effect: 2.176
Covar ASMD (6 variables): 0.341 -> 0.000
Covar mean KLD reduction: 100.0%
Covar mean KLD (2 variables): 0.071 -> 0.000
adjusted_rake_2.covars().plot()
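We can also inspect the weighted covariate means directly. A hedged sketch, assuming the covars().mean() accessor (which reports means for the adjusted sample, the unadjusted sample, and the target); for the one-hot encoded gender levels, the adjusted means should match the requested 0.10 / 0.85 / 0.05 proportions.
# Weighted covariate means for the adjusted sample vs. the target built from the marginals.
adjusted_rake_2.covars().mean().T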