balance: transformations and formulas
This tutorial focuses on the ways transformations, formulas and penalties can be included in the pre-processing of your covariates before adjusting for them.
Example dataset - preparing the objects
The following is a toy simulated dataset.
For a more basic walkthrough of the elements in the next code block, please take a look at the tutorial: balance Quickstart: Analyzing and adjusting the bias on a simulated toy dataset
from balance import load_data
target_df, sample_df = load_data()
from balance import Sample
sample = Sample.from_frame(sample_df, outcome_columns=["happiness"])
target = Sample.from_frame(target_df, outcome_columns=["happiness"])
sample_with_target = sample.set_target(target)
sample_with_target
INFO (2026-01-06 22:12:29,970) [__init__/<module> (line 72)]: Using balance version 0.14.0
WARNING (2026-01-06 22:12:30,161) [input_validation/guess_id_column (line 153)]: Guessed id column name id for the data
WARNING (2026-01-06 22:12:30,171) [sample_class/from_frame (line 508)]: No weights passed. Adding a 'weight' column and setting all values to 1
balance (Version 0.14.0) loaded:
Documentation: https://import-balance.org/
Help / Issues: https://github.com/facebookresearch/balance/issues/
Citation:
Sarig, T., Galili, T., & Eilat, R. (2023).
balance - a Python package for balancing biased data samples.
https://arxiv.org/abs/2307.06024
Tip: You can view this message anytime with balance.help()
WARNING (2026-01-06 22:12:30,179) [input_validation/guess_id_column (line 153)]: Guessed id column name id for the data
WARNING (2026-01-06 22:12:30,193) [sample_class/from_frame (line 508)]: No weights passed. Adding a 'weight' column and setting all values to 1
(balance.sample_class.Sample)
balance Sample object with target set
1000 observations x 3 variables: gender,age_group,income
id_column: id, weight_column: weight,
outcome_columns: happiness
target:
balance Sample object
10000 observations x 3 variables: gender,age_group,income
id_column: id, weight_column: weight,
outcome_columns: happiness
3 common variables: gender,age_group,income
When trying to understand what an adjustment does, we can look at the model_coef items collected from the diagnostics method.
adjusted = sample_with_target.adjust(
# method="ipw", # default method
# transformations=None,
# formula=None,
# penalty_factor=None, # all 1s
# max_de=None,
)
adj_diag = adjusted.diagnostics()
adj_diag.query("metric == 'model_coef'")
INFO (2026-01-06 22:12:30,212) [ipw/ipw (line 622)]: Starting ipw function
INFO (2026-01-06 22:12:30,216) [adjustment/apply_transformations (line 469)]: Adding the variables: []
INFO (2026-01-06 22:12:30,216) [adjustment/apply_transformations (line 470)]: Transforming the variables: ['gender', 'age_group', 'income']
INFO (2026-01-06 22:12:30,228) [adjustment/apply_transformations (line 507)]: Final variables in output: ['gender', 'age_group', 'income']
INFO (2026-01-06 22:12:30,236) [ipw/ipw (line 656)]: Building model matrix
INFO (2026-01-06 22:12:30,334) [ipw/ipw (line 678)]: The formula used to build the model matrix: ['income + gender + age_group + _is_na_gender']
INFO (2026-01-06 22:12:30,335) [ipw/ipw (line 681)]: The number of columns in the model matrix: 16
INFO (2026-01-06 22:12:30,335) [ipw/ipw (line 682)]: The number of rows in the model matrix: 11000
INFO (2026-01-06 22:12:47,332) [ipw/ipw (line 843)]: Done with sklearn
INFO (2026-01-06 22:12:47,333) [ipw/ipw (line 845)]: max_de: None
INFO (2026-01-06 22:12:47,334) [ipw/ipw (line 867)]: Starting model selection
INFO (2026-01-06 22:12:47,337) [ipw/ipw (line 900)]: Chosen lambda: 0.041158338186664825
INFO (2026-01-06 22:12:47,338) [ipw/ipw (line 918)]: Proportion null deviance explained 0.17265121909892267
INFO (2026-01-06 22:12:47,342) [sample_class/diagnostics (line 1406)]: Starting computation of diagnostics of the fitting
INFO (2026-01-06 22:12:47,612) [sample_class/diagnostics (line 1627)]: Done computing diagnostics
| | metric | val | var |
|---|---|---|---|
| 44 | model_coef | 0.142821 | intercept |
| 45 | model_coef | 0.043803 | _is_na_gender[T.True] |
| 46 | model_coef | -0.203801 | age_group[T.25-34] |
| 47 | model_coef | -0.428742 | age_group[T.35-44] |
| 48 | model_coef | -0.529629 | age_group[T.45+] |
| 49 | model_coef | 0.332477 | gender[T.Male] |
| 50 | model_coef | 0.043803 | gender[T._NA] |
| 51 | model_coef | 0.168359 | income[Interval(-0.0009997440000000001, 0.44, ... |
| 52 | model_coef | 0.152978 | income[Interval(0.44, 1.664, closed='right')] |
| 53 | model_coef | 0.110040 | income[Interval(1.664, 3.472, closed='right')] |
| 54 | model_coef | -0.042502 | income[Interval(11.312, 15.139, closed='right')] |
| 55 | model_coef | -0.162192 | income[Interval(15.139, 20.567, closed='right')] |
| 56 | model_coef | -0.212078 | income[Interval(20.567, 29.504, closed='right')] |
| 57 | model_coef | -0.358711 | income[Interval(29.504, 128.536, closed='right')] |
| 58 | model_coef | 0.092556 | income[Interval(3.472, 5.663, closed='right')] |
| 59 | model_coef | 0.071799 | income[Interval(5.663, 8.211, closed='right')] |
| 60 | model_coef | 0.004690 | income[Interval(8.211, 11.312, closed='right')] |
As we can see from the glm coefficients, gender got an extra NA column (gender[T._NA], alongside the _is_na_gender indicator), and the income variable was bucketed into 10 buckets.
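To build intuition for what bucketing a numeric covariate into quantile buckets means, here is a minimal plain-Python sketch. It is not balance's implementation (the package's own helper works on pandas objects); the function name `quantile_buckets` and the toy income values are made up for illustration.

```python
from bisect import bisect_left
from statistics import quantiles


def quantile_buckets(values, q):
    """Assign each value to one of q quantile-based buckets (0 .. q-1).

    Hypothetical helper for illustration only; not part of balance.
    """
    # q - 1 interior cut points that split the data into q roughly equal groups
    cuts = quantiles(values, n=q, method="inclusive")
    # bisect_left makes buckets right-closed: a value equal to a cut point
    # lands in the lower bucket, similar to the (a, b] intervals in the table
    return [bisect_left(cuts, v) for v in values]


# Toy data (invented), bucketed into 5 groups
income = [0.5, 1.2, 2.0, 3.3, 4.1, 5.8, 7.7, 9.4, 12.0, 30.0]
buckets = quantile_buckets(income, q=5)
print(buckets)
```

Each bucket then enters the model as its own dummy column, which is why a single numeric covariate can produce many coefficients.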
We can change these defaults by deciding on the specific transformation we want.
Let's start with NO transformations.
The transformations argument accepts either a dict or None. None indicates no transformations.
adjusted = sample_with_target.adjust(
# method="ipw",
transformations=None,
# formula=formula,
# penalty_factor=penalty_factor,
# max_de=None,
)
adj_diag = adjusted.diagnostics()
adj_diag.query("metric == 'model_coef'")
INFO (2026-01-06 22:12:47,631) [ipw/ipw (line 622)]: Starting ipw function
INFO (2026-01-06 22:12:47,633) [ipw/ipw (line 656)]: Building model matrix
INFO (2026-01-06 22:12:47,700) [ipw/ipw (line 678)]: The formula used to build the model matrix: ['income + gender + age_group + _is_na_gender']
INFO (2026-01-06 22:12:47,701) [ipw/ipw (line 681)]: The number of columns in the model matrix: 8
INFO (2026-01-06 22:12:47,701) [ipw/ipw (line 682)]: The number of rows in the model matrix: 11000
INFO (2026-01-06 22:13:03,538) [ipw/ipw (line 843)]: Done with sklearn
INFO (2026-01-06 22:13:03,539) [ipw/ipw (line 845)]: max_de: None
INFO (2026-01-06 22:13:03,540) [ipw/ipw (line 867)]: Starting model selection
INFO (2026-01-06 22:13:03,543) [ipw/ipw (line 900)]: Chosen lambda: 0.0368353720078807
INFO (2026-01-06 22:13:03,544) [ipw/ipw (line 918)]: Proportion null deviance explained 0.17353587606008936
INFO (2026-01-06 22:13:03,548) [sample_class/diagnostics (line 1406)]: Starting computation of diagnostics of the fitting
INFO (2026-01-06 22:13:03,820) [sample_class/diagnostics (line 1627)]: Done computing diagnostics
| | metric | val | var |
|---|---|---|---|
| 44 | model_coef | 0.998089 | intercept |
| 45 | model_coef | 0.000043 | _is_na_gender[T.True] |
| 46 | model_coef | -0.212989 | age_group[T.25-34] |
| 47 | model_coef | -0.440492 | age_group[T.35-44] |
| 48 | model_coef | -0.545653 | age_group[T.45+] |
| 49 | model_coef | -0.188196 | gender[Female] |
| 50 | model_coef | 0.181710 | gender[Male] |
| 51 | model_coef | 0.000043 | gender[_NA] |
| 52 | model_coef | -0.570551 | income |
In this setting, income was treated as a numeric variable, with no transformation (e.g., bucketing) applied to it. Regardless of the transformations, building the model matrix turns gender and age_group into dummy variables (including a column for NA).
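As a rough illustration of what "dummy variables with a column for NA" means, the sketch below one-hot encodes a categorical list in plain Python, treating missing values as their own level. It is only an analogy: the real model matrix is built by the package, and may drop a reference level (the `T.` columns in the tables). The function name `dummy_encode` and the `_NA` default label are illustrative choices.

```python
def dummy_encode(values, na_label="_NA"):
    """One-hot encode a categorical list, treating None as its own level.

    Hypothetical helper for illustration only; not part of balance or patsy.
    """
    # Replace missing values with an explicit level so they get a column
    filled = [na_label if v is None else v for v in values]
    levels = sorted(set(filled))
    # One row per observation, one 0/1 column per level
    rows = [[1 if v == level else 0 for level in levels] for v in filled]
    return levels, rows


gender = ["Male", "Female", None, "Female"]
levels, rows = dummy_encode(gender)
for level, column in zip(levels, zip(*rows)):
    print(f"gender[{level}]", list(column))
```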
Next we can fit a simple transformation.
Let's say we wanted to lump together the age_group levels that each make up less than 25% of the data, and use a different bucketing for income. Here is how we'd do it:
from balance.util import fct_lump, quantize
transformations = {
"age_group": lambda x: fct_lump(x, 0.25),
"gender": lambda x: x,
"income": lambda x: quantize(x.fillna(x.mean()), q=3),
}
adjusted = sample_with_target.adjust(
# method="ipw",
transformations=transformations,
# formula=formula,
# penalty_factor=penalty_factor,
# max_de=None,
)
adj_diag = adjusted.diagnostics()
adj_diag.query("metric == 'model_coef'")
INFO (2026-01-06 22:13:03,838) [ipw/ipw (line 622)]: Starting ipw function
INFO (2026-01-06 22:13:03,840) [adjustment/apply_transformations (line 469)]: Adding the variables: []
INFO (2026-01-06 22:13:03,841) [adjustment/apply_transformations (line 470)]: Transforming the variables: ['age_group', 'gender', 'income']
INFO (2026-01-06 22:13:03,849) [adjustment/apply_transformations (line 507)]: Final variables in output: ['age_group', 'gender', 'income']
INFO (2026-01-06 22:13:03,858) [ipw/ipw (line 656)]: Building model matrix
INFO (2026-01-06 22:13:03,954) [ipw/ipw (line 678)]: The formula used to build the model matrix: ['income + gender + age_group + _is_na_gender']
INFO (2026-01-06 22:13:03,955) [ipw/ipw (line 681)]: The number of columns in the model matrix: 8
INFO (2026-01-06 22:13:03,956) [ipw/ipw (line 682)]: The number of rows in the model matrix: 11000
INFO (2026-01-06 22:13:18,203) [ipw/ipw (line 843)]: Done with sklearn
INFO (2026-01-06 22:13:18,204) [ipw/ipw (line 845)]: max_de: None
INFO (2026-01-06 22:13:18,205) [ipw/ipw (line 867)]: Starting model selection
INFO (2026-01-06 22:13:18,207) [ipw/ipw (line 900)]: Chosen lambda: 0.11811067639400605
INFO (2026-01-06 22:13:18,208) [ipw/ipw (line 918)]: Proportion null deviance explained 0.09420328935831213
WARNING (2026-01-06 22:13:18,209) [ipw/ipw (line 926)]: The propensity model has low fraction null deviance explained (0.09420328935831213). Results may not be accurate
INFO (2026-01-06 22:13:18,213) [sample_class/diagnostics (line 1406)]: Starting computation of diagnostics of the fitting
INFO (2026-01-06 22:13:18,481) [sample_class/diagnostics (line 1627)]: Done computing diagnostics
| | metric | val | var |
|---|---|---|---|
| 44 | model_coef | -0.327112 | intercept |
| 45 | model_coef | 0.031737 | _is_na_gender[T.True] |
| 46 | model_coef | -0.181989 | age_group[T.35-44] |
| 47 | model_coef | 0.098871 | age_group[T._lumped_other] |
| 48 | model_coef | 0.241586 | gender[T.Male] |
| 49 | model_coef | 0.031737 | gender[T._NA] |
| 50 | model_coef | 0.199260 | income[Interval(-0.0009997440000000001, 4.194,... |
| 51 | model_coef | -0.283084 | income[Interval(13.693, 128.536, closed='right')] |
| 52 | model_coef | 0.048146 | income[Interval(4.194, 13.693, closed='right')] |
As we can see, we changed the bucketing of income so that it has only 3 buckets, and we collapsed the "small" age_group levels together into the _lumped_other bucket.
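The idea of lumping rare levels can be sketched in a few lines of plain Python. This is not the package's `fct_lump`; the function name `lump_rare_levels`, the toy data, and the choice to keep levels whose share is at least `prop` (an inclusive threshold) are all assumptions made for illustration.

```python
from collections import Counter


def lump_rare_levels(values, prop, other="_lumped_other"):
    """Replace levels whose share of the data is below `prop` with `other`.

    Hypothetical helper for illustration only; not balance's fct_lump.
    """
    counts = Counter(values)
    n = len(values)
    # Keep levels that make up at least `prop` of the observations
    keep = {level for level, count in counts.items() if count / n >= prop}
    return [v if v in keep else other for v in values]


# Toy data (invented): two levels at 20% each, two levels at 30% each
age_group = ["18-24"] * 2 + ["25-34"] * 3 + ["35-44"] * 3 + ["45+"] * 2
lumped = lump_rare_levels(age_group, prop=0.25)
print(Counter(lumped))
```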
Lastly, notice that if we omit a variable from transformations, it will not be available for the model construction (This behavior might change in the future).
transformations = {
# "age_group": lambda x: fct_lump(x, 0.25),
"gender": lambda x: x,
# "income": lambda x: quantize(x.fillna(x.mean()), q=3),
}
adjusted = sample_with_target.adjust(
# method="ipw",
transformations=transformations,
# formula=formula,
# penalty_factor=penalty_factor,
# max_de=None,
)
adj_diag = adjusted.diagnostics()
adj_diag.query("metric == 'model_coef'")
INFO (2026-01-06 22:13:18,499) [ipw/ipw (line 622)]: Starting ipw function
INFO (2026-01-06 22:13:18,502) [adjustment/apply_transformations (line 469)]: Adding the variables: []
INFO (2026-01-06 22:13:18,502) [adjustment/apply_transformations (line 470)]: Transforming the variables: ['gender']
WARNING (2026-01-06 22:13:18,503) [adjustment/apply_transformations (line 504)]: Dropping the variables: ['age_group', 'income']
INFO (2026-01-06 22:13:18,504) [adjustment/apply_transformations (line 507)]: Final variables in output: ['gender']
INFO (2026-01-06 22:13:18,507) [ipw/ipw (line 656)]: Building model matrix
INFO (2026-01-06 22:13:18,545) [ipw/ipw (line 678)]: The formula used to build the model matrix: ['gender + _is_na_gender']
INFO (2026-01-06 22:13:18,546) [ipw/ipw (line 681)]: The number of columns in the model matrix: 4
INFO (2026-01-06 22:13:18,547) [ipw/ipw (line 682)]: The number of rows in the model matrix: 11000
INFO (2026-01-06 22:13:27,572) [ipw/ipw (line 843)]: Done with sklearn
INFO (2026-01-06 22:13:27,572) [ipw/ipw (line 845)]: max_de: None
INFO (2026-01-06 22:13:27,573) [ipw/ipw (line 867)]: Starting model selection
INFO (2026-01-06 22:13:27,576) [ipw/ipw (line 900)]: Chosen lambda: 0.20570886693214954
INFO (2026-01-06 22:13:27,577) [ipw/ipw (line 918)]: Proportion null deviance explained 0.0264854344569313
WARNING (2026-01-06 22:13:27,578) [ipw/ipw (line 926)]: The propensity model has low fraction null deviance explained (0.0264854344569313). Results may not be accurate
INFO (2026-01-06 22:13:27,582) [sample_class/diagnostics (line 1406)]: Starting computation of diagnostics of the fitting
INFO (2026-01-06 22:13:27,848) [sample_class/diagnostics (line 1627)]: Done computing diagnostics
| | metric | val | var |
|---|---|---|---|
| 44 | model_coef | -0.035465 | intercept |
| 45 | model_coef | 0.001051 | _is_na_gender[T.True] |
| 46 | model_coef | -0.141695 | gender[Female] |
| 47 | model_coef | 0.136225 | gender[Male] |
| 48 | model_coef | 0.001051 | gender[_NA] |
As we can see, only gender was included in the model.
# TODO: add more examples about how add_na works
# TODO: add more examples about rare values in categorical variables and how they are grouped together.
Creating new variables
In the next example we will create several new transformations of income.
The INFO log messages report which variables were added, which were transformed, and which variables end up in the final output.
The x in the lambda function can have one of two meanings:
- When the keys in the dict match the exact names of the variables in the DataFrame (e.g.: "income"), then the lambda function treats x as the pandas.Series of that variable.
- If the name of the key does NOT exist in the DataFrame (e.g.: "income_squared"), then x will become the DataFrame of the data.
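The two meanings of x described above can be mimicked with a small plain-Python sketch, using a dict of lists as a stand-in for a DataFrame. The function `apply_transformations_sketch` and the toy table are hypothetical; this is an analogy for the dispatch rule, not the package's code.

```python
def apply_transformations_sketch(table, transformations):
    """Apply per-key functions to a dict-of-lists 'table'.

    Hypothetical stand-in for illustration only; not balance's implementation.
    """
    out = {}
    for name, func in transformations.items():
        if name in table:
            # Key matches an existing column: func receives that column only
            out[name] = func(table[name])
        else:
            # Key is new: func receives the whole table and builds a new column
            out[name] = func(table)
    # Columns that are not keys of `transformations` are left out of the result
    return out


table = {
    "income": [1.0, 2.0, 3.0],
    "gender": ["Male", "Female", "Male"],
    "age_group": ["18-24", "25-34", "45+"],
}
transformations = {
    "income": lambda x: x,  # x is the income column
    "gender": lambda x: x,  # x is the gender column
    "income_squared": lambda x: [v**2 for v in x["income"]],  # x is the table
}
result = apply_transformations_sketch(table, transformations)
print(sorted(result))
```

Note that age_group is absent from the result, mirroring the dropping behavior shown in the previous example.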
from balance.util import fct_lump, quantize
transformations = {
"age_group": lambda x: x,
"gender": lambda x: x,
"income": lambda x: x,
"income_squared": lambda x: x.income**2,
"income_buckets": lambda x: quantize(x.income.fillna(x.income.mean()), q=3),
}
adjusted = sample_with_target.adjust(
# method="ipw",
transformations=transformations,
# formula=formula,
# penalty_factor=penalty_factor,
# max_de=None,
)
adj_diag = adjusted.diagnostics()
adj_diag.query("metric == 'model_coef'")
INFO (2026-01-06 22:13:27,870) [ipw/ipw (line 622)]: Starting ipw function
INFO (2026-01-06 22:13:27,873) [adjustment/apply_transformations (line 469)]: Adding the variables: ['income_squared', 'income_buckets']
INFO (2026-01-06 22:13:27,873) [adjustment/apply_transformations (line 470)]: Transforming the variables: ['age_group', 'gender', 'income']
INFO (2026-01-06 22:13:27,879) [adjustment/apply_transformations (line 507)]: Final variables in output: ['income_squared', 'income_buckets', 'age_group', 'gender', 'income']
INFO (2026-01-06 22:13:27,893) [ipw/ipw (line 656)]: Building model matrix
INFO (2026-01-06 22:13:27,993) [ipw/ipw (line 678)]: The formula used to build the model matrix: ['income_squared + income_buckets + income + gender + age_group + _is_na_gender']
INFO (2026-01-06 22:13:27,994) [ipw/ipw (line 681)]: The number of columns in the model matrix: 11
INFO (2026-01-06 22:13:27,994) [ipw/ipw (line 682)]: The number of rows in the model matrix: 11000
INFO (2026-01-06 22:13:50,100) [ipw/ipw (line 843)]: Done with sklearn
INFO (2026-01-06 22:13:50,101) [ipw/ipw (line 845)]: max_de: None
INFO (2026-01-06 22:13:50,102) [ipw/ipw (line 867)]: Starting model selection
INFO (2026-01-06 22:13:50,105) [ipw/ipw (line 900)]: Chosen lambda: 0.043506507030756265
INFO (2026-01-06 22:13:50,106) [ipw/ipw (line 918)]: Proportion null deviance explained 0.17223252304635372
INFO (2026-01-06 22:13:50,110) [sample_class/diagnostics (line 1406)]: Starting computation of diagnostics of the fitting
INFO (2026-01-06 22:13:50,369) [sample_class/diagnostics (line 1627)]: Done computing diagnostics
| | metric | val | var |
|---|---|---|---|
| 44 | model_coef | 0.417450 | intercept |
| 45 | model_coef | 0.044654 | _is_na_gender[T.True] |
| 46 | model_coef | -0.194436 | age_group[T.25-34] |
| 47 | model_coef | -0.421207 | age_group[T.35-44] |
| 48 | model_coef | -0.521732 | age_group[T.45+] |
| 49 | model_coef | 0.325806 | gender[T.Male] |
| 50 | model_coef | 0.044654 | gender[T._NA] |
| 51 | model_coef | -0.266020 | income |
| 52 | model_coef | 0.113306 | income_buckets[Interval(-0.0009997440000000001... |
| 53 | model_coef | -0.166366 | income_buckets[Interval(13.693, 128.536, close... |
| 54 | model_coef | 0.032296 | income_buckets[Interval(4.194, 13.693, closed=... |
| 55 | model_coef | -0.185931 | income_squared |
Formula
The formula can accept a list of strings indicating how to combine the transformed variables together. It follows the formula notation from patsy.
For example, we can have an interaction between age_group and gender:
from balance.util import fct_lump_by, quantize
transformations = {
"age_group": lambda x: x,
"gender": lambda x: x,
"income": lambda x: quantize(x.fillna(x.mean()), q=20),
}
formula = ["age_group * gender"]
# the penalty is per element in the formula list:
# penalty_factor = [0.1, 0.1, 0.1]
adjusted = sample_with_target.adjust(
method="ipw",
transformations=transformations,
formula=formula,
# penalty_factor=penalty_factor,
# max_de=None,
)
adj_diag = adjusted.diagnostics()
adj_diag.query("metric == 'model_coef'")
INFO (2026-01-06 22:13:50,388) [ipw/ipw (line 622)]: Starting ipw function
INFO (2026-01-06 22:13:50,390) [adjustment/apply_transformations (line 469)]: Adding the variables: []
INFO (2026-01-06 22:13:50,391) [adjustment/apply_transformations (line 470)]: Transforming the variables: ['age_group', 'gender', 'income']
INFO (2026-01-06 22:13:50,396) [adjustment/apply_transformations (line 507)]: Final variables in output: ['age_group', 'gender', 'income']
INFO (2026-01-06 22:13:50,405) [ipw/ipw (line 656)]: Building model matrix
INFO (2026-01-06 22:13:50,470) [ipw/ipw (line 678)]: The formula used to build the model matrix: ['age_group * gender']
INFO (2026-01-06 22:13:50,471) [ipw/ipw (line 681)]: The number of columns in the model matrix: 12
INFO (2026-01-06 22:13:50,471) [ipw/ipw (line 682)]: The number of rows in the model matrix: 11000
INFO (2026-01-06 22:14:08,437) [ipw/ipw (line 843)]: Done with sklearn
INFO (2026-01-06 22:14:08,438) [ipw/ipw (line 845)]: max_de: None
INFO (2026-01-06 22:14:08,439) [ipw/ipw (line 867)]: Starting model selection
INFO (2026-01-06 22:14:08,442) [ipw/ipw (line 900)]: Chosen lambda: 0.0894967426547247
INFO (2026-01-06 22:14:08,443) [ipw/ipw (line 918)]: Proportion null deviance explained 0.11496302030801142
INFO (2026-01-06 22:14:08,447) [sample_class/diagnostics (line 1406)]: Starting computation of diagnostics of the fitting
INFO (2026-01-06 22:14:08,710) [sample_class/diagnostics (line 1627)]: Done computing diagnostics
| | metric | val | var |
|---|---|---|---|
| 44 | model_coef | -0.348132 | intercept |
| 45 | model_coef | 0.334726 | age_group[18-24] |
| 46 | model_coef | 0.005414 | age_group[25-34] |
| 47 | model_coef | -0.160449 | age_group[35-44] |
| 48 | model_coef | -0.280501 | age_group[45+] |
| 49 | model_coef | 0.032004 | age_group[T.25-34]:gender[T.Male] |
| 50 | model_coef | 0.016476 | age_group[T.25-34]:gender[T._NA] |
| 51 | model_coef | -0.037461 | age_group[T.35-44]:gender[T.Male] |
| 52 | model_coef | -0.046181 | age_group[T.35-44]:gender[T._NA] |
| 53 | model_coef | -0.031939 | age_group[T.45+]:gender[T.Male] |
| 54 | model_coef | -0.042359 | age_group[T.45+]:gender[T._NA] |
| 55 | model_coef | 0.271811 | gender[T.Male] |
| 56 | model_coef | 0.062330 | gender[T._NA] |
As we can see, the formula gives us the combinations (interactions) of age_group and gender, as well as the main effects of age_group and gender. Since income was not in the formula, it is not included in the model.
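In patsy-style formula notation, `a * b` is shorthand for `a + b + a:b`, which is why both main effects and interaction columns appear above. The tiny helper below (a hypothetical `expand_star`, handling only the two-factor case) spells out that expansion.

```python
def expand_star(term):
    """Expand a two-factor 'a * b' term into main effects plus interaction.

    Hypothetical helper for illustration only; real formula parsing is done
    by the formula library, and handles many more cases.
    """
    factors = [f.strip() for f in term.split("*")]
    if len(factors) != 2:
        raise ValueError("this sketch only handles exactly two factors")
    a, b = factors
    return [a, b, f"{a}:{b}"]


terms = expand_star("age_group * gender")
print(" + ".join(terms))
```

To include only the interaction without the main effects, the formula notation uses `:` directly (e.g. `age_group:gender`).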
Formula and penalty_factor
The formula can be provided as a list of several strings, and then the penalty factor indicates how much the model should focus on adjusting for each element of the formula. A larger penalty factor means that element will be corrected less.
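For intuition on why a larger penalty factor leads to less correction, consider soft-thresholding, a standard building block of L1-penalized (lasso) fitting. The sketch below assumes the penalty factor multiplies the overall regularization strength for that term's coefficients (the glmnet-style convention); whether balance implements it exactly this way internally is an assumption, and all numbers are invented.

```python
def soft_threshold(z, threshold):
    """Shrink z toward zero by `threshold`, clipping at zero."""
    if z > threshold:
        return z - threshold
    if z < -threshold:
        return z + threshold
    return 0.0


lam = 0.05  # overall regularization strength (illustrative value)
# Hypothetical unpenalized coefficients, one per formula element
unpenalized = {"age_group + gender": 0.4, "income": -0.4}
penalty_factor = {"age_group + gender": 10, "income": 0.1}

# Each coefficient is shrunk by lam times its own penalty factor
shrunk = {
    term: soft_threshold(coef, lam * penalty_factor[term])
    for term, coef in unpenalized.items()
}
print(shrunk)
```

In this toy setup the heavily penalized term is shrunk all the way to zero, while the lightly penalized term keeps almost all of its magnitude.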
The next two examples show how in one case we focus on correcting for income, and in the second case we focus on correcting for age and gender.
transformations = {
"age_group": lambda x: x,
"gender": lambda x: x,
"income": lambda x: x,
}
formula = ["age_group + gender", "income"]
# the penalty is per element in the formula list:
penalty_factor = [10, 0.1]
adjusted = sample_with_target.adjust(
method="ipw",
transformations=transformations,
formula=formula,
penalty_factor=penalty_factor,
# max_de=None,
)
adj_diag = adjusted.diagnostics()
adj_diag.query("metric == 'model_coef'")
INFO (2026-01-06 22:14:08,728) [ipw/ipw (line 622)]: Starting ipw function
INFO (2026-01-06 22:14:08,730) [adjustment/apply_transformations (line 469)]: Adding the variables: []
INFO (2026-01-06 22:14:08,730) [adjustment/apply_transformations (line 470)]: Transforming the variables: ['age_group', 'gender', 'income']
INFO (2026-01-06 22:14:08,732) [adjustment/apply_transformations (line 507)]: Final variables in output: ['age_group', 'gender', 'income']
INFO (2026-01-06 22:14:08,739) [ipw/ipw (line 656)]: Building model matrix
INFO (2026-01-06 22:14:08,805) [ipw/ipw (line 678)]: The formula used to build the model matrix: ['age_group + gender', 'income']
INFO (2026-01-06 22:14:08,806) [ipw/ipw (line 681)]: The number of columns in the model matrix: 7
INFO (2026-01-06 22:14:08,806) [ipw/ipw (line 682)]: The number of rows in the model matrix: 11000
INFO (2026-01-06 22:14:37,216) [ipw/ipw (line 843)]: Done with sklearn
INFO (2026-01-06 22:14:37,217) [ipw/ipw (line 845)]: max_de: None
INFO (2026-01-06 22:14:37,217) [ipw/ipw (line 867)]: Starting model selection
INFO (2026-01-06 22:14:37,221) [ipw/ipw (line 900)]: Chosen lambda: 0.0009460271806598614
INFO (2026-01-06 22:14:37,221) [ipw/ipw (line 918)]: Proportion null deviance explained 0.17289162681789683
INFO (2026-01-06 22:14:37,225) [sample_class/diagnostics (line 1406)]: Starting computation of diagnostics of the fitting
INFO (2026-01-06 22:14:37,485) [sample_class/diagnostics (line 1627)]: Done computing diagnostics
| | metric | val | var |
|---|---|---|---|
| 44 | model_coef | 0.243384 | intercept |
| 45 | model_coef | 3.238671 | age_group[18-24] |
| 46 | model_coef | 0.394707 | age_group[25-34] |
| 47 | model_coef | -1.759083 | age_group[35-44] |
| 48 | model_coef | -2.919449 | age_group[45+] |
| 49 | model_coef | 2.599977 | gender[T.Male] |
| 50 | model_coef | 0.487265 | gender[T._NA] |
| 51 | model_coef | -0.073744 | income |
The above example corrected mostly for income. As we can see, age and gender got essentially 0 correction (since their penalty was so high). Let's now over-correct for age and gender:
transformations = {
"age_group": lambda x: x,
"gender": lambda x: x,
"income": lambda x: x,
}
formula = ["age_group + gender", "income"]
# the penalty is per element in the formula list:
penalty_factor = [0.1, 10] # this is flipped
adjusted = sample_with_target.adjust(
method="ipw",
transformations=transformations,
formula=formula,
penalty_factor=penalty_factor,
# max_de=None,
)
adj_diag = adjusted.diagnostics()
adj_diag.query("metric == 'model_coef'")
INFO (2026-01-06 22:14:37,503) [ipw/ipw (line 622)]: Starting ipw function
INFO (2026-01-06 22:14:37,505) [adjustment/apply_transformations (line 469)]: Adding the variables: []
INFO (2026-01-06 22:14:37,506) [adjustment/apply_transformations (line 470)]: Transforming the variables: ['age_group', 'gender', 'income']
INFO (2026-01-06 22:14:37,507) [adjustment/apply_transformations (line 507)]: Final variables in output: ['age_group', 'gender', 'income']
INFO (2026-01-06 22:14:37,515) [ipw/ipw (line 656)]: Building model matrix
INFO (2026-01-06 22:14:37,582) [ipw/ipw (line 678)]: The formula used to build the model matrix: ['age_group + gender', 'income']
INFO (2026-01-06 22:14:37,583) [ipw/ipw (line 681)]: The number of columns in the model matrix: 7
INFO (2026-01-06 22:14:37,584) [ipw/ipw (line 682)]: The number of rows in the model matrix: 11000
INFO (2026-01-06 22:15:03,978) [ipw/ipw (line 843)]: Done with sklearn
INFO (2026-01-06 22:15:03,979) [ipw/ipw (line 845)]: max_de: None
INFO (2026-01-06 22:15:03,979) [ipw/ipw (line 867)]: Starting model selection
INFO (2026-01-06 22:15:03,983) [ipw/ipw (line 900)]: Chosen lambda: 0.0012484913627071772
INFO (2026-01-06 22:15:03,983) [ipw/ipw (line 918)]: Proportion null deviance explained 0.17058848391571657
INFO (2026-01-06 22:15:03,987) [sample_class/diagnostics (line 1406)]: Starting computation of diagnostics of the fitting
INFO (2026-01-06 22:15:04,251) [sample_class/diagnostics (line 1627)]: Done computing diagnostics
| | metric | val | var |
|---|---|---|---|
| 44 | model_coef | -0.436751 | intercept |
| 45 | model_coef | 0.053355 | age_group[18-24] |
| 46 | model_coef | 0.014525 | age_group[25-34] |
| 47 | model_coef | -0.014981 | age_group[35-44] |
| 48 | model_coef | -0.037504 | age_group[45+] |
| 49 | model_coef | 0.041851 | gender[T.Male] |
| 50 | model_coef | 0.011851 | gender[T._NA] |
| 51 | model_coef | -3.824697 | income |
In the above case, income basically got 0 correction.
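Since one formula element can generate several model-matrix columns, a per-element penalty has to be spread over those columns. The sketch below assumes each element's penalty is simply repeated for every column it produces (an assumption about the internals); the helper `expand_penalties` is hypothetical, and the column names are copied from the table above.

```python
def expand_penalties(columns_per_term, penalty_factor):
    """Repeat each formula element's penalty for all of its columns.

    Hypothetical helper for illustration only; not part of balance.
    """
    if len(columns_per_term) != len(penalty_factor):
        raise ValueError("need exactly one penalty per formula element")
    expanded = {}
    for (term, columns), penalty in zip(columns_per_term.items(), penalty_factor):
        for column in columns:
            expanded[column] = penalty
    return expanded


# The 7 model-matrix columns from the example above, grouped by formula element
columns_per_term = {
    "age_group + gender": [
        "age_group[18-24]", "age_group[25-34]", "age_group[35-44]",
        "age_group[45+]", "gender[T.Male]", "gender[T._NA]",
    ],
    "income": ["income"],
}
per_column = expand_penalties(columns_per_term, [0.1, 10])
for column, penalty in per_column.items():
    print(column, penalty)
```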
We can add two versions of income, and give each of them a higher penalty than age and gender:
from balance.util import fct_lump_by, quantize
transformations = {
"age_group": lambda x: x,
"gender": lambda x: x,
"income": lambda x: x,
"income_buckets": lambda x: quantize(x.income.fillna(x.income.mean()), q=4),
}
formula = ["age_group + gender", "income", "income_buckets"]
# the penalty is per element in the formula list:
penalty_factor = [1, 2, 2]
adjusted = sample_with_target.adjust(
method="ipw",
transformations=transformations,
formula=formula,
penalty_factor=penalty_factor,
# max_de=None,
)
adj_diag = adjusted.diagnostics()
adj_diag.query("metric == 'model_coef'")
INFO (2026-01-06 22:15:04,269) [ipw/ipw (line 622)]: Starting ipw function
INFO (2026-01-06 22:15:04,271) [adjustment/apply_transformations (line 469)]: Adding the variables: ['income_buckets']
INFO (2026-01-06 22:15:04,272) [adjustment/apply_transformations (line 470)]: Transforming the variables: ['age_group', 'gender', 'income']
INFO (2026-01-06 22:15:04,277) [adjustment/apply_transformations (line 507)]: Final variables in output: ['income_buckets', 'age_group', 'gender', 'income']
INFO (2026-01-06 22:15:04,289) [ipw/ipw (line 656)]: Building model matrix
INFO (2026-01-06 22:15:04,386) [ipw/ipw (line 678)]: The formula used to build the model matrix: ['age_group + gender', 'income', 'income_buckets']
INFO (2026-01-06 22:15:04,387) [ipw/ipw (line 681)]: The number of columns in the model matrix: 11
INFO (2026-01-06 22:15:04,388) [ipw/ipw (line 682)]: The number of rows in the model matrix: 11000
INFO (2026-01-06 22:15:24,759) [ipw/ipw (line 843)]: Done with sklearn
INFO (2026-01-06 22:15:24,759) [ipw/ipw (line 845)]: max_de: None
INFO (2026-01-06 22:15:24,760) [ipw/ipw (line 867)]: Starting model selection
INFO (2026-01-06 22:15:24,763) [ipw/ipw (line 900)]: Chosen lambda: 0.043506507030756265
INFO (2026-01-06 22:15:24,764) [ipw/ipw (line 918)]: Proportion null deviance explained 0.17297124065566938
INFO (2026-01-06 22:15:24,768) [sample_class/diagnostics (line 1406)]: Starting computation of diagnostics of the fitting
INFO (2026-01-06 22:15:25,034) [sample_class/diagnostics (line 1627)]: Done computing diagnostics
| | metric | val | var |
|---|---|---|---|
| 44 | model_coef | -0.286191 | intercept |
| 45 | model_coef | 0.378645 | age_group[18-24] |
| 46 | model_coef | 0.047595 | age_group[25-34] |
| 47 | model_coef | -0.199745 | age_group[35-44] |
| 48 | model_coef | -0.350982 | age_group[45+] |
| 49 | model_coef | 0.321907 | gender[T.Male] |
| 50 | model_coef | 0.074456 | gender[T._NA] |
| 51 | model_coef | -0.418356 | income |
| 52 | model_coef | 0.216575 | income_buckets[Interval(-0.0009997440000000001... |
| 53 | model_coef | -0.366924 | income_buckets[Interval(17.694, 128.536, close... |
| 54 | model_coef | 0.129850 | income_buckets[Interval(2.53, 8.211, closed='r... |
| 55 | model_coef | -0.051010 | income_buckets[Interval(8.211, 17.694, closed=... |
Another way is to create a formula for several variations of each variable, and give each a penalty of 1. For example:
from balance.util import fct_lump_by, quantize
transformations = {
"age_group": lambda x: x,
"gender": lambda x: x,
"income": lambda x: x,
"income_buckets": lambda x: quantize(x.income.fillna(x.income.mean()), q=4),
}
formula = ["age_group", "gender", "income + income_buckets"]
# the penalty is per element in the formula list:
penalty_factor = [1, 1, 1]
adjusted = sample_with_target.adjust(
method="ipw",
transformations=transformations,
formula=formula,
penalty_factor=penalty_factor,
# max_de=None,
)
adj_diag = adjusted.diagnostics()
adj_diag.query("metric == 'model_coef'")
INFO (2026-01-06 22:15:25,053) [ipw/ipw (line 622)]: Starting ipw function
INFO (2026-01-06 22:15:25,055) [adjustment/apply_transformations (line 469)]: Adding the variables: ['income_buckets']
INFO (2026-01-06 22:15:25,056) [adjustment/apply_transformations (line 470)]: Transforming the variables: ['age_group', 'gender', 'income']
INFO (2026-01-06 22:15:25,061) [adjustment/apply_transformations (line 507)]: Final variables in output: ['income_buckets', 'age_group', 'gender', 'income']
INFO (2026-01-06 22:15:25,073) [ipw/ipw (line 656)]: Building model matrix
INFO (2026-01-06 22:15:25,174) [ipw/ipw (line 678)]: The formula used to build the model matrix: ['age_group', 'gender', 'income + income_buckets']
INFO (2026-01-06 22:15:25,175) [ipw/ipw (line 681)]: The number of columns in the model matrix: 12
INFO (2026-01-06 22:15:25,176) [ipw/ipw (line 682)]: The number of rows in the model matrix: 11000
INFO (2026-01-06 22:15:41,535) [ipw/ipw (line 843)]: Done with sklearn
INFO (2026-01-06 22:15:41,536) [ipw/ipw (line 845)]: max_de: None
INFO (2026-01-06 22:15:41,537) [ipw/ipw (line 867)]: Starting model selection
INFO (2026-01-06 22:15:41,540) [ipw/ipw (line 900)]: Chosen lambda: 0.0894967426547247
INFO (2026-01-06 22:15:41,540) [ipw/ipw (line 918)]: Proportion null deviance explained 0.17297566853458446
INFO (2026-01-06 22:15:41,545) [sample_class/diagnostics (line 1406)]: Starting computation of diagnostics of the fitting
INFO (2026-01-06 22:15:41,813) [sample_class/diagnostics (line 1627)]: Done computing diagnostics
| | metric | val | var |
|---|---|---|---|
| 44 | model_coef | 0.084942 | intercept |
| 45 | model_coef | 0.327596 | age_group[18-24] |
| 46 | model_coef | 0.039282 | age_group[25-34] |
| 47 | model_coef | -0.176394 | age_group[35-44] |
| 48 | model_coef | -0.296639 | age_group[45+] |
| 49 | model_coef | -0.163857 | gender[Female] |
| 50 | model_coef | 0.158443 | gender[Male] |
| 51 | model_coef | -0.000375 | gender[_NA] |
| 52 | model_coef | -0.258364 | income |
| 53 | model_coef | 0.122028 | income_buckets[Interval(-0.0009997440000000001... |
| 54 | model_coef | -0.213971 | income_buckets[Interval(17.694, 128.536, close... |
| 55 | model_coef | 0.076064 | income_buckets[Interval(2.53, 8.211, closed='r... |
| 56 | model_coef | -0.025416 | income_buckets[Interval(8.211, 17.694, closed=... |
# Defaults from the package
adjusted = sample_with_target.adjust(
# max_de=None,
)
print(adjusted.summary())
print(adjusted.outcomes().summary())
adjusted.covars().plot(library="seaborn", dist_type="kde")
INFO (2026-01-06 22:15:41,831) [ipw/ipw (line 622)]: Starting ipw function
INFO (2026-01-06 22:15:41,834) [adjustment/apply_transformations (line 469)]: Adding the variables: []
INFO (2026-01-06 22:15:41,834) [adjustment/apply_transformations (line 470)]: Transforming the variables: ['gender', 'age_group', 'income']
INFO (2026-01-06 22:15:41,844) [adjustment/apply_transformations (line 507)]: Final variables in output: ['gender', 'age_group', 'income']
INFO (2026-01-06 22:15:41,852) [ipw/ipw (line 656)]: Building model matrix
INFO (2026-01-06 22:15:41,949) [ipw/ipw (line 678)]: The formula used to build the model matrix: ['income + gender + age_group + _is_na_gender']
INFO (2026-01-06 22:15:41,949) [ipw/ipw (line 681)]: The number of columns in the model matrix: 16
INFO (2026-01-06 22:15:41,950) [ipw/ipw (line 682)]: The number of rows in the model matrix: 11000
INFO (2026-01-06 22:15:59,230) [ipw/ipw (line 843)]: Done with sklearn
INFO (2026-01-06 22:15:59,231) [ipw/ipw (line 845)]: max_de: None
INFO (2026-01-06 22:15:59,232) [ipw/ipw (line 867)]: Starting model selection
INFO (2026-01-06 22:15:59,235) [ipw/ipw (line 900)]: Chosen lambda: 0.041158338186664825
INFO (2026-01-06 22:15:59,236) [ipw/ipw (line 918)]: Proportion null deviance explained 0.17265121909892267
Adjustment details:
method: ipw
weight trimming mean ratio: 20
Covariate diagnostics:
Covar ASMD reduction: 63.4%
Covar ASMD (7 variables): 0.327 -> 0.119
Covar mean KLD reduction: 95.3%
Covar mean KLD (3 variables): 0.071 -> 0.003
Weight diagnostics:
design effect (Deff): 1.880
effective sample size proportion (ESSP): 0.532
effective sample size (ESS): 531.8
Outcome weighted means:
happiness
source
self 53.297
target 56.278
unadjusted 48.559
Model performance: Model proportion deviance explained: 0.173
1 outcomes: ['happiness']
Mean outcomes (with 95% confidence intervals):
source self target unadjusted self_ci target_ci unadjusted_ci
happiness 53.297 56.278 48.559 (52.097, 54.496) (55.961, 56.595) (47.669, 49.449)
Response rates (relative to number of respondents in sample):
happiness
n 1000.0
% 100.0
Response rates (relative to notnull rows in the target):
happiness
n 1000.0
% 10.0
Response rates (in the target):
happiness
n 10000.0
% 100.0
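The weight diagnostics in the summary above (Deff, ESSP, ESS) are linked by simple identities. The following is a minimal sketch in plain Python, assuming the reported design effect is Kish's approximation; `kish_design_effect` and `effective_sample_size` are illustrative helper names, not functions from balance.

```python
def kish_design_effect(weights):
    """Kish's approximate design effect: n * sum(w^2) / (sum(w))^2."""
    n = len(weights)
    total = sum(weights)
    return n * sum(w * w for w in weights) / (total * total)


def effective_sample_size(weights):
    """ESS = n / Deff; the ESS proportion (ESSP) is then ESS / n = 1 / Deff."""
    return len(weights) / kish_design_effect(weights)


# Equal weights carry no penalty: Deff is 1 and ESS equals n.
equal_weights = [1.0, 1.0, 1.0, 1.0]
# One dominant weight inflates the variance: Deff grows and ESS shrinks.
unequal_weights = [1.0, 1.0, 1.0, 5.0]

for label, w in [("equal", equal_weights), ("unequal", unequal_weights)]:
    print(label, kish_design_effect(w), effective_sample_size(w))
```

Under the same identities, n = 1000 and Deff = 1.880 give an ESS of roughly 532 and an ESSP of roughly 0.532, in line with the summary above up to rounding.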
# No transformations at all
# transformations = None is just like using:
# transformations = {
# "age_group": lambda x: x,
# "gender": lambda x: x,
# "income": lambda x: x,
# }
adjusted = sample_with_target.adjust(
method="ipw",
transformations=None,
# formula=formula,
# penalty_factor=penalty_factor,
# max_de=None,
)
print(adjusted.summary())
print(adjusted.outcomes().summary())
adjusted.covars().plot(library="seaborn", dist_type="kde")
# Compared with the defaults: a slightly larger design effect (2.087 vs. 1.880) and a slightly better ASMD reduction (68.5% vs. 63.4%).
INFO (2026-01-06 22:16:00,782) [ipw/ipw (line 622)]: Starting ipw function
INFO (2026-01-06 22:16:00,784) [ipw/ipw (line 656)]: Building model matrix
INFO (2026-01-06 22:16:00,851) [ipw/ipw (line 678)]: The formula used to build the model matrix: ['income + gender + age_group + _is_na_gender']
INFO (2026-01-06 22:16:00,852) [ipw/ipw (line 681)]: The number of columns in the model matrix: 8
INFO (2026-01-06 22:16:00,852) [ipw/ipw (line 682)]: The number of rows in the model matrix: 11000
INFO (2026-01-06 22:16:16,876) [ipw/ipw (line 843)]: Done with sklearn
INFO (2026-01-06 22:16:16,877) [ipw/ipw (line 845)]: max_de: None
INFO (2026-01-06 22:16:16,878) [ipw/ipw (line 867)]: Starting model selection
INFO (2026-01-06 22:16:16,881) [ipw/ipw (line 900)]: Chosen lambda: 0.0368353720078807
INFO (2026-01-06 22:16:16,882) [ipw/ipw (line 918)]: Proportion null deviance explained 0.17353587606008936
Adjustment details:
method: ipw
weight trimming mean ratio: 20
Covariate diagnostics:
Covar ASMD reduction: 68.5%
Covar ASMD (7 variables): 0.327 -> 0.103
Covar mean KLD reduction: 96.8%
Covar mean KLD (3 variables): 0.071 -> 0.002
Weight diagnostics:
design effect (Deff): 2.087
effective sample size proportion (ESSP): 0.479
effective sample size (ESS): 479.2
Outcome weighted means:
happiness
source
self 53.731
target 56.278
unadjusted 48.559
Model performance: Model proportion deviance explained: 0.174
1 outcomes: ['happiness']
Mean outcomes (with 95% confidence intervals):
source self target unadjusted self_ci target_ci unadjusted_ci
happiness 53.731 56.278 48.559 (52.513, 54.949) (55.961, 56.595) (47.669, 49.449)
Response rates (relative to number of respondents in sample):
happiness
n 1000.0
% 100.0
Response rates (relative to notnull rows in the target):
happiness
n 1000.0
% 10.0
Response rates (in the target):
happiness
n 10000.0
% 100.0
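The "Covar ASMD" lines in these summaries report an absolute standardized mean difference before and after weighting, together with the relative reduction. The sketch below shows one way such numbers can be computed, assuming the difference in means is standardized by the target's standard deviation (the population standard deviation is used here for simplicity). The helper names are illustrative and are not balance's API.

```python
from statistics import pstdev


def weighted_mean(values, weights):
    """Weighted average of `values`."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def asmd(sample_values, sample_weights, target_values):
    """|weighted sample mean - target mean| / target standard deviation."""
    target_mean = sum(target_values) / len(target_values)
    diff = weighted_mean(sample_values, sample_weights) - target_mean
    return abs(diff) / pstdev(target_values)


def asmd_reduction(before, after):
    """Relative reduction in ASMD, in percent."""
    return 100.0 * (before - after) / before


# The rounded values printed in the summary above: 0.327 -> 0.103
print(round(asmd_reduction(0.327, 0.103), 1))  # 68.5
```

Because the summary prints rounded ASMD values, recomputing the reduction from them can differ from the reported percentage in the last digit.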
# No transformations at all
transformations = None
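# But passing a squared term of income to the formula.
# Note: in patsy-style formulas `**` is a formula operator, so `income**2` likely reduces to `income`;
# wrapping it as `I(income**2)` would request a literal squared column.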
formula = ["age_group + gender + income + income**2"]
# the penalty is per element in the list of formulas:
# penalty_factor = [1]
adjusted = sample_with_target.adjust(
method="ipw",
transformations=transformations,
formula=formula,
# penalty_factor=penalty_factor,
# max_de=None,
)
print(adjusted.summary())
print(adjusted.outcomes().summary())
adjusted.covars().plot(library="seaborn", dist_type="kde")
# Adding income**2 to the formula led to lower Deff but also lower ASMD reduction.
INFO (2026-01-06 22:16:18,405) [ipw/ipw (line 622)]: Starting ipw function
INFO (2026-01-06 22:16:18,406) [ipw/ipw (line 656)]: Building model matrix
INFO (2026-01-06 22:16:18,473) [ipw/ipw (line 678)]: The formula used to build the model matrix: ['age_group + gender + income + income**2']
INFO (2026-01-06 22:16:18,473) [ipw/ipw (line 681)]: The number of columns in the model matrix: 7
INFO (2026-01-06 22:16:18,474) [ipw/ipw (line 682)]: The number of rows in the model matrix: 11000
INFO (2026-01-06 22:16:33,724) [ipw/ipw (line 843)]: Done with sklearn
INFO (2026-01-06 22:16:33,724) [ipw/ipw (line 845)]: max_de: None
INFO (2026-01-06 22:16:33,725) [ipw/ipw (line 867)]: Starting model selection
INFO (2026-01-06 22:16:33,728) [ipw/ipw (line 900)]: Chosen lambda: 0.0574164245593571
INFO (2026-01-06 22:16:33,729) [ipw/ipw (line 918)]: Proportion null deviance explained 0.17296221715396187
Adjustment details:
method: ipw
weight trimming mean ratio: 20
Covariate diagnostics:
Covar ASMD reduction: 60.7%
Covar ASMD (7 variables): 0.327 -> 0.128
Covar mean KLD reduction: 92.9%
Covar mean KLD (3 variables): 0.071 -> 0.005
Weight diagnostics:
design effect (Deff): 1.925
effective sample size proportion (ESSP): 0.519
effective sample size (ESS): 519.5
Outcome weighted means:
happiness
source
self 53.258
target 56.278
unadjusted 48.559
Model performance: Model proportion deviance explained: 0.173
1 outcomes: ['happiness']
Mean outcomes (with 95% confidence intervals):
source self target unadjusted self_ci target_ci unadjusted_ci
happiness 53.258 56.278 48.559 (52.072, 54.445) (55.961, 56.595) (47.669, 49.449)
Response rates (relative to number of respondents in sample):
happiness
n 1000.0
% 100.0
Response rates (relative to notnull rows in the target):
happiness
n 1000.0
% 10.0
Response rates (in the target):
happiness
n 10000.0
% 100.0
transformations = {
"age_group": lambda x: x,
"gender": lambda x: x,
"income": lambda x: x,
"income_buckets": lambda x: quantize(x.income.fillna(x.income.mean()), q=20),
}
formula = ["age_group + gender", "income_buckets"]
# the penalty is per element in the list of formulas:
penalty_factor = [1, 0.1]
adjusted = sample_with_target.adjust(
method="ipw",
transformations=transformations,
formula=formula,
penalty_factor=penalty_factor,
# max_de=None,
)
print(adjusted.summary())
print(adjusted.outcomes().summary())
adjusted.covars().plot(library="seaborn", dist_type="kde")
# By adding income_buckets and using it instead of income, and by giving it a smaller penalty (0.1 vs. 1) so that it is regularized less,
# we managed to correct income quite well, but at the expense of age and gender.
INFO (2026-01-06 22:16:35,237) [ipw/ipw (line 622)]: Starting ipw function
INFO (2026-01-06 22:16:35,239) [adjustment/apply_transformations (line 469)]: Adding the variables: ['income_buckets']
INFO (2026-01-06 22:16:35,240) [adjustment/apply_transformations (line 470)]: Transforming the variables: ['age_group', 'gender', 'income']
INFO (2026-01-06 22:16:35,246) [adjustment/apply_transformations (line 507)]: Final variables in output: ['income_buckets', 'age_group', 'gender', 'income']
INFO (2026-01-06 22:16:35,257) [ipw/ipw (line 656)]: Building model matrix
INFO (2026-01-06 22:16:35,354) [ipw/ipw (line 678)]: The formula used to build the model matrix: ['age_group + gender', 'income_buckets']
INFO (2026-01-06 22:16:35,355) [ipw/ipw (line 681)]: The number of columns in the model matrix: 26
INFO (2026-01-06 22:16:35,356) [ipw/ipw (line 682)]: The number of rows in the model matrix: 11000
INFO (2026-01-06 22:16:58,344) [ipw/ipw (line 843)]: Done with sklearn
INFO (2026-01-06 22:16:58,345) [ipw/ipw (line 845)]: max_de: None
INFO (2026-01-06 22:16:58,345) [ipw/ipw (line 867)]: Starting model selection
INFO (2026-01-06 22:16:58,349) [ipw/ipw (line 900)]: Chosen lambda: 0.09460271806598614
INFO (2026-01-06 22:16:58,350) [ipw/ipw (line 918)]: Proportion null deviance explained 0.17682033095496275
Adjustment details:
method: ipw
weight trimming mean ratio: 20
Covariate diagnostics:
Covar ASMD reduction: 70.0%
Covar ASMD (7 variables): 0.327 -> 0.098
Covar mean KLD reduction: 89.9%
Covar mean KLD (3 variables): 0.071 -> 0.007
Weight diagnostics:
design effect (Deff): 2.391
effective sample size proportion (ESSP): 0.418
effective sample size (ESS): 418.2
Outcome weighted means:
happiness
source
self 52.289
target 56.278
unadjusted 48.559
Model performance: Model proportion deviance explained: 0.177
1 outcomes: ['happiness']
Mean outcomes (with 95% confidence intervals):
source self target unadjusted self_ci target_ci unadjusted_ci
happiness 52.289 56.278 48.559 (51.06, 53.519) (55.961, 56.595) (47.669, 49.449)
Response rates (relative to number of respondents in sample):
happiness
n 1000.0
% 100.0
Response rates (relative to notnull rows in the target):
happiness
n 1000.0
% 10.0
Response rates (in the target):
happiness
n 10000.0
% 100.0
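The `penalty_factor` list in the run above has one entry per formula in the `formula` list. As a rough illustration of the idea only (how balance applies the factors internally is not shown in this tutorial, so the details here are assumptions), a per-formula factor can be expanded to one factor per model-matrix column and used to scale an L1 penalty. The function names and the column counts are hypothetical.

```python
def expand_penalty_factors(columns_per_formula, penalty_factor):
    """Repeat each formula's factor once per model-matrix column it produces."""
    if len(columns_per_formula) != len(penalty_factor):
        raise ValueError("penalty_factor needs one entry per formula")
    expanded = []
    for n_cols, factor in zip(columns_per_formula, penalty_factor):
        expanded.extend([factor] * n_cols)
    return expanded


def weighted_l1_penalty(coefs, column_factors, lam):
    """lam * sum_j factor_j * |coef_j|: smaller factors mean weaker shrinkage."""
    return lam * sum(f * abs(b) for b, f in zip(coefs, column_factors))


# Hypothetical layout: the first formula yields 2 columns, the second 3.
factors = expand_penalty_factors([2, 3], [1, 0.1])
coefs = [0.5, -0.5, 1.0, -1.0, 2.0]

print(factors)
print(weighted_l1_penalty(coefs, factors, lam=0.1))
```

With a factor of 0.1, the coefficients of the second formula contribute a tenth as much to the penalty, so the fit is freer to use them; this is consistent with income being corrected more strongly in the run above.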
CBPS¶
Let's see if we can improve on CBPS a bit.
# Defaults from the package
adjusted = sample_with_target.adjust(
method="cbps",
# max_de=None,
)
print(adjusted.summary())
print(adjusted.outcomes().summary())
adjusted.covars().plot(library="seaborn", dist_type="kde")
# CBPS already corrects a lot. Let's see if we can make it correct a tiny bit more.
INFO (2026-01-06 22:16:59,860) [cbps/cbps (line 436)]: Starting cbps function
INFO (2026-01-06 22:16:59,862) [adjustment/apply_transformations (line 469)]: Adding the variables: []
INFO (2026-01-06 22:16:59,863) [adjustment/apply_transformations (line 470)]: Transforming the variables: ['gender', 'age_group', 'income']
INFO (2026-01-06 22:16:59,873) [adjustment/apply_transformations (line 507)]: Final variables in output: ['gender', 'age_group', 'income']
INFO (2026-01-06 22:16:59,975) [cbps/cbps (line 487)]: The formula used to build the model matrix: ['income + gender + age_group + _is_na_gender']
INFO (2026-01-06 22:16:59,977) [cbps/cbps (line 498)]: The number of columns in the model matrix: 16
INFO (2026-01-06 22:16:59,978) [cbps/cbps (line 499)]: The number of rows in the model matrix: 11000
INFO (2026-01-06 22:16:59,983) [cbps/cbps (line 568)]: Finding initial estimator for GMM optimization
INFO (2026-01-06 22:17:00,094) [cbps/cbps (line 595)]: Finding initial estimator for GMM optimization that minimizes the balance loss
INFO (2026-01-06 22:17:00,350) [cbps/cbps (line 631)]: Running GMM optimization
INFO (2026-01-06 22:17:00,826) [cbps/cbps (line 758)]: Done cbps function
Adjustment details:
method: cbps
Covariate diagnostics:
Covar ASMD reduction: 77.4%
Covar ASMD (7 variables): 0.327 -> 0.074
Covar mean KLD reduction: 98.3%
Covar mean KLD (3 variables): 0.071 -> 0.001
Weight diagnostics:
design effect (Deff): 2.754
effective sample size proportion (ESSP): 0.363
effective sample size (ESS): 363.1
Outcome weighted means:
happiness
source
self 54.366
target 56.278
unadjusted 48.559
1 outcomes: ['happiness']
Mean outcomes (with 95% confidence intervals):
source self target unadjusted self_ci target_ci unadjusted_ci
happiness 54.366 56.278 48.559 (53.003, 55.73) (55.961, 56.595) (47.669, 49.449)
Response rates (relative to number of respondents in sample):
happiness
n 1000.0
% 100.0
Response rates (relative to notnull rows in the target):
happiness
n 1000.0
% 10.0
Response rates (in the target):
happiness
n 10000.0
% 100.0
import numpy as np
# Replace income with its log and with quantile buckets (q=5); age_group and gender are kept as-is
transformations = {
"age_group": lambda x: x,
"gender": lambda x: x,
# "income": lambda x: x,
"income_log": lambda x: np.log(x.income.fillna(x.income.mean())),
"income_buckets": lambda x: quantize(x.income.fillna(x.income.mean()), q=5),
}
formula = ["age_group + gender + income_log * income_buckets"]
adjusted = sample_with_target.adjust(
method="cbps",
transformations=transformations,
formula=formula,
# penalty_factor=penalty_factor, # CBPS seems to ignore the penalty factor.
# max_de=None,
)
print(adjusted.summary())
print(adjusted.outcomes().summary())
adjusted.covars().plot(library="seaborn", dist_type="kde")
# Trying various transformations gives slightly different results (some effect on the outcome, Deff and ASMD) - but nothing too major here.
INFO (2026-01-06 22:17:02,373) [cbps/cbps (line 436)]: Starting cbps function
INFO (2026-01-06 22:17:02,376) [adjustment/apply_transformations (line 469)]: Adding the variables: ['income_log', 'income_buckets']
INFO (2026-01-06 22:17:02,376) [adjustment/apply_transformations (line 470)]: Transforming the variables: ['age_group', 'gender']
WARNING (2026-01-06 22:17:02,382) [adjustment/apply_transformations (line 504)]: Dropping the variables: ['income']
INFO (2026-01-06 22:17:02,383) [adjustment/apply_transformations (line 507)]: Final variables in output: ['income_log', 'income_buckets', 'age_group', 'gender']
INFO (2026-01-06 22:17:02,493) [cbps/cbps (line 487)]: The formula used to build the model matrix: ['age_group + gender + income_log * income_buckets']
INFO (2026-01-06 22:17:02,495) [cbps/cbps (line 498)]: The number of columns in the model matrix: 15
INFO (2026-01-06 22:17:02,495) [cbps/cbps (line 499)]: The number of rows in the model matrix: 11000
INFO (2026-01-06 22:17:02,501) [cbps/cbps (line 568)]: Finding initial estimator for GMM optimization
INFO (2026-01-06 22:17:02,622) [cbps/cbps (line 595)]: Finding initial estimator for GMM optimization that minimizes the balance loss
INFO (2026-01-06 22:17:02,943) [cbps/cbps (line 631)]: Running GMM optimization
INFO (2026-01-06 22:17:03,383) [cbps/cbps (line 758)]: Done cbps function
Adjustment details:
method: cbps
Covariate diagnostics:
Covar ASMD reduction: 82.1%
Covar ASMD (7 variables): 0.327 -> 0.059
Covar mean KLD reduction: 98.4%
Covar mean KLD (3 variables): 0.071 -> 0.001
Weight diagnostics:
design effect (Deff): 3.030
effective sample size proportion (ESSP): 0.330
effective sample size (ESS): 330.1
Outcome weighted means:
happiness
source
self 54.432
target 56.278
unadjusted 48.559
1 outcomes: ['happiness']
Mean outcomes (with 95% confidence intervals):
source self target unadjusted self_ci target_ci unadjusted_ci
happiness 54.432 56.278 48.559 (53.042, 55.822) (55.961, 56.595) (47.669, 49.449)
Response rates (relative to number of respondents in sample):
happiness
n 1000.0
% 100.0
Response rates (relative to notnull rows in the target):
happiness
n 1000.0
% 10.0
Response rates (in the target):
happiness
n 10000.0
% 100.0
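The `income_buckets` transformation used in the runs above fills missing incomes with the mean and then cuts the values into quantile buckets. The sketch below mimics that idea with the standard library only. It is not balance's `quantize` implementation, and the helper names are made up for illustration.

```python
from bisect import bisect_left
from statistics import quantiles


def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def quantile_bucket_edges(values, q):
    """The q - 1 interior cut points that split `values` into q buckets."""
    return quantiles(values, n=q, method="inclusive")


def assign_bucket(value, edges):
    """Index (0 .. q-1) of the right-closed bucket that contains `value`."""
    return bisect_left(edges, value)


incomes = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
edges = quantile_bucket_edges(incomes, q=5)
buckets = [assign_bucket(v, edges) for v in incomes]
print(buckets)
```

Bucketing turns one numeric covariate into several indicator columns, which lets the propensity model fit a non-linear income effect at the cost of more parameters.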
# Session info
import session_info
session_info.show(html=False, dependencies=True)
-----
balance 0.14.0
numpy 1.26.4
pandas 2.3.3
session_info v1.0.1
-----
PIL 11.3.0
anyio NA
arrow 1.4.0
asttokens NA
attr 25.4.0
attrs 25.4.0
babel 2.17.0
certifi 2026.01.04
charset_normalizer 3.4.4
comm 0.2.3
cycler 0.12.1
cython_runtime NA
dateutil 2.9.0.post0
debugpy 1.8.19
decorator 5.2.1
defusedxml 0.7.1
exceptiongroup 1.3.1
executing 2.2.1
fastjsonschema NA
fqdn NA
idna 3.11
importlib_metadata NA
importlib_resources NA
ipykernel 6.31.0
isoduration NA
jedi 0.19.2
jinja2 3.1.6
joblib 1.5.3
json5 0.13.0
jsonpointer 3.0.0
jsonschema 4.25.1
jsonschema_specifications NA
jupyter_events 0.12.0
jupyter_server 2.17.0
jupyterlab_server 2.28.0
kiwisolver 1.4.7
lark 1.3.1
markupsafe 3.0.3
matplotlib 3.9.4
matplotlib_inline 0.2.1
mpl_toolkits NA
narwhals 2.15.0
nbformat 5.10.4
overrides NA
packaging 25.0
parso 0.8.5
patsy 1.0.2
pexpect 4.9.0
platformdirs 4.4.0
plotly 6.5.0
prometheus_client NA
prompt_toolkit 3.0.52
psutil 7.2.1
ptyprocess 0.7.0
pure_eval 0.2.3
pydev_ipython NA
pydevconsole NA
pydevd 3.2.3
pydevd_file_utils NA
pydevd_plugins NA
pydevd_tracing NA
pygments 2.19.2
pyparsing 3.3.1
pythonjsonlogger NA
pytz 2025.2
referencing NA
requests 2.32.5
rfc3339_validator 0.1.4
rfc3986_validator 0.1.1
rfc3987_syntax NA
rpds NA
scipy 1.13.1
seaborn 0.13.2
send2trash NA
six 1.17.0
sklearn 1.3.2
sphinxcontrib NA
stack_data 0.6.3
statsmodels 0.14.6
threadpoolctl 3.6.0
tornado 6.5.4
traitlets 5.14.3
typing_extensions NA
uri_template NA
urllib3 2.6.2
wcwidth 0.2.14
webcolors NA
websocket 1.9.0
yaml 6.0.3
zipp NA
zmq 27.1.0
zoneinfo NA
-----
IPython 8.18.1
jupyter_client 8.6.3
jupyter_core 5.8.1
jupyterlab 4.5.1
notebook 7.5.1
-----
Python 3.9.25 (main, Nov 3 2025, 15:16:36) [GCC 13.3.0]
Linux-6.11.0-1018-azure-x86_64-with-glibc2.39
-----
Session information updated at 2026-01-06 22:17