balance: transformations and formulas¶

This tutorial focuses on the ways transformations, formulas, and penalty factors can be used to pre-process the covariates before adjusting for them.

Example dataset - preparing the objects¶

The following is a toy simulated dataset.

For a more basic walkthrough of the elements in the next code block, please take a look at the tutorial: balance Quickstart: Analyzing and adjusting the bias on a simulated toy dataset

In [1]:
from balance import load_data
target_df, sample_df = load_data()
from balance import Sample
sample = Sample.from_frame(sample_df, outcome_columns=["happiness"])
target = Sample.from_frame(target_df, outcome_columns=["happiness"])
sample_with_target = sample.set_target(target)
sample_with_target
INFO (2026-02-21 04:46:32,220) [__init__/<module> (line 72)]: Using balance version 0.16.1
WARNING (2026-02-21 04:46:32,401) [input_validation/guess_id_column (line 337)]: Guessed id column name id for the data
WARNING (2026-02-21 04:46:32,413) [sample_class/from_frame (line 549)]: No weights passed. Adding a 'weight' column and setting all values to 1
WARNING (2026-02-21 04:46:32,424) [input_validation/guess_id_column (line 337)]: Guessed id column name id for the data
balance (Version 0.16.1) loaded:
    šŸ“– Documentation: https://import-balance.org/
    šŸ› ļø Help / Issues: https://github.com/facebookresearch/balance/issues/
    šŸ“„ Citation:
        Sarig, T., Galili, T., & Eilat, R. (2023).
        balance - a Python package for balancing biased data samples.
        https://arxiv.org/abs/2307.06024

    Tip: You can view this message anytime with balance.help()

WARNING (2026-02-21 04:46:32,439) [sample_class/from_frame (line 549)]: No weights passed. Adding a 'weight' column and setting all values to 1
Out[1]:
(balance.sample_class.Sample)

        balance Sample object with target set
        1000 observations x 3 variables: gender,age_group,income
        id_column: id, weight_column: weight,
        outcome_columns: happiness
        
            target:
                 
	        balance Sample object
	        10000 observations x 3 variables: gender,age_group,income
	        id_column: id, weight_column: weight,
	        outcome_columns: happiness
	        
            3 common variables: gender,age_group,income
            

Transformations¶

Basic usage: manipulating existing variables¶

When trying to understand what an adjustment does, we can look at the model_coef items collected from the diagnostics method.

In [2]:
adjusted = sample_with_target.adjust(
    # method="ipw", # default method
    # transformations=None,
    # formula=None,
    # penalty_factor=None, # all 1s
    # max_de=None,
)
adj_diag = adjusted.diagnostics()
adj_diag.query("metric == 'model_coef'")
INFO (2026-02-21 04:46:32,460) [ipw/ipw (line 703)]: Starting ipw function
INFO (2026-02-21 04:46:32,462) [adjustment/apply_transformations (line 433)]: Adding the variables: []
INFO (2026-02-21 04:46:32,463) [adjustment/apply_transformations (line 434)]: Transforming the variables: ['gender', 'age_group', 'income']
INFO (2026-02-21 04:46:32,472) [adjustment/apply_transformations (line 469)]: Final variables in output: ['gender', 'age_group', 'income']
INFO (2026-02-21 04:46:32,481) [ipw/ipw (line 738)]: Building model matrix
INFO (2026-02-21 04:46:32,593) [ipw/ipw (line 764)]: The formula used to build the model matrix: ['income + gender + age_group + _is_na_gender']
INFO (2026-02-21 04:46:32,593) [ipw/ipw (line 767)]: The number of columns in the model matrix: 16
INFO (2026-02-21 04:46:32,594) [ipw/ipw (line 768)]: The number of rows in the model matrix: 11000
INFO (2026-02-21 04:46:49,303) [ipw/ipw (line 990)]: Done with sklearn
INFO (2026-02-21 04:46:49,304) [ipw/ipw (line 992)]: max_de: None
INFO (2026-02-21 04:46:49,304) [ipw/ipw (line 1014)]: Starting model selection
INFO (2026-02-21 04:46:49,307) [ipw/ipw (line 1047)]: Chosen lambda: 0.041158338186664825
INFO (2026-02-21 04:46:49,308) [ipw/ipw (line 1065)]: Proportion null deviance explained 0.172637976731583
INFO (2026-02-21 04:46:49,314) [sample_class/diagnostics (line 1827)]: Starting computation of diagnostics of the fitting
INFO (2026-02-21 04:46:49,641) [sample_class/diagnostics (line 2073)]: Done computing diagnostics
Out[2]:
metric val var
52 model_coef 0.138619 intercept
53 model_coef 0.043944 _is_na_gender[T.True]
54 model_coef -0.203732 age_group[T.25-34]
55 model_coef -0.428683 age_group[T.35-44]
56 model_coef -0.529556 age_group[T.45+]
57 model_coef 0.332490 gender[T.Male]
58 model_coef 0.043944 gender[T._NA]
59 model_coef 0.169578 income[Interval(-0.0009997440000000001, 0.44, ...
60 model_coef 0.154197 income[Interval(0.44, 1.664, closed='right')]
61 model_coef 0.111212 income[Interval(1.664, 3.472, closed='right')]
62 model_coef -0.041457 income[Interval(11.312, 15.139, closed='right')]
63 model_coef -0.161148 income[Interval(15.139, 20.567, closed='right')]
64 model_coef -0.211197 income[Interval(20.567, 29.504, closed='right')]
65 model_coef -0.357491 income[Interval(29.504, 128.536, closed='right')]
66 model_coef 0.093738 income[Interval(3.472, 5.663, closed='right')]
67 model_coef 0.072936 income[Interval(5.663, 8.211, closed='right')]
68 model_coef 0.005787 income[Interval(8.211, 11.312, closed='right')]

As we can see from the GLM coefficients, the age_group and gender variables each got an extra NA column, and the income variable was bucketed into 10 quantile-based buckets.
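The default bucketing of a numeric covariate into 10 quantile-based buckets is similar in spirit to pandas' `qcut` (a rough analogy for intuition; balance uses its own `quantize` helper):

```python
import numpy as np
import pandas as pd

# Toy income-like data: 1000 skewed positive values
rng = np.random.default_rng(0)
income = pd.Series(rng.exponential(scale=10, size=1000))

# 10 equal-frequency buckets, producing Interval categories similar to
# the income[Interval(...)] coefficients shown above
buckets = pd.qcut(income, q=10)
print(buckets.cat.categories.size)  # 10
```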

We can change these defaults by deciding on the specific transformation we want.

Let's start with NO transformations.

The transformations argument accepts either a dict or None, where None indicates that no transformations should be applied.

In [3]:
adjusted = sample_with_target.adjust(
    # method="ipw",
    transformations=None,
    # formula=formula,
    # penalty_factor=penalty_factor,
    # max_de=None,
)
adj_diag = adjusted.diagnostics()
adj_diag.query("metric == 'model_coef'")
INFO (2026-02-21 04:46:49,659) [ipw/ipw (line 703)]: Starting ipw function
INFO (2026-02-21 04:46:49,661) [ipw/ipw (line 738)]: Building model matrix
INFO (2026-02-21 04:46:49,739) [ipw/ipw (line 764)]: The formula used to build the model matrix: ['income + gender + age_group + _is_na_gender']
INFO (2026-02-21 04:46:49,740) [ipw/ipw (line 767)]: The number of columns in the model matrix: 8
INFO (2026-02-21 04:46:49,741) [ipw/ipw (line 768)]: The number of rows in the model matrix: 11000
INFO (2026-02-21 04:47:05,940) [ipw/ipw (line 990)]: Done with sklearn
INFO (2026-02-21 04:47:05,941) [ipw/ipw (line 992)]: max_de: None
INFO (2026-02-21 04:47:05,942) [ipw/ipw (line 1014)]: Starting model selection
INFO (2026-02-21 04:47:05,944) [ipw/ipw (line 1047)]: Chosen lambda: 0.0368353720078807
INFO (2026-02-21 04:47:05,945) [ipw/ipw (line 1065)]: Proportion null deviance explained 0.17354102148345973
INFO (2026-02-21 04:47:05,952) [sample_class/diagnostics (line 1827)]: Starting computation of diagnostics of the fitting
INFO (2026-02-21 04:47:06,276) [sample_class/diagnostics (line 2073)]: Done computing diagnostics
Out[3]:
metric val var
52 model_coef 0.998308 intercept
53 model_coef -0.000021 _is_na_gender[T.True]
54 model_coef -0.212995 age_group[T.25-34]
55 model_coef -0.440444 age_group[T.35-44]
56 model_coef -0.545756 age_group[T.45+]
57 model_coef -0.188264 gender[Female]
58 model_coef 0.181577 gender[Male]
59 model_coef -0.000021 gender[_NA]
60 model_coef -0.570540 income

In this setting, income was treated as a numeric variable, with no transformations (e.g., bucketing) applied to it. Regardless of the transformations, building the model matrix still turns gender and age_group into dummy variables (including a column for NA).
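The dummy coding of categorical covariates can be illustrated with plain pandas (a simplified sketch; balance's model matrix is built from the formula, so the exact column names differ):

```python
import pandas as pd

gender = pd.Series(["Male", "Female", None, "Male"], name="gender")

# One indicator column per level; dummy_na adds an indicator for missing
# values, loosely mirroring the gender[_NA] / _is_na_gender columns above
dummies = pd.get_dummies(gender, prefix="gender", dummy_na=True)
print(list(dummies.columns))
# ['gender_Female', 'gender_Male', 'gender_nan']
```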

Next, we can apply a simple transformation.

Let's say we wanted to lump together age_group levels that each contain less than 25% of the data, and use a different bucketing for income. Here is how we'd do it:

In [4]:
from balance.util import fct_lump, quantize

transformations = {
    "age_group": lambda x: fct_lump(x, 0.25),
    "gender": lambda x: x,
    "income": lambda x: quantize(x.fillna(x.mean()), q=3),
}

adjusted = sample_with_target.adjust(
    # method="ipw",
    transformations=transformations,
    # formula=formula,
    # penalty_factor=penalty_factor,
    # max_de=None,
)
adj_diag = adjusted.diagnostics()
adj_diag.query("metric == 'model_coef'")
INFO (2026-02-21 04:47:06,294) [ipw/ipw (line 703)]: Starting ipw function
INFO (2026-02-21 04:47:06,296) [adjustment/apply_transformations (line 433)]: Adding the variables: []
INFO (2026-02-21 04:47:06,296) [adjustment/apply_transformations (line 434)]: Transforming the variables: ['age_group', 'gender', 'income']
INFO (2026-02-21 04:47:06,303) [adjustment/apply_transformations (line 469)]: Final variables in output: ['age_group', 'gender', 'income']
INFO (2026-02-21 04:47:06,311) [ipw/ipw (line 738)]: Building model matrix
INFO (2026-02-21 04:47:06,421) [ipw/ipw (line 764)]: The formula used to build the model matrix: ['income + gender + age_group + _is_na_gender']
INFO (2026-02-21 04:47:06,421) [ipw/ipw (line 767)]: The number of columns in the model matrix: 8
INFO (2026-02-21 04:47:06,422) [ipw/ipw (line 768)]: The number of rows in the model matrix: 11000
INFO (2026-02-21 04:47:21,188) [ipw/ipw (line 990)]: Done with sklearn
INFO (2026-02-21 04:47:21,189) [ipw/ipw (line 992)]: max_de: None
INFO (2026-02-21 04:47:21,190) [ipw/ipw (line 1014)]: Starting model selection
INFO (2026-02-21 04:47:21,192) [ipw/ipw (line 1047)]: Chosen lambda: 0.11811067639400605
INFO (2026-02-21 04:47:21,193) [ipw/ipw (line 1065)]: Proportion null deviance explained 0.0942012400003771
WARNING (2026-02-21 04:47:21,194) [ipw/ipw (line 1073)]: The propensity model has low fraction null deviance explained (0.0942012400003771). Results may not be accurate
INFO (2026-02-21 04:47:21,201) [sample_class/diagnostics (line 1827)]: Starting computation of diagnostics of the fitting
INFO (2026-02-21 04:47:21,527) [sample_class/diagnostics (line 2073)]: Done computing diagnostics
Out[4]:
metric val var
52 model_coef -0.326556 intercept
53 model_coef 0.031655 _is_na_gender[T.True]
54 model_coef -0.182093 age_group[T.35-44]
55 model_coef 0.098831 age_group[T._lumped_other]
56 model_coef 0.241560 gender[T.Male]
57 model_coef 0.031655 gender[T._NA]
58 model_coef 0.198958 income[Interval(-0.0009997440000000001, 4.194,...
59 model_coef -0.283196 income[Interval(13.693, 128.536, closed='right')]
60 model_coef 0.048154 income[Interval(4.194, 13.693, closed='right')]

As we can see, income now has only 3 buckets, and age_group was lumped into two groups (the "small" levels were collapsed together into the _lumped_other bucket).

Lastly, notice that if we omit a variable from transformations, it will not be available when constructing the model (this behavior might change in the future).

In [5]:
transformations = {
    # "age_group": lambda x: fct_lump(x, 0.25),
    "gender": lambda x: x,
    # "income": lambda x: quantize(x.fillna(x.mean()), q=3),
}

adjusted = sample_with_target.adjust(
    # method="ipw",
    transformations=transformations,
    # formula=formula,
    # penalty_factor=penalty_factor,
    # max_de=None,
)
adj_diag = adjusted.diagnostics()
adj_diag.query("metric == 'model_coef'")
INFO (2026-02-21 04:47:21,544) [ipw/ipw (line 703)]: Starting ipw function
INFO (2026-02-21 04:47:21,545) [adjustment/apply_transformations (line 433)]: Adding the variables: []
INFO (2026-02-21 04:47:21,546) [adjustment/apply_transformations (line 434)]: Transforming the variables: ['gender']
WARNING (2026-02-21 04:47:21,547) [adjustment/apply_transformations (line 466)]: Dropping the variables: ['age_group', 'income']
INFO (2026-02-21 04:47:21,547) [adjustment/apply_transformations (line 469)]: Final variables in output: ['gender']
INFO (2026-02-21 04:47:21,550) [ipw/ipw (line 738)]: Building model matrix
INFO (2026-02-21 04:47:21,596) [ipw/ipw (line 764)]: The formula used to build the model matrix: ['gender + _is_na_gender']
INFO (2026-02-21 04:47:21,596) [ipw/ipw (line 767)]: The number of columns in the model matrix: 4
INFO (2026-02-21 04:47:21,597) [ipw/ipw (line 768)]: The number of rows in the model matrix: 11000
INFO (2026-02-21 04:47:31,788) [ipw/ipw (line 990)]: Done with sklearn
INFO (2026-02-21 04:47:31,790) [ipw/ipw (line 992)]: max_de: None
INFO (2026-02-21 04:47:31,790) [ipw/ipw (line 1014)]: Starting model selection
INFO (2026-02-21 04:47:31,793) [ipw/ipw (line 1047)]: Chosen lambda: 0.20570886693214954
INFO (2026-02-21 04:47:31,794) [ipw/ipw (line 1065)]: Proportion null deviance explained 0.02648728826600144
WARNING (2026-02-21 04:47:31,795) [ipw/ipw (line 1073)]: The propensity model has low fraction null deviance explained (0.02648728826600144). Results may not be accurate
INFO (2026-02-21 04:47:31,801) [sample_class/diagnostics (line 1827)]: Starting computation of diagnostics of the fitting
INFO (2026-02-21 04:47:32,122) [sample_class/diagnostics (line 2073)]: Done computing diagnostics
Out[5]:
metric val var
52 model_coef -0.035328 intercept
53 model_coef 0.001072 _is_na_gender[T.True]
54 model_coef -0.141643 gender[Female]
55 model_coef 0.136137 gender[Male]
56 model_coef 0.001072 gender[_NA]

As we can see, only gender was included in the model.

In [6]:
# TODO: add more examples about how add_na works
# TODO: add more examples about rare values in categorical variables and how they are grouped together. 

Creating new variables¶

In the next example we will create several new transformations of income.

The logged info shows which variables were added, which were transformed, and what the final variables in the output are.

The x in the lambda function can have one of two meanings:

  1. When the keys in the dict match the exact names of the variables in the DataFrame (e.g.: "income"), then the lambda function treats x as the pandas.Series of that variable.
  2. If the name of the key does NOT exist in the DataFrame (e.g.: "income_squared"), then x will become the DataFrame of the data.
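The dispatch rule described above can be sketched as follows (a toy emulation for intuition, not balance's actual implementation):

```python
import pandas as pd

def apply_transformations_sketch(df, transformations):
    """Toy emulation: a key matching an existing column receives that
    column as a Series; any other key receives the whole DataFrame."""
    out = {}
    for name, func in transformations.items():
        x = df[name] if name in df.columns else df
        out[name] = func(x)
    return pd.DataFrame(out)

df = pd.DataFrame({"income": [1.0, 2.0, 3.0]})
res = apply_transformations_sketch(
    df,
    {
        "income": lambda x: x,                    # x is the income Series
        "income_squared": lambda x: x.income**2,  # x is the DataFrame
    },
)
print(list(res.columns))  # ['income', 'income_squared']
```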
In [7]:
from balance.util import fct_lump, quantize

transformations = {
    "age_group": lambda x: x,
    "gender": lambda x: x,
    "income": lambda x: x,
    "income_squared": lambda x: x.income**2,
    "income_buckets": lambda x: quantize(x.income.fillna(x.income.mean()), q=3),
}

adjusted = sample_with_target.adjust(
    # method="ipw",
    transformations=transformations,
    # formula=formula,
    # penalty_factor=penalty_factor,
    # max_de=None,
)
adj_diag = adjusted.diagnostics()
adj_diag.query("metric == 'model_coef'")
INFO (2026-02-21 04:47:32,143) [ipw/ipw (line 703)]: Starting ipw function
INFO (2026-02-21 04:47:32,145) [adjustment/apply_transformations (line 433)]: Adding the variables: ['income_squared', 'income_buckets']
INFO (2026-02-21 04:47:32,146) [adjustment/apply_transformations (line 434)]: Transforming the variables: ['age_group', 'gender', 'income']
INFO (2026-02-21 04:47:32,151) [adjustment/apply_transformations (line 469)]: Final variables in output: ['income_squared', 'income_buckets', 'age_group', 'gender', 'income']
INFO (2026-02-21 04:47:32,163) [ipw/ipw (line 738)]: Building model matrix
INFO (2026-02-21 04:47:32,277) [ipw/ipw (line 764)]: The formula used to build the model matrix: ['income_squared + income_buckets + income + gender + age_group + _is_na_gender']
INFO (2026-02-21 04:47:32,278) [ipw/ipw (line 767)]: The number of columns in the model matrix: 11
INFO (2026-02-21 04:47:32,278) [ipw/ipw (line 768)]: The number of rows in the model matrix: 11000
INFO (2026-02-21 04:47:52,378) [ipw/ipw (line 990)]: Done with sklearn
INFO (2026-02-21 04:47:52,379) [ipw/ipw (line 992)]: max_de: None
INFO (2026-02-21 04:47:52,380) [ipw/ipw (line 1014)]: Starting model selection
INFO (2026-02-21 04:47:52,385) [ipw/ipw (line 1047)]: Chosen lambda: 0.043506507030756265
INFO (2026-02-21 04:47:52,386) [ipw/ipw (line 1065)]: Proportion null deviance explained 0.17222993109973506
INFO (2026-02-21 04:47:52,394) [sample_class/diagnostics (line 1827)]: Starting computation of diagnostics of the fitting
INFO (2026-02-21 04:47:52,718) [sample_class/diagnostics (line 2073)]: Done computing diagnostics
Out[7]:
metric val var
52 model_coef 0.412704 intercept
53 model_coef 0.044661 _is_na_gender[T.True]
54 model_coef -0.194322 age_group[T.25-34]
55 model_coef -0.421090 age_group[T.35-44]
56 model_coef -0.521683 age_group[T.45+]
57 model_coef 0.326123 gender[T.Male]
58 model_coef 0.044661 gender[T._NA]
59 model_coef -0.264665 income
60 model_coef 0.115075 income_buckets[Interval(-0.0009997440000000001...
61 model_coef -0.165803 income_buckets[Interval(13.693, 128.536, close...
62 model_coef 0.033703 income_buckets[Interval(4.194, 13.693, closed=...
63 model_coef -0.186473 income_squared

Formula¶

The formula argument accepts a list of strings indicating how to combine the transformed variables. It follows the formula notation from patsy.
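In patsy notation, `a * b` is shorthand for `a + b + a:b`, i.e. both main effects plus their interaction. We can sanity-check the number of model-matrix columns this produces by hand (a small sketch; the level counts of 4 and 3 are this dataset's age_group and gender levels, including the _NA level):

```python
# Counting model-matrix columns for "age_group * gender", matching the
# coding visible in the coefficient table of the interaction example:
# age_group keeps all of its levels, gender is treatment-coded, and the
# interaction crosses the treatment-coded levels of both.
n_age, n_gender = 4, 3  # 4 age groups; Female/Male/_NA for gender

n_cols = (
    n_age                            # age_group[18-24] ... age_group[45+]
    + (n_gender - 1)                 # gender[T.Male], gender[T._NA]
    + (n_age - 1) * (n_gender - 1)   # age_group[T.*]:gender[T.*]
)
print(n_cols)  # 12
```

This matches the "number of columns in the model matrix: 12" reported in the ipw log for that formula.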

For example, we can have an interaction between age_group and gender:

In [8]:
from balance.util import fct_lump_by, quantize

transformations = {
    "age_group": lambda x: x,
    "gender": lambda x: x,
    "income": lambda x: quantize(x.fillna(x.mean()), q=20),
}
formula = ["age_group * gender"]
# the penalty is per element in the list of formula:
# penalty_factor = [0.1]

adjusted = sample_with_target.adjust(
    method="ipw",
    transformations=transformations,
    formula=formula,
    # penalty_factor=penalty_factor,
    # max_de=None,
)

adj_diag = adjusted.diagnostics()
adj_diag.query("metric == 'model_coef'")
INFO (2026-02-21 04:47:52,738) [ipw/ipw (line 703)]: Starting ipw function
INFO (2026-02-21 04:47:52,740) [adjustment/apply_transformations (line 433)]: Adding the variables: []
INFO (2026-02-21 04:47:52,740) [adjustment/apply_transformations (line 434)]: Transforming the variables: ['age_group', 'gender', 'income']
INFO (2026-02-21 04:47:52,745) [adjustment/apply_transformations (line 469)]: Final variables in output: ['age_group', 'gender', 'income']
INFO (2026-02-21 04:47:52,754) [ipw/ipw (line 738)]: Building model matrix
INFO (2026-02-21 04:47:52,832) [ipw/ipw (line 764)]: The formula used to build the model matrix: ['age_group * gender']
INFO (2026-02-21 04:47:52,832) [ipw/ipw (line 767)]: The number of columns in the model matrix: 12
INFO (2026-02-21 04:47:52,833) [ipw/ipw (line 768)]: The number of rows in the model matrix: 11000
INFO (2026-02-21 04:48:10,162) [ipw/ipw (line 990)]: Done with sklearn
INFO (2026-02-21 04:48:10,163) [ipw/ipw (line 992)]: max_de: None
INFO (2026-02-21 04:48:10,163) [ipw/ipw (line 1014)]: Starting model selection
INFO (2026-02-21 04:48:10,166) [ipw/ipw (line 1047)]: Chosen lambda: 0.0894967426547247
INFO (2026-02-21 04:48:10,167) [ipw/ipw (line 1065)]: Proportion null deviance explained 0.11496433010847928
INFO (2026-02-21 04:48:10,173) [sample_class/diagnostics (line 1827)]: Starting computation of diagnostics of the fitting
INFO (2026-02-21 04:48:10,495) [sample_class/diagnostics (line 2073)]: Done computing diagnostics
Out[8]:
metric val var
52 model_coef -0.349343 intercept
53 model_coef 0.334719 age_group[18-24]
54 model_coef 0.005834 age_group[25-34]
55 model_coef -0.160262 age_group[35-44]
56 model_coef -0.280376 age_group[45+]
57 model_coef 0.031766 age_group[T.25-34]:gender[T.Male]
58 model_coef 0.016422 age_group[T.25-34]:gender[T._NA]
59 model_coef -0.037757 age_group[T.35-44]:gender[T.Male]
60 model_coef -0.046515 age_group[T.35-44]:gender[T._NA]
61 model_coef -0.032210 age_group[T.45+]:gender[T.Male]
62 model_coef -0.042490 age_group[T.45+]:gender[T._NA]
63 model_coef 0.272436 gender[T.Male]
64 model_coef 0.062867 gender[T._NA]

As we can see, the formula produces interaction terms of age_group and gender, as well as main effects for each. Since income was not in the formula, it is not included in the model.

descriptive_stats with formulas¶

You can also use formulas when computing descriptive statistics to control which terms or dummy variables are included in the summary. This is helpful when you want summary statistics for a subset of columns or for specific categorical expansions.

In [9]:
from balance.stats_and_plots.weighted_stats import descriptive_stats
import pandas as pd

df = pd.DataFrame({"num": [1, 2, 3], "group": ["a", "b", "a"]})

# Only summarize the numeric column
descriptive_stats(df, stat="mean", formula="num")

# Summarize the categorical column via its dummy variables
descriptive_stats(df, stat="mean", formula="group")
Out[9]:
group[a] group[b]
0 0.666667 0.333333

Formula and penalty_factor¶

The formula can be provided as a list of several strings, and the penalty factor then indicates how strongly the model should adjust for each element of the formula. A larger penalty factor means that element will be corrected less.
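One way to think about penalty_factor: each element of the formula list gets one factor, which applies to all model-matrix columns that element generates. A toy sketch of that expansion (an assumption about the mechanics for intuition, not balance's actual code):

```python
def expand_penalty(formula, penalty_factor, n_cols_per_element):
    """Toy sketch: repeat each formula element's penalty factor once for
    every model-matrix column that element contributes."""
    per_column = []
    for pf, n_cols in zip(penalty_factor, n_cols_per_element):
        per_column.extend([pf] * n_cols)
    return per_column

# ["age_group + gender", "income"] with penalties [10, 0.1], assuming the
# first element contributes 6 columns and the second contributes 1
print(expand_penalty(["age_group + gender", "income"], [10, 0.1], [6, 1]))
# [10, 10, 10, 10, 10, 10, 0.1]
```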

The next two examples show how, in the first case, we focus on correcting for income, and in the second case, on correcting for age and gender.

In [10]:
transformations = {
    "age_group": lambda x: x,
    "gender": lambda x: x,
    "income": lambda x: x,
}
formula = ["age_group + gender", "income"]
# the penalty is per element in the list of formula:
penalty_factor = [10, 0.1]

adjusted = sample_with_target.adjust(
    method="ipw",
    transformations=transformations,
    formula=formula,
    penalty_factor=penalty_factor,
    # max_de=None,
)

adj_diag = adjusted.diagnostics()
adj_diag.query("metric == 'model_coef'")
INFO (2026-02-21 04:48:10,533) [ipw/ipw (line 703)]: Starting ipw function
INFO (2026-02-21 04:48:10,535) [adjustment/apply_transformations (line 433)]: Adding the variables: []
INFO (2026-02-21 04:48:10,536) [adjustment/apply_transformations (line 434)]: Transforming the variables: ['age_group', 'gender', 'income']
INFO (2026-02-21 04:48:10,537) [adjustment/apply_transformations (line 469)]: Final variables in output: ['age_group', 'gender', 'income']
INFO (2026-02-21 04:48:10,543) [ipw/ipw (line 738)]: Building model matrix
INFO (2026-02-21 04:48:10,622) [ipw/ipw (line 764)]: The formula used to build the model matrix: ['age_group + gender', 'income']
INFO (2026-02-21 04:48:10,622) [ipw/ipw (line 767)]: The number of columns in the model matrix: 7
INFO (2026-02-21 04:48:10,623) [ipw/ipw (line 768)]: The number of rows in the model matrix: 11000
INFO (2026-02-21 04:48:42,117) [ipw/ipw (line 990)]: Done with sklearn
INFO (2026-02-21 04:48:42,118) [ipw/ipw (line 992)]: max_de: None
INFO (2026-02-21 04:48:42,119) [ipw/ipw (line 1014)]: Starting model selection
INFO (2026-02-21 04:48:42,122) [ipw/ipw (line 1047)]: Chosen lambda: 0.0009460271806598614
INFO (2026-02-21 04:48:42,122) [ipw/ipw (line 1065)]: Proportion null deviance explained 0.1727205955628286
INFO (2026-02-21 04:48:42,130) [sample_class/diagnostics (line 1827)]: Starting computation of diagnostics of the fitting
INFO (2026-02-21 04:48:42,453) [sample_class/diagnostics (line 2073)]: Done computing diagnostics
Out[10]:
metric val var
52 model_coef 0.243947 intercept
53 model_coef 3.240641 age_group[18-24]
54 model_coef 0.393603 age_group[25-34]
55 model_coef -1.760098 age_group[35-44]
56 model_coef -2.917924 age_group[45+]
57 model_coef 2.596568 gender[T.Male]
58 model_coef 0.486109 gender[T._NA]
59 model_coef -0.073752 income

The above example corrected mainly for income; as we can see, age and gender received essentially no correction (since their penalty was so high). Let's now flip the penalties and correct for age and gender instead:

In [11]:
transformations = {
    "age_group": lambda x: x,
    "gender": lambda x: x,
    "income": lambda x: x,
}
formula = ["age_group + gender", "income"]
# the penalty is per element in the list of formula:
penalty_factor = [0.1, 10]  # this is flipped

adjusted = sample_with_target.adjust(
    method="ipw",
    transformations=transformations,
    formula=formula,
    penalty_factor=penalty_factor,
    # max_de=None,
)

adj_diag = adjusted.diagnostics()
adj_diag.query("metric == 'model_coef'")
INFO (2026-02-21 04:48:42,471) [ipw/ipw (line 703)]: Starting ipw function
INFO (2026-02-21 04:48:42,473) [adjustment/apply_transformations (line 433)]: Adding the variables: []
INFO (2026-02-21 04:48:42,473) [adjustment/apply_transformations (line 434)]: Transforming the variables: ['age_group', 'gender', 'income']
INFO (2026-02-21 04:48:42,475) [adjustment/apply_transformations (line 469)]: Final variables in output: ['age_group', 'gender', 'income']
INFO (2026-02-21 04:48:42,481) [ipw/ipw (line 738)]: Building model matrix
INFO (2026-02-21 04:48:42,559) [ipw/ipw (line 764)]: The formula used to build the model matrix: ['age_group + gender', 'income']
INFO (2026-02-21 04:48:42,560) [ipw/ipw (line 767)]: The number of columns in the model matrix: 7
INFO (2026-02-21 04:48:42,560) [ipw/ipw (line 768)]: The number of rows in the model matrix: 11000
INFO (2026-02-21 04:49:21,217) [ipw/ipw (line 990)]: Done with sklearn
INFO (2026-02-21 04:49:21,218) [ipw/ipw (line 992)]: max_de: None
INFO (2026-02-21 04:49:21,219) [ipw/ipw (line 1014)]: Starting model selection
INFO (2026-02-21 04:49:21,224) [ipw/ipw (line 1047)]: Chosen lambda: 0.001
INFO (2026-02-21 04:49:21,225) [ipw/ipw (line 1065)]: Proportion null deviance explained 0.17304814560907622
INFO (2026-02-21 04:49:21,234) [sample_class/diagnostics (line 1827)]: Starting computation of diagnostics of the fitting
INFO (2026-02-21 04:49:21,555) [sample_class/diagnostics (line 2073)]: Done computing diagnostics
Out[11]:
metric val var
52 model_coef -0.412403 intercept
53 model_coef 0.053595 age_group[18-24]
54 model_coef 0.014833 age_group[25-34]
55 model_coef -0.014770 age_group[35-44]
56 model_coef -0.037396 age_group[45+]
57 model_coef 0.041933 gender[T.Male]
58 model_coef 0.011918 gender[T._NA]
59 model_coef -4.208587 income

In the above case, income received essentially no correction.

We can also add two versions of income and give each of them a higher penalty than age and gender:

In [12]:
from balance.util import fct_lump_by, quantize

transformations = {
    "age_group": lambda x: x,
    "gender": lambda x: x,
    "income": lambda x: x,
    "income_buckets": lambda x: quantize(x.income.fillna(x.income.mean()), q=4),
}
formula = ["age_group + gender", "income", "income_buckets"]
# the penalty is per element in the list of formula:
penalty_factor = [1, 2, 2]

adjusted = sample_with_target.adjust(
    method="ipw",
    transformations=transformations,
    formula=formula,
    penalty_factor=penalty_factor,
    # max_de=None,
)

adj_diag = adjusted.diagnostics()
adj_diag.query("metric == 'model_coef'")
INFO (2026-02-21 04:49:21,572) [ipw/ipw (line 703)]: Starting ipw function
INFO (2026-02-21 04:49:21,575) [adjustment/apply_transformations (line 433)]: Adding the variables: ['income_buckets']
INFO (2026-02-21 04:49:21,575) [adjustment/apply_transformations (line 434)]: Transforming the variables: ['age_group', 'gender', 'income']
INFO (2026-02-21 04:49:21,580) [adjustment/apply_transformations (line 469)]: Final variables in output: ['income_buckets', 'age_group', 'gender', 'income']
INFO (2026-02-21 04:49:21,590) [ipw/ipw (line 738)]: Building model matrix
INFO (2026-02-21 04:49:21,704) [ipw/ipw (line 764)]: The formula used to build the model matrix: ['age_group + gender', 'income', 'income_buckets']
INFO (2026-02-21 04:49:21,704) [ipw/ipw (line 767)]: The number of columns in the model matrix: 11
INFO (2026-02-21 04:49:21,705) [ipw/ipw (line 768)]: The number of rows in the model matrix: 11000
INFO (2026-02-21 04:49:40,990) [ipw/ipw (line 990)]: Done with sklearn
INFO (2026-02-21 04:49:40,991) [ipw/ipw (line 992)]: max_de: None
INFO (2026-02-21 04:49:40,992) [ipw/ipw (line 1014)]: Starting model selection
INFO (2026-02-21 04:49:40,995) [ipw/ipw (line 1047)]: Chosen lambda: 0.043506507030756265
INFO (2026-02-21 04:49:40,996) [ipw/ipw (line 1065)]: Proportion null deviance explained 0.17297141960904983
INFO (2026-02-21 04:49:41,002) [sample_class/diagnostics (line 1827)]: Starting computation of diagnostics of the fitting
INFO (2026-02-21 04:49:41,326) [sample_class/diagnostics (line 2073)]: Done computing diagnostics
Out[12]:
metric val var
52 model_coef -0.291644 intercept
53 model_coef 0.380432 age_group[18-24]
54 model_coef 0.049269 age_group[25-34]
55 model_coef -0.198208 age_group[35-44]
56 model_coef -0.349693 age_group[45+]
57 model_coef 0.322417 gender[T.Male]
58 model_coef 0.074613 gender[T._NA]
59 model_coef -0.418615 income
60 model_coef 0.217427 income_buckets[Interval(-0.0009997440000000001...
61 model_coef -0.366163 income_buckets[Interval(17.694, 128.536, close...
62 model_coef 0.130635 income_buckets[Interval(2.53, 8.211, closed='r...
63 model_coef -0.050271 income_buckets[Interval(8.211, 17.694, closed=...

Another option is to create a separate formula element for each variable (or set of its variations), and give each a penalty of 1. For example:

In [13]:
from balance.util import fct_lump_by, quantize

transformations = {
    "age_group": lambda x: x,
    "gender": lambda x: x,
    "income": lambda x: x,
    "income_buckets": lambda x: quantize(x.income.fillna(x.income.mean()), q=4),
}
formula = ["age_group", "gender", "income + income_buckets"]
# the penalty is per element in the list of formula:
penalty_factor = [1, 1, 1]

adjusted = sample_with_target.adjust(
    method="ipw",
    transformations=transformations,
    formula=formula,
    penalty_factor=penalty_factor,
    # max_de=None,
)

adj_diag = adjusted.diagnostics()
adj_diag.query("metric == 'model_coef'")
INFO (2026-02-21 04:49:41,345) [ipw/ipw (line 703)]: Starting ipw function
INFO (2026-02-21 04:49:41,347) [adjustment/apply_transformations (line 433)]: Adding the variables: ['income_buckets']
INFO (2026-02-21 04:49:41,348) [adjustment/apply_transformations (line 434)]: Transforming the variables: ['age_group', 'gender', 'income']
INFO (2026-02-21 04:49:41,353) [adjustment/apply_transformations (line 469)]: Final variables in output: ['income_buckets', 'age_group', 'gender', 'income']
INFO (2026-02-21 04:49:41,363) [ipw/ipw (line 738)]: Building model matrix
INFO (2026-02-21 04:49:41,477) [ipw/ipw (line 764)]: The formula used to build the model matrix: ['age_group', 'gender', 'income + income_buckets']
INFO (2026-02-21 04:49:41,478) [ipw/ipw (line 767)]: The number of columns in the model matrix: 12
INFO (2026-02-21 04:49:41,479) [ipw/ipw (line 768)]: The number of rows in the model matrix: 11000
INFO (2026-02-21 04:49:57,818) [ipw/ipw (line 990)]: Done with sklearn
INFO (2026-02-21 04:49:57,820) [ipw/ipw (line 992)]: max_de: None
INFO (2026-02-21 04:49:57,820) [ipw/ipw (line 1014)]: Starting model selection
INFO (2026-02-21 04:49:57,824) [ipw/ipw (line 1047)]: Chosen lambda: 0.0894967426547247
INFO (2026-02-21 04:49:57,825) [ipw/ipw (line 1065)]: Proportion null deviance explained 0.172976573763568
INFO (2026-02-21 04:49:57,833) [sample_class/diagnostics (line 1827)]: Starting computation of diagnostics of the fitting
INFO (2026-02-21 04:49:58,153) [sample_class/diagnostics (line 2073)]: Done computing diagnostics
Out[13]:
metric val var
52 model_coef 0.082009 intercept
53 model_coef 0.328090 age_group[18-24]
54 model_coef 0.039778 age_group[25-34]
55 model_coef -0.175993 age_group[35-44]
56 model_coef -0.296406 age_group[45+]
57 model_coef -0.163344 gender[Female]
58 model_coef 0.159121 gender[Male]
59 model_coef 0.000064 gender[_NA]
60 model_coef -0.258803 income
61 model_coef 0.122206 income_buckets[Interval(-0.0009997440000000001...
62 model_coef -0.213231 income_buckets[Interval(17.694, 128.536, close...
63 model_coef 0.076380 income_buckets[Interval(2.53, 8.211, closed='r...
64 model_coef -0.024929 income_buckets[Interval(8.211, 17.694, closed=...

The impact of transformations and formulas¶

ipw¶

Using the above options can impact the final design effect, ASMD, and outcome estimates. Here are several simple examples.

In [14]:
# Defaults from the package

adjusted = sample_with_target.adjust(
    # max_de=None,
)

print(adjusted.summary())
print(adjusted.outcomes().summary())
adjusted.covars().plot(library="seaborn", dist_type="kde")
INFO (2026-02-21 04:49:58,171) [ipw/ipw (line 703)]: Starting ipw function
INFO (2026-02-21 04:49:58,173) [adjustment/apply_transformations (line 433)]: Adding the variables: []
INFO (2026-02-21 04:49:58,174) [adjustment/apply_transformations (line 434)]: Transforming the variables: ['gender', 'age_group', 'income']
INFO (2026-02-21 04:49:58,182) [adjustment/apply_transformations (line 469)]: Final variables in output: ['gender', 'age_group', 'income']
INFO (2026-02-21 04:49:58,191) [ipw/ipw (line 738)]: Building model matrix
INFO (2026-02-21 04:49:58,301) [ipw/ipw (line 764)]: The formula used to build the model matrix: ['income + gender + age_group + _is_na_gender']
INFO (2026-02-21 04:49:58,302) [ipw/ipw (line 767)]: The number of columns in the model matrix: 16
INFO (2026-02-21 04:49:58,302) [ipw/ipw (line 768)]: The number of rows in the model matrix: 11000
INFO (2026-02-21 04:50:15,127) [ipw/ipw (line 990)]: Done with sklearn
INFO (2026-02-21 04:50:15,128) [ipw/ipw (line 992)]: max_de: None
INFO (2026-02-21 04:50:15,129) [ipw/ipw (line 1014)]: Starting model selection
INFO (2026-02-21 04:50:15,131) [ipw/ipw (line 1047)]: Chosen lambda: 0.041158338186664825
INFO (2026-02-21 04:50:15,132) [ipw/ipw (line 1065)]: Proportion null deviance explained 0.172637976731583
Adjustment details:
    method: ipw
    weight trimming mean ratio: 20
Covariate diagnostics:
    Covar ASMD reduction: 63.4%
    Covar ASMD (7 variables): 0.327 -> 0.120
    Covar mean KLD reduction: 92.3%
    Covar mean KLD (3 variables): 0.157 -> 0.012
Weight diagnostics:
    design effect (Deff): 1.880
    effective sample size proportion (ESSP): 0.532
    effective sample size (ESS): 531.9
Outcome weighted means:
            happiness
source               
self           53.295
target         56.278
unadjusted     48.559
Model performance: Model proportion deviance explained: 0.173
1 outcomes: ['happiness']
Mean outcomes (with 95% confidence intervals):
source       self  target  unadjusted           self_ci         target_ci     unadjusted_ci
happiness  53.295  56.278      48.559  (52.096, 54.495)  (55.961, 56.595)  (47.669, 49.449)

Weights impact on outcomes (t_test):
           mean_yw0  mean_yw1  mean_diff  diff_ci_lower  diff_ci_upper  t_stat  p_value       n
outcome                                                                                        
happiness    48.559    53.295      4.736          1.312          8.161   2.714    0.007  1000.0

Response rates (relative to number of respondents in sample):
   happiness
n     1000.0
%      100.0
Response rates (relative to notnull rows in the target):
    happiness
n     1000.0
%       10.0
Response rates (in the target):
    happiness
n    10000.0
%      100.0

[Figure: seaborn KDE plots of the covariate distributions, adjusted sample vs. target]
In [15]:
# No transformations at all

# transformations = None is just like using:
# transformations = {
#     "age_group": lambda x: x,
#     "gender": lambda x: x,
#     "income": lambda x: x,
# }

adjusted = sample_with_target.adjust(
    method="ipw",
    transformations=None,
    # formula=formula,
    # penalty_factor=penalty_factor,
    # max_de=None,
)

print(adjusted.summary())
print(adjusted.outcomes().summary())
adjusted.covars().plot(library="seaborn", dist_type="kde")

# Compared to the defaults: a slightly larger design effect (2.087 vs. 1.880),
# but a better ASMD reduction (68.5% vs. 63.4%).
INFO (2026-02-21 04:50:16,638) [ipw/ipw (line 703)]: Starting ipw function
INFO (2026-02-21 04:50:16,639) [ipw/ipw (line 738)]: Building model matrix
INFO (2026-02-21 04:50:16,720) [ipw/ipw (line 764)]: The formula used to build the model matrix: ['income + gender + age_group + _is_na_gender']
INFO (2026-02-21 04:50:16,720) [ipw/ipw (line 767)]: The number of columns in the model matrix: 8
INFO (2026-02-21 04:50:16,721) [ipw/ipw (line 768)]: The number of rows in the model matrix: 11000
INFO (2026-02-21 04:50:32,908) [ipw/ipw (line 990)]: Done with sklearn
INFO (2026-02-21 04:50:32,909) [ipw/ipw (line 992)]: max_de: None
INFO (2026-02-21 04:50:32,910) [ipw/ipw (line 1014)]: Starting model selection
INFO (2026-02-21 04:50:32,913) [ipw/ipw (line 1047)]: Chosen lambda: 0.0368353720078807
INFO (2026-02-21 04:50:32,914) [ipw/ipw (line 1065)]: Proportion null deviance explained 0.17354102148345973
Adjustment details:
    method: ipw
    weight trimming mean ratio: 20
Covariate diagnostics:
    Covar ASMD reduction: 68.5%
    Covar ASMD (7 variables): 0.327 -> 0.103
    Covar mean KLD reduction: 94.1%
    Covar mean KLD (3 variables): 0.157 -> 0.009
Weight diagnostics:
    design effect (Deff): 2.087
    effective sample size proportion (ESSP): 0.479
    effective sample size (ESS): 479.1
Outcome weighted means:
            happiness
source               
self           53.731
target         56.278
unadjusted     48.559
Model performance: Model proportion deviance explained: 0.174
1 outcomes: ['happiness']
Mean outcomes (with 95% confidence intervals):
source       self  target  unadjusted           self_ci         target_ci     unadjusted_ci
happiness  53.731  56.278      48.559  (52.513, 54.949)  (55.961, 56.595)  (47.669, 49.449)

Weights impact on outcomes (t_test):
           mean_yw0  mean_yw1  mean_diff  diff_ci_lower  diff_ci_upper  t_stat  p_value       n
outcome                                                                                        
happiness    48.559    53.731      5.172          1.461          8.883   2.735    0.006  1000.0

Response rates (relative to number of respondents in sample):
   happiness
n     1000.0
%      100.0
Response rates (relative to notnull rows in the target):
    happiness
n     1000.0
%       10.0
Response rates (in the target):
    happiness
n    10000.0
%      100.0

[Figure: seaborn KDE plots of the covariate distributions, adjusted sample vs. target]
In [16]:
# No transformations at all
transformations = None
# But passing a squared term of income to the formula:
formula = ["age_group + gender + income + income**2"]
# the penalty factor is applied per element of the formula list:
# penalty_factor = [1]

adjusted = sample_with_target.adjust(
    method="ipw",
    transformations=transformations,
    formula=formula,
    # penalty_factor=penalty_factor,
    # max_de=None,
)

print(adjusted.summary())
print(adjusted.outcomes().summary())
adjusted.covars().plot(library="seaborn", dist_type="kde")

# Compared to the previous run, adding income**2 to the formula gave a lower
# design effect (1.925 vs. 2.087) but also a lower ASMD reduction (60.7% vs. 68.5%).
INFO (2026-02-21 04:50:34,562) [ipw/ipw (line 703)]: Starting ipw function
INFO (2026-02-21 04:50:34,564) [ipw/ipw (line 738)]: Building model matrix
INFO (2026-02-21 04:50:34,643) [ipw/ipw (line 764)]: The formula used to build the model matrix: ['age_group + gender + income + income**2']
INFO (2026-02-21 04:50:34,643) [ipw/ipw (line 767)]: The number of columns in the model matrix: 7
INFO (2026-02-21 04:50:34,644) [ipw/ipw (line 768)]: The number of rows in the model matrix: 11000
INFO (2026-02-21 04:50:50,255) [ipw/ipw (line 990)]: Done with sklearn
INFO (2026-02-21 04:50:50,256) [ipw/ipw (line 992)]: max_de: None
INFO (2026-02-21 04:50:50,257) [ipw/ipw (line 1014)]: Starting model selection
INFO (2026-02-21 04:50:50,260) [ipw/ipw (line 1047)]: Chosen lambda: 0.0574164245593571
INFO (2026-02-21 04:50:50,260) [ipw/ipw (line 1065)]: Proportion null deviance explained 0.172963421441365
Adjustment details:
    method: ipw
    weight trimming mean ratio: 20
Covariate diagnostics:
    Covar ASMD reduction: 60.7%
    Covar ASMD (7 variables): 0.327 -> 0.128
    Covar mean KLD reduction: 93.6%
    Covar mean KLD (3 variables): 0.157 -> 0.010
Weight diagnostics:
    design effect (Deff): 1.925
    effective sample size proportion (ESSP): 0.519
    effective sample size (ESS): 519.5
Outcome weighted means:
            happiness
source               
self           53.259
target         56.278
unadjusted     48.559
Model performance: Model proportion deviance explained: 0.173
1 outcomes: ['happiness']
Mean outcomes (with 95% confidence intervals):
source       self  target  unadjusted           self_ci         target_ci     unadjusted_ci
happiness  53.259  56.278      48.559  (52.073, 54.446)  (55.961, 56.595)  (47.669, 49.449)

Weights impact on outcomes (t_test):
           mean_yw0  mean_yw1  mean_diff  diff_ci_lower  diff_ci_upper  t_stat  p_value       n
outcome                                                                                        
happiness    48.559    53.259      4.701          1.289          8.112   2.704    0.007  1000.0

Response rates (relative to number of respondents in sample):
   happiness
n     1000.0
%      100.0
Response rates (relative to notnull rows in the target):
    happiness
n     1000.0
%       10.0
Response rates (in the target):
    happiness
n    10000.0
%      100.0

[Figure: seaborn KDE plots of the covariate distributions, adjusted sample vs. target]
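balance hands the formula strings to patsy to build the model matrix, so it can be worth checking offline what a formula actually expands to. In patsy, the `**` operator acts on sets of terms (R-style), so `income**2` may not produce a literal squared column; wrapping the expression in `I(...)` forces element-wise evaluation. A small stand-alone check, with a made-up dataframe:

```python
import pandas as pd
from patsy import dmatrix

df = pd.DataFrame({"income": [1.0, 2.0, 3.0, 4.0]})

# I(income**2) guarantees an explicit squared column in the design matrix:
mm = dmatrix("income + I(income**2)", df, return_type="dataframe")
print(list(mm.columns))  # intercept, income, and the squared term
print(mm.shape)          # (4, 3)
```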
In [17]:
from balance.util import quantize  # quantile bucketing helper from balance

transformations = {
    "age_group": lambda x: x,
    "gender": lambda x: x,
    "income": lambda x: x,
    "income_buckets": lambda x: quantize(x.income.fillna(x.income.mean()), q=20),
}
formula = ["age_group + gender", "income_buckets"]
# the penalty factor is applied per element of the formula list:
penalty_factor = [1, 0.1]

adjusted = sample_with_target.adjust(
    method="ipw",
    transformations=transformations,
    formula=formula,
    penalty_factor=penalty_factor,
    # max_de=None,
)

print(adjusted.summary())
print(adjusted.outcomes().summary())
adjusted.covars().plot(library="seaborn", dist_type="kde")

# By replacing income with income_buckets, and giving it a smaller penalty factor
# (i.e., more weight in the model), we correct income quite well,
# but at the expense of age and gender (and a larger design effect).
INFO (2026-02-21 04:50:51,775) [ipw/ipw (line 703)]: Starting ipw function
INFO (2026-02-21 04:50:51,777) [adjustment/apply_transformations (line 433)]: Adding the variables: ['income_buckets']
INFO (2026-02-21 04:50:51,778) [adjustment/apply_transformations (line 434)]: Transforming the variables: ['age_group', 'gender', 'income']
INFO (2026-02-21 04:50:51,783) [adjustment/apply_transformations (line 469)]: Final variables in output: ['income_buckets', 'age_group', 'gender', 'income']
INFO (2026-02-21 04:50:51,795) [ipw/ipw (line 738)]: Building model matrix
INFO (2026-02-21 04:50:51,908) [ipw/ipw (line 764)]: The formula used to build the model matrix: ['age_group + gender', 'income_buckets']
INFO (2026-02-21 04:50:51,909) [ipw/ipw (line 767)]: The number of columns in the model matrix: 26
INFO (2026-02-21 04:50:51,909) [ipw/ipw (line 768)]: The number of rows in the model matrix: 11000
INFO (2026-02-21 04:51:19,290) [ipw/ipw (line 990)]: Done with sklearn
INFO (2026-02-21 04:51:19,291) [ipw/ipw (line 992)]: max_de: None
INFO (2026-02-21 04:51:19,291) [ipw/ipw (line 1014)]: Starting model selection
INFO (2026-02-21 04:51:19,296) [ipw/ipw (line 1047)]: Chosen lambda: 0.09460271806598614
INFO (2026-02-21 04:51:19,297) [ipw/ipw (line 1065)]: Proportion null deviance explained 0.17680429430940325
Adjustment details:
    method: ipw
    weight trimming mean ratio: 20
Covariate diagnostics:
    Covar ASMD reduction: 69.9%
    Covar ASMD (7 variables): 0.327 -> 0.098
    Covar mean KLD reduction: 88.6%
    Covar mean KLD (3 variables): 0.157 -> 0.018
Weight diagnostics:
    design effect (Deff): 2.390
    effective sample size proportion (ESSP): 0.418
    effective sample size (ESS): 418.4
Outcome weighted means:
            happiness
source               
self           52.287
target         56.278
unadjusted     48.559
Model performance: Model proportion deviance explained: 0.177
1 outcomes: ['happiness']
Mean outcomes (with 95% confidence intervals):
source       self  target  unadjusted           self_ci         target_ci     unadjusted_ci
happiness  52.287  56.278      48.559  (51.058, 53.517)  (55.961, 56.595)  (47.669, 49.449)

Weights impact on outcomes (t_test):
           mean_yw0  mean_yw1  mean_diff  diff_ci_lower  diff_ci_upper  t_stat  p_value       n
outcome                                                                                        
happiness    48.559    52.287      3.728          -0.21          7.666   1.858    0.063  1000.0

Response rates (relative to number of respondents in sample):
   happiness
n     1000.0
%      100.0
Response rates (relative to notnull rows in the target):
    happiness
n     1000.0
%       10.0
Response rates (in the target):
    happiness
n    10000.0
%      100.0

[Figure: seaborn KDE plots of the covariate distributions, adjusted sample vs. target]
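The quantize call above performs quantile bucketing, which is why the earlier coefficient table lists income_buckets levels as Interval(...) categories. Conceptually it is close to pandas' qcut; this is a rough stand-in, not balance's actual implementation:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
income = pd.Series(rng.lognormal(mean=2.0, sigma=1.0, size=1000))

# Quantile binning into 5 roughly equal-sized buckets (Interval categories):
buckets = pd.qcut(income, q=5)
print(buckets.cat.categories.size)  # 5
print(buckets.value_counts().min(), buckets.value_counts().max())
```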

CBPS¶

Let's see if we can improve on CBPS a bit.

In [18]:
# Defaults from the package

adjusted = sample_with_target.adjust(
    method = "cbps",
    # max_de=None,
)

print(adjusted.summary())
print(adjusted.outcomes().summary())
adjusted.covars().plot(library="seaborn", dist_type="kde")

# CBPS already corrects a lot. Let's see if we can make it correct a tiny bit more.
INFO (2026-02-21 04:51:20,793) [cbps/cbps (line 537)]: Starting cbps function
INFO (2026-02-21 04:51:20,796) [adjustment/apply_transformations (line 433)]: Adding the variables: []
INFO (2026-02-21 04:51:20,796) [adjustment/apply_transformations (line 434)]: Transforming the variables: ['gender', 'age_group', 'income']
INFO (2026-02-21 04:51:20,804) [adjustment/apply_transformations (line 469)]: Final variables in output: ['gender', 'age_group', 'income']
INFO (2026-02-21 04:51:20,923) [cbps/cbps (line 588)]: The formula used to build the model matrix: ['income + gender + age_group + _is_na_gender']
INFO (2026-02-21 04:51:20,925) [cbps/cbps (line 599)]: The number of columns in the model matrix: 16
INFO (2026-02-21 04:51:20,925) [cbps/cbps (line 600)]: The number of rows in the model matrix: 11000
INFO (2026-02-21 04:51:20,935) [cbps/cbps (line 669)]: Finding initial estimator for GMM optimization
INFO (2026-02-21 04:51:21,079) [cbps/cbps (line 696)]: Finding initial estimator for GMM optimization that minimizes the balance loss
INFO (2026-02-21 04:51:22,568) [cbps/cbps (line 732)]: Running GMM optimization
INFO (2026-02-21 04:51:24,158) [cbps/cbps (line 859)]: Done cbps function
Adjustment details:
    method: cbps
Covariate diagnostics:
    Covar ASMD reduction: 77.4%
    Covar ASMD (7 variables): 0.327 -> 0.074
    Covar mean KLD reduction: 98.2%
    Covar mean KLD (3 variables): 0.157 -> 0.003
Weight diagnostics:
    design effect (Deff): 2.754
    effective sample size proportion (ESSP): 0.363
    effective sample size (ESS): 363.1
Outcome weighted means:
            happiness
source               
self           54.366
target         56.278
unadjusted     48.559
1 outcomes: ['happiness']
Mean outcomes (with 95% confidence intervals):
source       self  target  unadjusted          self_ci         target_ci     unadjusted_ci
happiness  54.366  56.278      48.559  (53.003, 55.73)  (55.961, 56.595)  (47.669, 49.449)

Weights impact on outcomes (t_test):
           mean_yw0  mean_yw1  mean_diff  diff_ci_lower  diff_ci_upper  t_stat  p_value       n
outcome                                                                                        
happiness    48.559    54.366      5.807          0.911         10.703   2.328     0.02  1000.0

Response rates (relative to number of respondents in sample):
   happiness
n     1000.0
%      100.0
Response rates (relative to notnull rows in the target):
    happiness
n     1000.0
%       10.0
Response rates (in the target):
    happiness
n    10000.0
%      100.0

[Figure: seaborn KDE plots of the covariate distributions, adjusted sample vs. target]
In [19]:
import numpy as np

# Custom transformations: drop raw income; use its log and quantile buckets instead
from balance.util import quantize  # quantile bucketing helper from balance

transformations = {
    "age_group": lambda x: x,
    "gender": lambda x: x,
    # "income": lambda x: x,
    "income_log": lambda x: np.log(x.income.fillna(x.income.mean())),
    "income_buckets": lambda x: quantize(x.income.fillna(x.income.mean()), q=5),
}
formula = ["age_group + gender + income_log * income_buckets"]

adjusted = sample_with_target.adjust(
    method="cbps",
    transformations=transformations,
    formula=formula,
    # penalty_factor=penalty_factor, # CBPS seems to ignore the penalty factor.
    # max_de=None,
)

print(adjusted.summary())
print(adjusted.outcomes().summary())
adjusted.covars().plot(library="seaborn", dist_type="kde")

# The custom transformations nudge the results: a slightly better ASMD reduction
# (82.1% vs. 77.4%) at the cost of a larger design effect (3.030 vs. 2.754).
# Nothing too major here.
INFO (2026-02-21 04:51:25,624) [cbps/cbps (line 537)]: Starting cbps function
INFO (2026-02-21 04:51:25,626) [adjustment/apply_transformations (line 433)]: Adding the variables: ['income_log', 'income_buckets']
INFO (2026-02-21 04:51:25,626) [adjustment/apply_transformations (line 434)]: Transforming the variables: ['age_group', 'gender']
WARNING (2026-02-21 04:51:25,632) [adjustment/apply_transformations (line 466)]: Dropping the variables: ['income']
INFO (2026-02-21 04:51:25,633) [adjustment/apply_transformations (line 469)]: Final variables in output: ['income_log', 'income_buckets', 'age_group', 'gender']
INFO (2026-02-21 04:51:25,754) [cbps/cbps (line 588)]: The formula used to build the model matrix: ['age_group + gender + income_log * income_buckets']
INFO (2026-02-21 04:51:25,756) [cbps/cbps (line 599)]: The number of columns in the model matrix: 15
INFO (2026-02-21 04:51:25,757) [cbps/cbps (line 600)]: The number of rows in the model matrix: 11000
INFO (2026-02-21 04:51:25,767) [cbps/cbps (line 669)]: Finding initial estimator for GMM optimization
INFO (2026-02-21 04:51:25,918) [cbps/cbps (line 696)]: Finding initial estimator for GMM optimization that minimizes the balance loss
INFO (2026-02-21 04:51:27,437) [cbps/cbps (line 732)]: Running GMM optimization
INFO (2026-02-21 04:51:28,895) [cbps/cbps (line 859)]: Done cbps function
Adjustment details:
    method: cbps
Covariate diagnostics:
    Covar ASMD reduction: 82.1%
    Covar ASMD (7 variables): 0.327 -> 0.059
    Covar mean KLD reduction: 98.4%
    Covar mean KLD (3 variables): 0.157 -> 0.002
Weight diagnostics:
    design effect (Deff): 3.030
    effective sample size proportion (ESSP): 0.330
    effective sample size (ESS): 330.1
Outcome weighted means:
            happiness
source               
self           54.432
target         56.278
unadjusted     48.559
1 outcomes: ['happiness']
Mean outcomes (with 95% confidence intervals):
source       self  target  unadjusted           self_ci         target_ci     unadjusted_ci
happiness  54.432  56.278      48.559  (53.042, 55.822)  (55.961, 56.595)  (47.669, 49.449)

Weights impact on outcomes (t_test):
           mean_yw0  mean_yw1  mean_diff  diff_ci_lower  diff_ci_upper  t_stat  p_value       n
outcome                                                                                        
happiness    48.559    54.432      5.873          0.735         11.011   2.243    0.025  1000.0

Response rates (relative to number of respondents in sample):
   happiness
n     1000.0
%      100.0
Response rates (relative to notnull rows in the target):
    happiness
n     1000.0
%       10.0
Response rates (in the target):
    happiness
n    10000.0
%      100.0

[Figure: seaborn KDE plots of the covariate distributions, adjusted sample vs. target]
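Throughout these examples, the ASMD summarizes how far weighted sample means are from the target. For a single numeric covariate, the building block is the absolute standardized mean difference; here is a rough sketch of that idea (not balance's exact implementation, which averages over all model-matrix columns):

```python
import numpy as np

def asmd_one_covar(x, w, target_x):
    """|weighted sample mean - target mean| / target std, for one covariate."""
    x, w, target_x = (np.asarray(a, dtype=float) for a in (x, w, target_x))
    weighted_mean = np.sum(w * x) / np.sum(w)
    return abs(weighted_mean - target_x.mean()) / target_x.std()

rng = np.random.default_rng(0)
target = rng.normal(loc=0.0, scale=1.0, size=10_000)
sample = rng.normal(loc=0.5, scale=1.0, size=1_000)  # biased toward higher values

unweighted = asmd_one_covar(sample, np.ones_like(sample), target)
# Exponential tilting toward lower values shifts the weighted mean back toward 0:
weighted = asmd_one_covar(sample, np.exp(-0.5 * sample), target)
print(weighted < unweighted)  # True
```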
In [20]:
# Sessions info
import session_info
session_info.show(html=False, dependencies=True)
-----
balance             0.16.1
numpy               2.4.2
pandas              3.0.1
session_info        v1.0.1
-----
PIL                         12.1.1
anyio                       NA
arrow                       1.4.0
asttokens                   NA
attr                        25.4.0
attrs                       25.4.0
babel                       2.18.0
certifi                     2026.01.04
charset_normalizer          3.4.4
comm                        0.2.3
cycler                      0.12.1
cython_runtime              NA
dateutil                    2.9.0.post0
debugpy                     1.8.20
decorator                   5.2.1
defusedxml                  0.7.1
executing                   2.2.1
fastjsonschema              NA
fqdn                        NA
idna                        3.11
ipykernel                   7.2.0
isoduration                 NA
jedi                        0.19.2
jinja2                      3.1.6
joblib                      1.5.3
json5                       0.13.0
jsonpointer                 3.0.0
jsonschema                  4.26.0
jsonschema_specifications   NA
jupyter_events              0.12.0
jupyter_server              2.17.0
jupyterlab_server           2.28.0
kiwisolver                  1.4.9
lark                        1.3.1
markupsafe                  3.0.3
matplotlib                  3.10.8
matplotlib_inline           0.2.1
mpl_toolkits                NA
narwhals                    2.16.0
nbformat                    5.10.4
packaging                   26.0
parso                       0.8.6
patsy                       1.0.2
platformdirs                4.9.2
plotly                      6.5.2
prometheus_client           NA
prompt_toolkit              3.0.52
psutil                      7.2.2
pure_eval                   0.2.3
pydev_ipython               NA
pydevconsole                NA
pydevd                      3.2.3
pydevd_file_utils           NA
pydevd_plugins              NA
pydevd_tracing              NA
pygments                    2.19.2
pyparsing                   3.3.2
pythonjsonlogger            NA
referencing                 NA
requests                    2.32.5
rfc3339_validator           0.1.4
rfc3986_validator           0.1.1
rfc3987_syntax              NA
rpds                        NA
scipy                       1.17.0
seaborn                     0.13.2
send2trash                  NA
six                         1.17.0
sklearn                     1.8.0
sphinxcontrib               NA
stack_data                  0.6.3
statsmodels                 0.14.6
threadpoolctl               3.6.0
tornado                     6.5.4
traitlets                   5.14.3
typing_extensions           NA
uri_template                NA
urllib3                     2.6.3
wcwidth                     0.6.0
webcolors                   NA
websocket                   1.9.0
yaml                        6.0.3
zmq                         27.1.0
zoneinfo                    NA
-----
IPython             9.10.0
jupyter_client      8.8.0
jupyter_core        5.9.1
jupyterlab          4.5.4
notebook            7.5.3
-----
Python 3.12.12 (main, Oct 10 2025, 01:01:16) [GCC 13.3.0]
Linux-6.11.0-1018-azure-x86_64-with-glibc2.39
-----
Session information updated at 2026-02-21 04:51