balance Quickstart: New API (SampleFrame / BalanceFrame)¶
This tutorial demonstrates the new SampleFrame + BalanceFrame API
introduced in balance 0.18.0. It mirrors the original
balance_quickstart.ipynb step-by-step, but uses only the new classes —
no Sample object is needed.
Why a new API?¶
| Old API (Sample) | New API (SampleFrame / BalanceFrame) |
|---|---|
| Column roles inferred by exclusion | Column roles declared explicitly |
| Mutable .set_target() / .adjust() | Immutable — adjust() returns a new object |
| One class does everything | Clear separation: data container vs. adjustment orchestrator |
| Weight provenance not tracked | Weight metadata recorded per-column |
The old Sample API still works and is fully supported; this notebook
simply shows how to do the same analysis with the new classes.
Analysis¶
There are four main steps to an analysis with the new API:

- Load data into pandas DataFrames
- Create SampleFrame objects with explicit column roles
- Build a BalanceFrame, adjust, and inspect diagnostics
- Output results (CSV, download)
Example dataset¶
The following is a toy simulated dataset (same data used in the original quickstart).
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
from balance import load_data
target_df, sample_df = load_data()
print("target_df: \n", target_df.head())
print("sample_df: \n", sample_df.head())
INFO (2026-04-09 17:19:51,344) [__init__/<module> (line 75)]: Using balance version 0.19.0
INFO (2026-04-09 17:19:51,344) [__init__/<module> (line 80)]:
balance (Version 0.19.0) loaded:
📖 Documentation: https://import-balance.org/
🛠️ Help / Issues: https://github.com/facebookresearch/balance/issues/
📄 Citation:
Sarig, T., Galili, T., & Eilat, R. (2023).
balance - a Python package for balancing biased data samples.
https://arxiv.org/abs/2307.06024
Tip: You can view this message anytime with balance.help()
target_df:
id gender age_group income happiness
0 100000 Male 45+ 10.183951 61.706333
1 100001 Male 45+ 6.036858 79.123670
2 100002 Male 35-44 5.226629 44.206949
3 100003 NaN 45+ 5.752147 83.985716
4 100004 NaN 25-34 4.837484 49.339713
sample_df:
id gender age_group income happiness
0 0 Male 25-34 6.428659 26.043029
1 1 Female 18-24 9.940280 66.885485
2 2 Male 18-24 2.673623 37.091922
3 3 NaN 18-24 10.550308 49.394050
4 4 NaN 18-24 2.689994 72.304208
In practice, you can use pandas.read_csv() (or any pandas loader) to
import your own data. The new API also provides SampleFrame.from_csv()
for a one-step shortcut.
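As a minimal sketch of the pandas route (the inline CSV below is a hypothetical stand-in for your own file; in practice you would pass a path such as "my_sample.csv" to pandas.read_csv):

```python
import io

import pandas as pd

# In practice: sample_df = pd.read_csv("my_sample.csv").
# An inline CSV keeps this sketch self-contained.
csv_text = """id,gender,age_group,income,happiness
0,Male,25-34,6.43,26.04
1,Female,18-24,9.94,66.89
"""
sample_df = pd.read_csv(io.StringIO(csv_text))
print(sample_df.shape)  # (2, 5)
```

The resulting DataFrame can then be handed to SampleFrame.from_frame() exactly as in the next section.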
Load data into SampleFrame objects¶
With the old API you would call Sample.from_frame(df). The new API
uses SampleFrame.from_frame() where you explicitly declare which
columns are covariates, outcomes, etc. If you omit these arguments,
the factory auto-detects roles the same way Sample does (by exclusion
from the id and weight columns).
from balance import SampleFrame, BalanceFrame
sample_sf = SampleFrame.from_frame(
sample_df,
outcome_columns=["happiness"],
)
# Often we don't have the outcome for the target.
# In this case we've added it just to validate later that
# the weights indeed help reduce the bias.
target_sf = SampleFrame.from_frame(
target_df,
outcome_columns=["happiness"],
)
WARNING (2026-04-09 17:19:51,395) [input_validation/guess_id_column (line 336)]: Guessed id column name id for the data
WARNING (2026-04-09 17:19:51,407) [sample_frame/from_frame (line 326)]: No weights passed. Adding a 'weight' column and setting all values to 1
WARNING (2026-04-09 17:19:51,409) [input_validation/guess_id_column (line 336)]: Guessed id column name id for the data
WARNING (2026-04-09 17:19:51,424) [sample_frame/from_frame (line 326)]: No weights passed. Adding a 'weight' column and setting all values to 1
Inspecting a SampleFrame¶
You can inspect the column roles and data shape at any time.
Unlike Sample.df, each role is a separate property — no magic
"everything-that's-left" inference.
print(f"Covariates: {sample_sf.covar_columns}")
print(f"Outcomes: {sample_sf.outcome_columns}")
print(f"Weight cols: {sample_sf.weight_columns_all}")
print(f"Active wt: {sample_sf.weight_column}")
print(f"ID column: {sample_sf.id_column_name}")
print(f"Rows: {len(sample_sf)}")
Covariates: ['gender', 'age_group', 'income']
Outcomes: ['happiness']
Weight cols: ['weight']
Active wt: weight
ID column: id
Rows: 1000
sample_sf.df.info()
<class 'pandas.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   id         1000 non-null   str
 1   gender     912 non-null    str
 2   age_group  1000 non-null   str
 3   income     1000 non-null   float64
 4   happiness  1000 non-null   float64
 5   weight     1000 non-null   float64
dtypes: float64(3), str(3)
memory usage: 47.0 KB
print(sample_sf)
print(target_sf)
SampleFrame: 1000 observations x 3 covariates: gender,age_group,income
id_column: id, weight_columns_all: ['weight'], outcome_columns: happiness
SampleFrame: 10000 observations x 3 covariates: gender,age_group,income
id_column: id, weight_columns_all: ['weight'], outcome_columns: happiness
Create a BalanceFrame¶
With the old API you would call sample.set_target(target). The new
API constructs a BalanceFrame directly from two SampleFrame objects.
A BalanceFrame is immutable — adjust() returns a new
BalanceFrame rather than mutating the existing one.
bf = BalanceFrame(sample=sample_sf, target=target_sf)
print(bf)
balance Sample object with target set
1000 observations x 3 variables: gender,age_group,income
id_column: id, weight_column: weight,
outcome_columns: happiness
target:
SampleFrame: 10000 observations x 3 covariates: gender,age_group,income
id_column: id, weight_columns_all: ['weight'], outcome_columns: happiness
3 common variables: gender,age_group,income
Pre-Adjustment Diagnostics¶
The .covars(), .weights(), and .outcomes() methods return the
same BalanceDFCovars / BalanceDFWeights / BalanceDFOutcomes
objects as the old API. All of .mean(), .asmd(), .plot(), etc.
work identically.
print(bf.covars().mean().T)
source                       self     target
_is_na_gender[T.True]    0.088000   0.089800
age_group[T.25-34]       0.300000   0.297400
age_group[T.35-44]       0.156000   0.299200
age_group[T.45+]         0.053000   0.206300
gender[Female]           0.268000   0.455100
gender[Male]             0.644000   0.455100
gender[_NA]              0.088000   0.089800
income                   6.297302  12.737608
print(bf.covars().asmd().T)
source                    self
age_group[T.25-34]    0.005688
age_group[T.35-44]    0.312711
age_group[T.45+]      0.378828
gender[Female]        0.375699
gender[Male]          0.379314
gender[_NA]           0.006296
income                0.494217
mean(asmd)            0.326799
print(bf.covars().asmd(aggregate_by_main_covar=True).T)
source            self
age_group     0.232409
gender        0.253769
income        0.494217
mean(asmd)    0.326799
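The ASMD values above can be approximated by hand. A minimal sketch, assuming the common convention of scaling the absolute mean gap by the target's standard deviation (balance's exact choices for the denominator and for categorical encoding may differ):

```python
import numpy as np

def asmd(sample_vals, target_vals):
    # Absolute standardized mean difference: the absolute gap between the
    # sample and target means, scaled by the target's standard deviation.
    gap = abs(np.mean(sample_vals) - np.mean(target_vals))
    return gap / np.std(target_vals, ddof=1)

rng = np.random.default_rng(0)
target = rng.normal(loc=12.7, scale=13.0, size=10_000)  # income-like target
sample = rng.normal(loc=6.3, scale=6.0, size=1_000)     # biased sample
print(round(asmd(sample, target), 2))  # roughly 0.5 for a gap of this size
```

An ASMD of 0 means the weighted sample mean matches the target mean; values near 0.5, as for income here, indicate substantial imbalance.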
Visualizing the unadjusted comparison¶
bf.covars().plot()
Adjusting Sample to Population¶
The default method is 'ipw' (inverse probability/propensity weights
via logistic regression with lasso regularization).
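Conceptually, IPW fits a classifier to distinguish sample rows from target rows and weights each sample row by its odds of belonging to the target. A bare sketch with scikit-learn on toy data (plain logistic regression; balance's implementation adds lasso regularization, formula-based covariate transformations, and weight trimming):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
x_sample = rng.normal(5.0, 2.0, size=500)    # sample skews low
x_target = rng.normal(8.0, 2.0, size=2000)   # target sits higher

# Label rows by origin and fit a propensity model.
X = np.concatenate([x_sample, x_target]).reshape(-1, 1)
y = np.concatenate([np.zeros(500), np.ones(2000)])  # 1 = target row
p = LogisticRegression().fit(X, y).predict_proba(x_sample.reshape(-1, 1))[:, 1]

weights = p / (1 - p)                    # odds of being a target row
weights *= len(weights) / weights.sum()  # normalize to mean 1

# The weighted sample mean moves toward the target mean:
print(x_sample.mean(), np.average(x_sample, weights=weights), x_target.mean())
```

Rows that look more "target-like" get larger weights, which is exactly the correction adjust() computes (with more care) for the gender/age/income imbalance in this tutorial.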
Key difference from the old API: adjust() returns a new
BalanceFrame — the original bf is unchanged.
adjusted = bf.adjust()
print(adjusted)
INFO (2026-04-09 17:19:57,466) [ipw/ipw (line 703)]: Starting ipw function
INFO (2026-04-09 17:19:57,468) [adjustment/apply_transformations (line 433)]: Adding the variables: []
INFO (2026-04-09 17:19:57,469) [adjustment/apply_transformations (line 434)]: Transforming the variables: ['gender', 'age_group', 'income']
INFO (2026-04-09 17:19:57,480) [adjustment/apply_transformations (line 469)]: Final variables in output: ['gender', 'age_group', 'income']
INFO (2026-04-09 17:19:57,488) [ipw/ipw (line 738)]: Building model matrix
INFO (2026-04-09 17:19:57,601) [ipw/ipw (line 764)]: The formula used to build the model matrix: ['income + gender + age_group + _is_na_gender']
INFO (2026-04-09 17:19:57,602) [ipw/ipw (line 767)]: The number of columns in the model matrix: 16
INFO (2026-04-09 17:19:57,602) [ipw/ipw (line 768)]: The number of rows in the model matrix: 11000
INFO (2026-04-09 17:20:15,411) [ipw/ipw (line 990)]: Done with sklearn
INFO (2026-04-09 17:20:15,412) [ipw/ipw (line 992)]: max_de: None
INFO (2026-04-09 17:20:15,413) [ipw/ipw (line 1014)]: Starting model selection
INFO (2026-04-09 17:20:15,416) [ipw/ipw (line 1047)]: Chosen lambda: 0.041158338186664825
INFO (2026-04-09 17:20:15,417) [ipw/ipw (line 1065)]: Proportion null deviance explained 0.172637976731583
Adjusted balance Sample object with target set using ipw
1000 observations x 3 variables: gender,age_group,income,weight_pre_adjust,weight_adjusted_1
id_column: id, weight_column: weight,
outcome_columns: happiness
adjustment details:
method: ipw
weight trimming mean ratio: 20
design effect (Deff): 1.880
effective sample size proportion (ESSP): 0.532
effective sample size (ESS): 531.9
target:
SampleFrame: 10000 observations x 3 covariates: gender,age_group,income
id_column: id, weight_columns_all: ['weight'], outcome_columns: happiness
3 common variables: gender,age_group,income
# The original is still unadjusted:
print(f"bf.is_adjusted = {bf.is_adjusted}")
print(f"adjusted.is_adjusted = {adjusted.is_adjusted}")
bf.is_adjusted = False
adjusted.is_adjusted = True
Evaluation of the Results¶
print(adjusted.summary())
Adjustment details:
method: ipw
weight trimming mean ratio: 20
Covariate diagnostics:
Covar ASMD reduction: 63.4%
Covar ASMD (7 variables): 0.327 -> 0.120
Covar mean KLD reduction: 92.3%
Covar mean KLD (3 variables): 0.157 -> 0.012
Weight diagnostics:
design effect (Deff): 1.880
effective sample size proportion (ESSP): 0.532
effective sample size (ESS): 531.9
Outcome weighted means:
happiness
source
self 53.295
target 56.278
unadjusted 48.559
Model performance:
    Model proportion deviance explained: 0.173
print(adjusted.covars().mean().T)
source                       self     target  unadjusted
_is_na_gender[T.True]    0.086776   0.089800    0.088000
age_group[T.25-34]       0.307355   0.297400    0.300000
age_group[T.35-44]       0.273609   0.299200    0.156000
age_group[T.45+]         0.137581   0.206300    0.053000
gender[Female]           0.406337   0.455100    0.268000
gender[Male]             0.506887   0.455100    0.644000
gender[_NA]              0.086776   0.089800    0.088000
income                  10.060068  12.737608    6.297302
We see an improvement in the average ASMD. Detailed per-variable ASMD:
print(adjusted.covars().asmd().T)
source                    self  unadjusted  unadjusted - self
age_group[T.25-34]    0.021777    0.005688          -0.016090
age_group[T.35-44]    0.055884    0.312711           0.256827
age_group[T.45+]      0.169816    0.378828           0.209013
gender[Female]        0.097916    0.375699           0.277783
gender[Male]          0.103989    0.379314           0.275324
gender[_NA]           0.010578    0.006296          -0.004282
income                0.205469    0.494217           0.288748
mean(asmd)            0.119597    0.326799           0.207202
Covariate plots¶
adjusted.covars().plot()
# Seaborn KDE density plots
adjusted.covars().plot(library="seaborn", dist_type="kde")
ASCII plots¶
Use library="balance" for a text-based comparison of unadjusted,
adjusted, and target — useful in terminals or logging contexts.
adjusted.covars().plot(library="balance", bar_width=30);
=== gender (categorical) ===
Category | population adjusted sample
|
Female | █████████████████████ (50.0%)
| ▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒ (44.5%)
| ▐▐▐▐▐▐▐▐▐▐▐▐ (29.4%)
Male | █████████████████████ (50.0%)
| ▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒ (55.5%)
| ▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐ (70.6%)
Legend: █ population ▒ adjusted ▐ sample
Bar lengths are proportional to weighted frequency within each dataset.
=== age_group (categorical) ===
Category | population adjusted sample
|
18-24 | ████████████ (19.7%)
| ▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒ (28.1%)
| ▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐ (49.1%)
25-34 | ██████████████████ (29.7%)
| ▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒ (30.7%)
| ▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐ (30.0%)
35-44 | ██████████████████ (29.9%)
| ▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒ (27.4%)
| ▐▐▐▐▐▐▐▐▐▐ (15.6%)
45+ | █████████████ (20.6%)
| ▒▒▒▒▒▒▒▒ (13.8%)
| ▐▐▐ (5.3%)
Legend: █ population ▒ adjusted ▐ sample
Bar lengths are proportional to weighted frequency within each dataset.
=== income (numeric, comparative) ===
Range | population (%) | adjusted (%) | sample (%)
----------------------------------------------------------------------------------------------------------------
[0.00, 8.57) | ████████████████████ 49.0 | ████████████████████▒▒ 54.8 | ████████████████████▒▒▒▒▒▒▒▒▒▒ 73.2
[8.57, 17.14) | █████████ 23.1 | █████████▒▒ 26.3 | ████████] 19.2
[17.14, 25.71) | █████ 13.2 | █████ 12.3 | ██ ] 5.3
[25.71, 34.28) | ███ 7.3 | ██] 3.9 | █ ] 1.6
[34.28, 42.85) | ██ 3.9 | █] 1.5 | ] 0.4
[42.85, 51.41) | █ 1.8 | ] 0.2 | ] 0.1
[51.41, 59.98) | 0.9 | 1.0 | 0.2
[59.98, 68.55) | 0.4 | 0.0 | 0.0
[68.55, 77.12) | 0.2 | 0.0 | 0.0
[77.12, 85.69) | 0.1 | 0.0 | 0.0
[85.69, 94.26) | 0.0 | 0.0 | 0.0
[94.26, 102.83) | 0.0 | 0.0 | 0.0
[102.83, 111.40) | 0.0 | 0.0 | 0.0
[111.40, 119.97) | 0.0 | 0.0 | 0.0
[119.97, 128.54] | 0.0 | 0.0 | 0.0
----------------------------------------------------------------------------------------------------------------
Total | 100.0 | 100.0 | 100.0
Key: █ = shared with population, ▒ = excess, ] = deficit
Understanding the weights¶
adjusted.weights().plot()
print(adjusted.weights().summary().round(2))
                                var       val
0                     design_effect      1.88
1       effective_sample_proportion      0.53
2             effective_sample_size    531.92
3                               sum  10000.00
4                    describe_count   1000.00
5                     describe_mean      1.00
6                      describe_std      0.94
7                      describe_min      0.30
8                      describe_25%      0.45
9                      describe_50%      0.65
10                     describe_75%      1.17
11                     describe_max     11.36
12                    prop(w < 0.1)      0.00
13                    prop(w < 0.2)      0.00
14                  prop(w < 0.333)      0.11
15                    prop(w < 0.5)      0.32
16                      prop(w < 1)      0.67
17                     prop(w >= 1)      0.33
18                     prop(w >= 2)      0.10
19                     prop(w >= 3)      0.03
20                     prop(w >= 5)      0.01
21                    prop(w >= 10)      0.00
22               nonparametric_skew      0.37
23  weighted_median_breakdown_point      0.21
Design effect and effective sample size¶
The new API exposes design effect diagnostics through the weights view:
adjusted.weights().design_effect() and adjusted.weights().design_effect_prop().
print(f"Design effect: {adjusted.weights().design_effect():.4f}")
print(f"Effective sample size %: {adjusted.weights().design_effect_prop():.2%}")
Design effect: 1.8800
Effective sample size %: 53.19%
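These diagnostics follow Kish's formula: Deff = n Σw² / (Σw)², with ESS = n / Deff and the effective sample size proportion equal to 1 / Deff. A minimal sketch (assuming this standard definition, which is consistent with the Deff ≈ 1.88 and ESS ≈ 531.9 reported above):

```python
import numpy as np

def kish_design_effect(w):
    # Kish: Deff = n * sum(w^2) / (sum(w))^2; equals 1 for uniform weights.
    w = np.asarray(w, dtype=float)
    return len(w) * np.sum(w**2) / np.sum(w) ** 2

print(kish_design_effect([1.0, 1.0, 1.0, 1.0]))  # 1.0 (uniform weights)

w = [0.3, 0.5, 1.0, 2.2]
deff = kish_design_effect(w)
ess = len(w) / deff   # effective sample size
essp = 1 / deff       # effective sample size proportion
print(deff, ess, essp)
```

The more the weights vary, the larger Deff grows and the smaller the effective sample becomes, which is why heavily corrected samples pay a variance penalty.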
Outcome analysis¶
print(adjusted.outcomes().summary())
1 outcomes: ['happiness']
Mean outcomes (with 95% confidence intervals):
source self target unadjusted self_ci target_ci unadjusted_ci
happiness 53.295 56.278 48.559 (52.096, 54.495) (55.961, 56.595) (47.669, 49.449)
Weights impact on outcomes (t_test):
mean_yw0 mean_yw1 mean_diff diff_ci_lower diff_ci_upper t_stat p_value n
outcome
happiness 48.559 53.295 4.736 1.312 8.161 2.714 0.007 1000.0
Response rates (relative to number of respondents in sample):
happiness
n 1000.0
% 100.0
Response rates (relative to notnull rows in the target):
happiness
n 1000.0
% 10.0
Response rates (in the target):
happiness
n 10000.0
% 100.0
adjusted.outcomes().plot()
Analytics via BalanceDF views¶
The new API accesses analytics through the .covars(), .weights(),
and .outcomes() views rather than top-level convenience methods:
print("Covariate means (unadjusted / adjusted / target):")
print(adjusted.covars().mean().T)
Covariate means (unadjusted / adjusted / target):
source                       self     target  unadjusted
_is_na_gender[T.True]    0.086776   0.089800    0.088000
age_group[T.25-34]       0.307355   0.297400    0.300000
age_group[T.35-44]       0.273609   0.299200    0.156000
age_group[T.45+]         0.137581   0.206300    0.053000
gender[Female]           0.406337   0.455100    0.268000
gender[Male]             0.506887   0.455100    0.644000
gender[_NA]              0.086776   0.089800    0.088000
income                  10.060068  12.737608    6.297302
print("Outcome SD proportional change:")
print(adjusted.outcomes().outcome_sd_prop())
print()
print("Outcome variance ratio (adjusted / unadjusted):")
print(adjusted.outcomes().outcome_variance_ratio())
Outcome SD proportional change:
happiness    0.013516
dtype: float64

Outcome variance ratio (adjusted / unadjusted):
happiness    1.027215
dtype: float64
Other Adjustment Methods¶
BalanceFrame supports all the same methods as Sample:
"ipw"— inverse propensity weighting (default)"cbps"— covariate balancing propensity score"rake"— iterative proportional fitting (raking)"poststratify"— post-stratification
Each returns a new BalanceFrame — the original stays unchanged.
adjusted_cbps = bf.adjust(method="cbps")
print(f"CBPS design effect: {adjusted_cbps.weights().design_effect():.4f}")
print(adjusted_cbps.covars().asmd(aggregate_by_main_covar=True).T)
INFO (2026-04-09 17:20:18,381) [cbps/cbps (line 538)]: Starting cbps function
INFO (2026-04-09 17:20:18,383) [adjustment/apply_transformations (line 433)]: Adding the variables: []
INFO (2026-04-09 17:20:18,384) [adjustment/apply_transformations (line 434)]: Transforming the variables: ['gender', 'age_group', 'income']
INFO (2026-04-09 17:20:18,392) [adjustment/apply_transformations (line 469)]: Final variables in output: ['gender', 'age_group', 'income']
INFO (2026-04-09 17:20:18,511) [cbps/cbps (line 589)]: The formula used to build the model matrix: ['income + gender + age_group + _is_na_gender']
INFO (2026-04-09 17:20:18,513) [cbps/cbps (line 600)]: The number of columns in the model matrix: 16
INFO (2026-04-09 17:20:18,513) [cbps/cbps (line 601)]: The number of rows in the model matrix: 11000
INFO (2026-04-09 17:20:18,520) [cbps/cbps (line 670)]: Finding initial estimator for GMM optimization
INFO (2026-04-09 17:20:18,676) [cbps/cbps (line 697)]: Finding initial estimator for GMM optimization that minimizes the balance loss
INFO (2026-04-09 17:20:20,129) [cbps/cbps (line 733)]: Running GMM optimization
INFO (2026-04-09 17:20:21,701) [cbps/cbps (line 860)]: Done cbps function
CBPS design effect: 2.7543

source            self  unadjusted  unadjusted - self
age_group     0.064140    0.232409           0.168269
gender        0.044220    0.253769           0.209549
income        0.113018    0.494217           0.381199
mean(asmd)    0.073793    0.326799           0.253006
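Of the methods above, post-stratification is the simplest to sketch by hand: each stratum's weight is the ratio of its target share to its sample share. A toy one-variable example (hypothetical shares, not the tutorial's data):

```python
import pandas as pd

# Toy strata: the sample over-represents men relative to the target.
sample = pd.DataFrame({"gender": ["Male"] * 64 + ["Female"] * 36})
target = pd.DataFrame({"gender": ["Male"] * 50 + ["Female"] * 50})

# Per-stratum weight = target share / sample share.
cell_weight = (
    target["gender"].value_counts(normalize=True)
    / sample["gender"].value_counts(normalize=True)
)
weights = sample["gender"].map(cell_weight)

# Weighted shares now match the target exactly.
weighted_share = weights.groupby(sample["gender"]).sum() / weights.sum()
print(weighted_share)  # ≈ 0.5 for each gender
```

This exact matching per cell is also why post-stratification only scales to a handful of variables: with many covariates the cells become tiny or empty, which is where model-based methods like IPW and CBPS take over.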
Compound / Sequential Adjustments¶
adjust() can be called multiple times — each call uses the
previously adjusted weights as design weights, so adjustments compound.
This is useful for multi-step workflows, e.g., IPW for broad correction
followed by raking for fine-tuning on specific variables.
The original unadjusted baseline is always preserved:
- _sf_sample_pre_adjust points to the original SampleFrame
- _links["unadjusted"] points to the original BalanceFrame
- asmd_improvement() shows total improvement across all steps
# Step 1: broad IPW correction across all covariates
adjusted_ipw = bf.adjust(method="ipw", max_de=2)
# Step 2: fine-tune with raking on gender and age_group
adjusted_final = adjusted_ipw.adjust(method="rake", variables=["gender", "age_group"])
print("After IPW only:")
print(adjusted_ipw.covars().asmd(aggregate_by_main_covar=True).T)
print("\nAfter IPW + rake on gender & age_group:")
print(adjusted_final.covars().asmd(aggregate_by_main_covar=True).T)
print(f"\nTotal ASMD improvement (vs original): {adjusted_final.covars().asmd_improvement():.2%}")
# The original BalanceFrame is unchanged (immutable pattern)
print(f"\nbf.is_adjusted = {bf.is_adjusted}")
print(f"adjusted_final.is_adjusted = {adjusted_final.is_adjusted}")
INFO (2026-04-09 17:20:21,858) [ipw/ipw (line 703)]: Starting ipw function
INFO (2026-04-09 17:20:21,860) [adjustment/apply_transformations (line 433)]: Adding the variables: []
INFO (2026-04-09 17:20:21,861) [adjustment/apply_transformations (line 434)]: Transforming the variables: ['gender', 'age_group', 'income']
INFO (2026-04-09 17:20:21,869) [adjustment/apply_transformations (line 469)]: Final variables in output: ['gender', 'age_group', 'income']
INFO (2026-04-09 17:20:21,877) [ipw/ipw (line 738)]: Building model matrix
INFO (2026-04-09 17:20:21,987) [ipw/ipw (line 764)]: The formula used to build the model matrix: ['income + gender + age_group + _is_na_gender']
INFO (2026-04-09 17:20:21,988) [ipw/ipw (line 767)]: The number of columns in the model matrix: 16
INFO (2026-04-09 17:20:21,989) [ipw/ipw (line 768)]: The number of rows in the model matrix: 11000
INFO (2026-04-09 17:20:23,569) [ipw/ipw (line 990)]: Done with sklearn
INFO (2026-04-09 17:20:23,570) [ipw/ipw (line 992)]: max_de: 2
INFO (2026-04-09 17:20:23,571) [ipw/choose_regularization (line 368)]: Starting choosing regularisation parameters
INFO (2026-04-09 17:20:32,462) [ipw/choose_regularization (line 454)]: Best regularisation:
s s_index trim design_effect asmd_improvement asmd
9 0.009726 125 5.0 1.998665 0.711052 0.05646
INFO (2026-04-09 17:20:32,465) [ipw/ipw (line 1047)]: Chosen lambda: 0.009726392859944848
INFO (2026-04-09 17:20:32,466) [ipw/ipw (line 1065)]: Proportion null deviance explained 0.18189302029172694
INFO (2026-04-09 17:20:32,475) [adjustment/apply_transformations (line 433)]: Adding the variables: []
INFO (2026-04-09 17:20:32,476) [adjustment/apply_transformations (line 434)]: Transforming the variables: ['gender', 'age_group']
INFO (2026-04-09 17:20:32,481) [adjustment/apply_transformations (line 469)]: Final variables in output: ['gender', 'age_group']
INFO (2026-04-09 17:20:32,487) [rake/rake (line 279)]: Final covariates and levels that will be used in raking: {'age_group': ['18-24', '25-34', '35-44', '45+'], 'gender': ['Female', 'Male', '__NaN__']}.
After IPW only:
source            self  unadjusted  unadjusted - self
age_group     0.071234    0.232409           0.161175
gender        0.059885    0.253769           0.193884
income        0.189626    0.494217           0.304591
mean(asmd)    0.106915    0.326799           0.219883

After IPW + rake on gender & age_group:
source            self  unadjusted  unadjusted - self
age_group     0.196823    0.232409           0.035586
gender        0.204462    0.253769           0.049307
income        0.494080    0.494217           0.000138
mean(asmd)    0.298455    0.326799           0.028344

Total ASMD improvement (vs original): 8.67%

bf.is_adjusted = False
adjusted_final.is_adjusted = True
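The raking step above is iterative proportional fitting: weights are rescaled to match one variable's target margins, then the next, cycling until all margins converge. A minimal 2-D sketch with NumPy (toy counts and margins, not the tutorial's data):

```python
import numpy as np

# Sample cross-tab counts: rows = gender, columns = age bucket (toy numbers).
counts = np.array([[30.0, 10.0],
                   [20.0, 40.0]])
w = np.ones_like(counts)  # start from design weights of 1

total = counts.sum()
row_targets = np.array([0.5, 0.5]) * total  # desired row margins
col_targets = np.array([0.4, 0.6]) * total  # desired column margins

for _ in range(100):  # iterative proportional fitting
    w *= (row_targets / (counts * w).sum(axis=1))[:, None]  # fix row margins
    w *= (col_targets / (counts * w).sum(axis=0))[None, :]  # fix column margins

cell = counts * w
print(cell.sum(axis=1) / total)  # ≈ [0.5, 0.5]
print(cell.sum(axis=0) / total)  # ≈ [0.4, 0.6]
```

Because each pass only uses marginal totals, raking never needs the full target cross-tab, which is what makes it a good fine-tuning step after a model-based adjustment.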
Transformations¶
Transformations (one-hot encoding, interaction terms, etc.) are applied
automatically during adjust(). You can also pass custom transformations
via kwargs. To inspect the transformed design matrix, use model_matrix().
# Inspect the model matrix (transformed covariates) used during adjustment
print("Transformed covariates columns:")
print(adjusted.model_matrix().columns.tolist())
Transformed covariates columns:
['_is_na_gender[T.True]', 'age_group[T.25-34]', 'age_group[T.35-44]', 'age_group[T.45+]', 'gender[Female]', 'gender[Male]', 'gender[_NA]', 'income']
Diagnostics¶
diagnostics() returns a DataFrame with bias metrics per covariate.
print(adjusted.diagnostics().to_string())
INFO (2026-04-09 17:20:32,927) [balance_frame/diagnostics (line 1412)]: Starting computation of diagnostics of the fitting
INFO (2026-04-09 17:20:33,234) [balance_frame/diagnostics (line 1438)]: Done computing diagnostics
     metric                                        val  var
0    size                                  1000.000000  sample_obs
1    size                                     3.000000  sample_covars
2    size                                 10000.000000  target_obs
3    size                                     3.000000  target_covars
4    weights_diagnostics                      1.879989  design_effect
5    weights_diagnostics                      0.531918  effective_sample_proportion
6    weights_diagnostics                    531.918146  effective_sample_size
7    weights_diagnostics                  10000.000000  sum
8    weights_diagnostics                   1000.000000  describe_count
9    weights_diagnostics                      1.000000  describe_mean
10   weights_diagnostics                      0.938546  describe_std
11   weights_diagnostics                      0.304163  describe_min
12   weights_diagnostics                      0.445457  describe_25%
13   weights_diagnostics                      0.653173  describe_50%
14   weights_diagnostics                      1.166355  describe_75%
15   weights_diagnostics                     11.355142  describe_max
16   weights_diagnostics                      0.000000  prop(w < 0.1)
17   weights_diagnostics                      0.000000  prop(w < 0.2)
18   weights_diagnostics                      0.106000  prop(w < 0.333)
19   weights_diagnostics                      0.323000  prop(w < 0.5)
20   weights_diagnostics                      0.668000  prop(w < 1)
21   weights_diagnostics                      0.332000  prop(w >= 1)
22   weights_diagnostics                      0.096000  prop(w >= 2)
23   weights_diagnostics                      0.030000  prop(w >= 3)
24   weights_diagnostics                      0.011000  prop(w >= 5)
25   weights_diagnostics                      0.001000  prop(w >= 10)
26   weights_diagnostics                      0.369537  nonparametric_skew
27   weights_diagnostics                      0.214000  weighted_median_breakdown_point
28   weights_impact_on_outcome_mean_yw0      48.558814  happiness
29   weights_impact_on_outcome_mean_yw1      53.295272  happiness
30   weights_impact_on_outcome_mean_diff      4.736458  happiness
31   weights_impact_on_outcome_diff_ci_lower  1.312255  happiness
32   weights_impact_on_outcome_diff_ci_upper  8.160661  happiness
33   weights_impact_on_outcome_t_stat         2.714368  happiness
34   weights_impact_on_outcome_p_value        0.006755  happiness
35   weights_impact_on_outcome_n           1000.000000  happiness
36   adjustment_method                        0.000000  ipw
37   ipw_model_glance                         9.000000  n_iter_
38   ipw_model_glance                         0.138619  intercept_
39   ipw_penalty                              0.000000  deprecated
40   ipw_solver                               0.000000  lbfgs
41   model_glance                             0.000100  tol
42   model_glance                             0.000000  l1_ratio
43   ipw_multi_class                          0.000000  auto
44   model_glance                             0.041158  lambda
45   model_glance                             1.386294  null_deviance
46   model_glance                             1.146967  deviance
47   model_glance                             0.172638  prop_dev_explained
48   model_glance                             1.155558  cv_dev_mean
49   model_glance                             0.002568  lambda_min
50   model_glance                             1.141446  min_cv_dev_mean
51   model_glance                             0.014287  min_cv_dev_sd
52   model_coef                               0.138619  intercept
53   model_coef                               0.043944  _is_na_gender[T.True]
54   model_coef                              -0.203732  age_group[T.25-34]
55   model_coef                              -0.428683  age_group[T.35-44]
56   model_coef                              -0.529556  age_group[T.45+]
57   model_coef                               0.332490  gender[T.Male]
58   model_coef                               0.043944  gender[T._NA]
59   model_coef                               0.169578  income[Interval(-0.0009997440000000001, 0.44, closed='right')]
60   model_coef                               0.154197  income[Interval(0.44, 1.664, closed='right')]
61   model_coef                               0.111212  income[Interval(1.664, 3.472, closed='right')]
62   model_coef                              -0.041457  income[Interval(11.312, 15.139, closed='right')]
63   model_coef                              -0.161148  income[Interval(15.139, 20.567, closed='right')]
64   model_coef                              -0.211197  income[Interval(20.567, 29.504, closed='right')]
65   model_coef                              -0.357491  income[Interval(29.504, 128.536, closed='right')]
66   model_coef                               0.093738  income[Interval(3.472, 5.663, closed='right')]
67   model_coef                               0.072936  income[Interval(5.663, 8.211, closed='right')]
68   model_coef                               0.005787  income[Interval(8.211, 11.312, closed='right')]
69   covar_asmd_adjusted                      0.021777  age_group[T.25-34]
70   covar_asmd_adjusted                      0.055884  age_group[T.35-44]
71   covar_asmd_adjusted                      0.169816  age_group[T.45+]
72   covar_asmd_adjusted                      0.097916  gender[Female]
73   covar_asmd_adjusted                      0.103989  gender[Male]
74   covar_asmd_adjusted                      0.010578  gender[_NA]
75   covar_asmd_adjusted                      0.205469  income
76   covar_asmd_adjusted                      0.119597  mean(asmd)
77   covar_asmd_unadjusted                    0.005688  age_group[T.25-34]
78   covar_asmd_unadjusted                    0.312711  age_group[T.35-44]
79   covar_asmd_unadjusted                    0.378828  age_group[T.45+]
80   covar_asmd_unadjusted                    0.375699  gender[Female]
81   covar_asmd_unadjusted                    0.379314  gender[Male]
82   covar_asmd_unadjusted                    0.006296  gender[_NA]
83   covar_asmd_unadjusted                    0.494217  income
84   covar_asmd_unadjusted                    0.326799  mean(asmd)
85   covar_asmd_improvement                  -0.016090  age_group[T.25-34]
86   covar_asmd_improvement                   0.256827  age_group[T.35-44]
87   covar_asmd_improvement                   0.209013  age_group[T.45+]
88   covar_asmd_improvement                   0.277783  gender[Female]
89   covar_asmd_improvement                   0.275324  gender[Male]
90   covar_asmd_improvement                  -0.004282  gender[_NA]
91   covar_asmd_improvement                   0.288748  income
92   covar_asmd_improvement                   0.207202  mean(asmd)
93   covar_main_asmd_adjusted                 0.082492  age_group
94   covar_main_asmd_unadjusted               0.232409  age_group
95   covar_main_asmd_improvement              0.149917  age_group
96   covar_main_asmd_adjusted                 0.070828  gender
97   covar_main_asmd_unadjusted               0.253769  gender
98   covar_main_asmd_improvement              0.182942  gender
99   covar_main_asmd_adjusted                 0.205469  income
100  covar_main_asmd_unadjusted               0.494217  income
101  covar_main_asmd_improvement              0.288748  income
102  covar_main_asmd_adjusted                 0.119597  mean(asmd)
103  covar_main_asmd_unadjusted               0.326799  mean(asmd)
104  covar_main_asmd_improvement              0.207202  mean(asmd)
105  adjustment_failure                       0.000000  NaN
Exporting results¶
The .df property returns the responder DataFrame with id, covariates,
outcomes, weights, and any ignored columns. Use .to_csv() to export
the adjusted data.
print("Adjusted DataFrame columns:", adjusted.df.columns.tolist())
print(adjusted.df.head())
Adjusted DataFrame columns: ['id', 'gender', 'age_group', 'income', 'happiness', 'weight']

   id  gender age_group     income  happiness    weight
0   0    Male     25-34   6.428659  26.043029  6.531728
1   1  Female     18-24   9.940280  66.885485  9.617159
2   2    Male     18-24   2.673623  37.091922  3.562973
3   3     NaN     18-24  10.550308  49.394050  6.952117
4   4     NaN     18-24   2.689994  72.304208  5.129230
# Export to CSV (showing first 500 characters)
print(adjusted.to_csv()[:500])
id,gender,age_group,income,happiness,weight
0,Male,25-34,6.428659499046228,26.043028759747298,6.531727983159214
1,Female,18-24,9.940280228116047,66.88548460632677,9.617159404461365
2,Male,18-24,2.6736231547518043,37.091921916683006,3.562973405562926
3,,18-24,10.550307519418066,49.39405003271002,6.952116676608549
4,,18-24,2.689993854299385,72.30420755038209,5.1292302114666075
5,,35-44,5.995497722733131,57.28281646341816,16.424761754946537
6,,18-24,12.63469573898972,31.663293445944596,8.1911333259
adjusted.to_download()
Filtering rows/columns¶
keep_only_some_rows_columns() returns a new BalanceFrame with
filtered data — the original remains unchanged (immutable pattern).
filtered = adjusted.keep_only_some_rows_columns(
rows_to_keep="gender == 'Female'",
columns_to_keep=["gender", "age", "income"],
)
print(f"Original rows: {len(adjusted.responders)}")
print(f"Filtered rows: {len(filtered.responders)}")
print(filtered.df.head())
INFO (2026-04-09 17:20:33,282) [balance_frame/_filter_sf (line 1645)]: (rows_filtered/total_rows) = (268/1000)
INFO (2026-04-09 17:20:33,285) [balance_frame/_filter_sf (line 1645)]: (rows_filtered/total_rows) = (4551/10000)
INFO (2026-04-09 17:20:33,288) [balance_frame/_filter_sf (line 1645)]: (rows_filtered/total_rows) = (268/1000)
INFO (2026-04-09 17:20:33,291) [balance_frame/_filter_sf (line 1645)]: (rows_filtered/total_rows) = (268/268)
INFO (2026-04-09 17:20:33,293) [balance_frame/_filter_sf (line 1645)]: (rows_filtered/total_rows) = (4551/4551)
INFO (2026-04-09 17:20:33,295) [balance_frame/_filter_sf (line 1645)]: (rows_filtered/total_rows) = (268/268)
Original rows: 1000
Filtered rows: 268

   id  gender     income  happiness     weight
0   1  Female   9.940280  66.885485   9.617159
1  92  Female   0.185097  84.464522  17.392266
2  94  Female   1.183696  65.742184  17.794007
3  95  Female   3.716007  67.624539   7.283279
4  98  Female  16.751931  44.868651  48.725241
Summary: Old vs New API side-by-side¶
| Step | Old API (Sample) | New API (SampleFrame / BalanceFrame) |
|---|---|---|
| Load data | s = Sample.from_frame(df) | sf = SampleFrame.from_frame(df) |
| Pair sample + target | s.set_target(target) | bf = BalanceFrame(sample=sf, target=tf) |
| Adjust | adjusted = s.adjust() (mutates s) | adjusted = bf.adjust() (bf unchanged) |
| Summary | adjusted.summary() | adjusted.summary() |
| Diagnostics | adjusted.diagnostics() | adjusted.diagnostics() |
| Covariates | adjusted.covars().mean() | adjusted.covars().mean() |
| Design effect | adjusted.design_effect() | adjusted.weights().design_effect() |
| CSV export | adjusted.to_csv() | adjusted.to_csv() |
| Filter | (not available) | adjusted.keep_only_some_rows_columns(...) |