balance.weighting_methods.ipw

balance.weighting_methods.ipw.calc_dev(X_matrix: csr_matrix, y: ndarray, model: ClassifierMixin, model_weights: ndarray, foldids: ndarray) → Tuple[float, float]

Calculates holdout deviance using 10-fold cross-validation.

Parameters:
  • X_matrix (csr_matrix) – Model matrix,

  • y (np.ndarray) – Vector of sample inclusion (1=sample, 0=target),

  • model (ClassifierMixin) – LogisticRegression object from sklearn,

  • model_weights (np.ndarray) – Vector of sample and target weights,

  • foldids (np.ndarray) – Vector of cross-validation fold indices.

Returns:

Mean and standard deviation of the holdout deviance.

Return type:

float, float

Examples:

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.linear_model import LogisticRegression
from balance.weighting_methods.ipw import calc_dev

X = csr_matrix(np.arange(10).reshape(10, 1))
y = np.array([0, 1] * 5)
model = LogisticRegression(max_iter=1000)
weights = np.ones(10)
foldids = np.arange(10)
dev_mean, dev_sd = calc_dev(X, y, model, weights, foldids)
dev_mean > 0  # True
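To make the idea of holdout deviance concrete, here is a minimal sketch that does not call balance at all. It assumes the deviance of a fold is twice the weighted negative log-likelihood on the held-out rows; the helper name holdout_deviance_sketch is hypothetical, and the details of calc_dev itself may differ.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss


def holdout_deviance_sketch(X, y, model, weights, foldids):
    """Hypothetical helper: mean and sd of per-fold holdout deviance."""
    deviances = []
    for fold in np.unique(foldids):
        test = np.flatnonzero(foldids == fold)
        train = np.flatnonzero(foldids != fold)
        # Fit a fresh copy of the model on everything except the held-out fold.
        fitted = clone(model).fit(X[train], y[train], sample_weight=weights[train])
        proba = fitted.predict_proba(X[test])
        # Assumption: deviance = 2 * weighted negative log-likelihood.
        nll = log_loss(
            y[test], proba, sample_weight=weights[test], normalize=False, labels=[0, 1]
        )
        deviances.append(2.0 * nll)
    return float(np.mean(deviances)), float(np.std(deviances, ddof=1))


X = csr_matrix(np.arange(40, dtype=float).reshape(40, 1))
y = np.array([0, 1] * 20)
weights = np.ones(40)
foldids = np.arange(40) % 5  # five folds, each with both classes in training
dev_mean, dev_sd = holdout_deviance_sketch(
    X, y, LogisticRegression(max_iter=1000), weights, foldids
)
```

Lower mean holdout deviance indicates better out-of-sample fit, which is why it is a natural criterion for comparing regularization strengths.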

balance.weighting_methods.ipw.choose_regularization(links: List[Any], lambdas: ndarray, sample_df: DataFrame, target_df: DataFrame, sample_weights: Series, target_weights: Series, balance_classes: bool, max_de: float = 1.5, trim_options: Tuple[int, int, int, float, float, float, float, float, float, float] = (20, 10, 5, 2.5, 1.25, 0.5, 0.25, 0.125, 0.05, 0.01), n_asmd_candidates: int = 10) → Dict[str, Any]

Searches over the model's regularization parameters and weight-trimming levels for the combination with the highest covariate ASMD reduction (measured in sample_df and target_df, NOT in the model matrix used for modeling the response), subject to the design effect being lower than max_de (defaults to 1.5). The function performs a grid search over the n_asmd_candidates (defaults to 10) models with the highest design effect (DE) below max_de, on the assumption that a higher DE means more bias reduction.

Parameters:
  • links (List[Any]) – list of link predictions from sklearn

  • lambdas (np.ndarray) – the lambda values for regularization

  • sample_df (pd.DataFrame) – a dataframe representing the sample

  • target_df (pd.DataFrame) – a dataframe representing the target

  • sample_weights (pd.Series) – design weights for sample

  • target_weights (pd.Series) – design weights for target

  • balance_classes (bool) – whether balance_classes was used

  • max_de (float, optional) – upper bound for the design effect of the computed weights. Used for choosing the model regularization and trimming. If set to None, then it uses ‘lambda_1se’. Defaults to 1.5.

  • trim_options (Tuple[ int, int, int, float, float, float, float, float, float, float ], optional) – options for weight_trimming_mean_ratio. Defaults to (20, 10, 5, 2.5, 1.25, 0.5, 0.25, 0.125, 0.05, 0.01).

  • n_asmd_candidates (int, optional) – number of candidates for grid search. Defaults to 10.

Returns:

Dict of the value of the chosen lambda, the value of trimming, model description.
Shape is
{
    "best": {"s": best.s.values, "trim": best.trim.values[0]},
    "perf": all_perf,
}

Return type:

Dict[str, Any]

Examples:

import numpy as np
import pandas as pd
from balance.weighting_methods.ipw import choose_regularization

links = [np.zeros(2)]
lambdas = np.array([1.0])
sample_df = pd.DataFrame({"x": [0, 1]})
target_df = pd.DataFrame({"x": [0, 1]})
sample_weights = pd.Series([1.0, 1.0])
target_weights = pd.Series([1.0, 1.0])
result = choose_regularization(
    links,
    lambdas,
    sample_df,
    target_df,
    sample_weights,
    target_weights,
    False,
    max_de=2.0,
    trim_options=(1.0,),
    n_asmd_candidates=1,
)
sorted(result.keys())  # ['best', 'perf']
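The max_de constraint used in the search can be illustrated with a small standalone sketch. It assumes the design effect is Kish's design effect, n * sum(w^2) / sum(w)^2; the function name kish_design_effect and the candidate weight vectors are hypothetical and only serve the illustration.

```python
import numpy as np


def kish_design_effect(weights):
    """Kish's design effect: n * sum(w^2) / (sum(w))^2 (assumed definition)."""
    w = np.asarray(weights, dtype=float)
    return len(w) * np.sum(w**2) / np.sum(w) ** 2


max_de = 1.5

# Hypothetical candidate weight vectors, e.g. from different lambda/trim settings.
candidates = {
    "equal": [1, 1, 1, 1],
    "mild": [1, 1, 1, 3],
    "extreme": [1, 1, 1, 20],
}

design_effects = {name: kish_design_effect(w) for name, w in candidates.items()}

# Only candidates whose design effect is below max_de remain eligible.
admissible = {name: de for name, de in design_effects.items() if de < max_de}
```

Equal weights give a design effect of exactly 1, and the value grows as the weights become more uneven, so capping it limits how much variance the weighting is allowed to add.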

balance.weighting_methods.ipw.ipw(sample_df: DataFrame, sample_weights: Series, target_df: DataFrame, target_weights: Series, variables: list[str] | None = None, model: str | ClassifierMixin | None = 'sklearn', weight_trimming_mean_ratio: int | float | None = 20, weight_trimming_percentile: float | None = None, balance_classes: bool = True, transformations: str | None = 'default', na_action: str = 'add_indicator', max_de: float | None = None, lambda_min: float = 1e-05, lambda_max: float = 10, num_lambdas: int = 250, formula: str | list[str] | None = None, penalty_factor: list[float] | None = None, one_hot_encoding: bool = False, use_model_matrix: bool = True, random_seed: int = 2020, *args: Any, **kwargs: Any) → Dict[str, Any]

Fit an ipw (inverse propensity score weighting) for the sample using the target.

Args:

sample_df (pd.DataFrame): a dataframe representing the sample.

sample_weights (pd.Series): design weights for the sample.

target_df (pd.DataFrame): a dataframe representing the target.

target_weights (pd.Series): design weights for the target.

variables (Optional[List[str]], optional): list of variables to include in the model. If None, all joint variables of sample_df and target_df are used. Defaults to None.

model (Union[str, ClassifierMixin, None], optional): Model used for modeling the propensity scores. Provide "sklearn" (default) or None to use logistic regression, or pass an sklearn classifier implementing fit and predict_proba. Common choices include scikit-learn estimators such as sklearn.linear_model.LogisticRegression, sklearn.ensemble.RandomForestClassifier, sklearn.ensemble.GradientBoostingClassifier, sklearn.ensemble.HistGradientBoostingClassifier, and sklearn.linear_model.SGDClassifier configured with loss="log_loss". To customize the built-in logistic regression settings, pass a configured sklearn.linear_model.LogisticRegression instance as model. Custom classifiers should expose a predict_proba method returning class probabilities.

weight_trimming_mean_ratio (Optional[Union[int, float]], optional): the ratio used to trim the weights from above: weights are trimmed at mean(weights) * ratio. Defaults to 20.

weight_trimming_percentile (Optional[float], optional): if weight_trimming_percentile is not None, winsorization is applied. If None, trimming is applied. Defaults to None.

balance_classes (bool, optional): whether to balance the sample and target size for running the model. True is preferable for imbalanced cases. It shouldn't have an effect on the final weights, as this is factored into the computation of the weights. Defaults to True.

transformations (str, optional): what transformations to apply to the data before fitting the model. See the apply_transformations function. Defaults to "default".

na_action (str, optional): what to do with NAs. See the add_na_indicator function. Defaults to "add_indicator".

max_de (Optional[float], optional): upper bound for the design effect of the computed weights. Used for choosing the model regularization and trimming. If set to None, then it uses 'lambda_1se'. Defaults to None.

formula (Union[str, List[str], None], optional): the formula according to which the model matrix is built. If a list of formulas is given, the model matrix is built in steps and concatenated together. Defaults to None.

penalty_factor (Optional[List[float]], optional): the penalty factors used in ipw. The penalty should have the same length as the formula list (and applies to each element of formula). A smaller penalty on a formula leads to the elements in that formula being adjusted more, i.e. having a higher chance of getting into the model (and not being zeroed out). A penalty of 0 makes sure the element is included in the model. If not provided, the same penalty (1) is assumed for all variables. Defaults to None.

one_hot_encoding (bool, optional): whether to encode all factor variables in the model matrix with almost_one_hot_encoding. This is recommended when using LASSO on the data. one_hot_encoding_greater_3 creates one-hot-encoding for all categorical variables with more than 2 categories (i.e. the number of columns will be equal to the number of categories), and only 1 column for variables with 2 levels (treatment contrast). Defaults to False.

use_model_matrix (bool, optional): whether to build the model matrix using model_matrix(). When set to False, the model is fit on the raw covariate data after applying transformations and adding NA indicators. String, object, and boolean columns are converted to pandas Categorical dtype, which allows sklearn estimators that support native categorical features (e.g., HistGradientBoostingClassifier with categorical_features="from_dtype") to handle them correctly. Requires scikit-learn >= 1.4 when categorical columns are present; a ValueError is raised on older versions. The built-in logistic-regression path (i.e., when model is None or the default) currently requires use_model_matrix=True and will raise if use_model_matrix=False. Defaults to True.

random_seed (int, optional): Random seed to use. Defaults to 2020.

Examples:

Example 1: Using HistGradientBoostingClassifier with native categorical support

import pandas as pd
from balance.datasets import load_sim_data
from balance.weighting_methods.ipw import ipw
from sklearn.ensemble import HistGradientBoostingClassifier

# Load simulated data
target_df, sample_df = load_sim_data()

# Assign weights
sample_weights = pd.Series(1, index=sample_df.index)
target_weights = pd.Series(1, index=target_df.index)

# Define model (categorical_features="from_dtype" tells sklearn
# to treat pandas Categorical columns as unordered categoricals)
hgb = HistGradientBoostingClassifier(
    random_state=0, categorical_features="from_dtype"
)

# Run IPW with sklearn model on raw covariates
result_hgb = ipw(
    sample_df,
    sample_weights,
    target_df,
    target_weights,
    variables=["gender", "age_group", "income"],
    model=hgb,
    use_model_matrix=False,
)
print("HistGradientBoostingClassifier result:")
print(result_hgb)

Output (values will vary by model and random seed):

HistGradientBoostingClassifier result:
{'weight': 0      ...
1      ...
...
999    ...
Length: 1000, dtype: float64, 'model': {'method': 'ipw', ...}}

Example 2: Using default sklearn model (LogisticRegression)

# Run IPW with default sklearn model
result_sklearn = ipw(
    sample_df,
    sample_weights,
    target_df,
    target_weights,
    variables=["gender", "age_group", "income"],
    model="sklearn",
)
print("Default sklearn model result:")
print(result_sklearn)

Output:

Default sklearn model result:
{'weight': 0       6.531728
1       9.617159
2       3.562973
3       6.952117
4       5.129230
         ...
995     9.353052
996     3.973554
997     7.095483
998    11.331144
999     7.913133
Length: 1000, dtype: float64, 'model': {'method': 'ipw', ...}}

Example 3: Comparing weights from different models

import pandas as pd
# Combine weights into a DataFrame
comparison_df = pd.DataFrame({
    'HGB': result_hgb['weight'],
    'Sklearn': result_sklearn['weight']
})
# Calculate difference
comparison_df['Difference'] = comparison_df['HGB'] - comparison_df['Sklearn']
# Print summary statistics
print("\nMean difference:", comparison_df['Difference'].mean())
print("Std difference:", comparison_df['Difference'].std())

Output (values will vary by model and random seed):

Mean difference: ...
Std difference: ...
Raises:

Exception: f"Sample indicator only has value {_n_unique}. This can happen when your sample or target are empty from unknown reason"

NotImplementedError: If model is a string other than "sklearn" (the built-in logistic regression option) or the deprecated "glmnet".

TypeError: If model is neither a supported string nor an sklearn classifier exposing predict_proba.

Returns:

Dict[str, Any]: A dictionary that includes:

"weight": the weights for the sample.
"model": parameters of the model: fit, performance, X_matrix_columns, lambda, weight_trimming_mean_ratio.

Shape of the Dict:
{
    "weight": weights,
    "model": {
        "method": "ipw",
        "X_matrix_columns": X_matrix_columns_names,
        "fit": fit,
        "perf": performance,
        "lambda": best_s,
        "weight_trimming_mean_ratio": weight_trimming_mean_ratio,
    },
}

balance.weighting_methods.ipw.link_transform(pred: ndarray) → ndarray

Transforms probabilities into log odds (link function).

Parameters:

pred (np.ndarray) – LogisticRegression probability predictions from sklearn.

Returns:

Array of log odds.

Return type:

np.ndarray

Examples:

import numpy as np
from balance.weighting_methods.ipw import link_transform

float(link_transform(np.array([0.5]))[0])  # 0.0
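For reference, the log-odds (logit) transform itself is log(p / (1 - p)). The standalone sketch below uses only NumPy; the helper names logit_sketch and inverse_logit are hypothetical and are not part of balance.

```python
import numpy as np


def logit_sketch(p):
    """Log odds of a probability: log(p / (1 - p))."""
    p = np.asarray(p, dtype=float)
    return np.log(p / (1.0 - p))


def inverse_logit(z):
    """Map log odds back to probabilities."""
    return 1.0 / (1.0 + np.exp(-np.asarray(z, dtype=float)))


probs = np.array([0.1, 0.5, 0.9])
log_odds = logit_sketch(probs)       # 0.5 maps to 0.0; 0.1 and 0.9 are symmetric
recovered = inverse_logit(log_odds)  # round-trips back to probs
```

Probabilities of exactly 0 or 1 map to infinite log odds, so inputs should lie strictly between the two.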

balance.weighting_methods.ipw.model_coefs(model: ClassifierMixin, feature_names: list[str] | None = None) → Dict[str, Any]

Extract coefficient-like information from sklearn classifiers.

For linear models such as LogisticRegression, this returns the fitted coefficients (and intercept when available). For classifiers that do not expose a coef_ attribute (e.g. tree ensembles), an empty pandas.Series is returned so downstream diagnostics can handle the absence of coefficients gracefully.

Parameters:
  • model (ClassifierMixin) – Fitted sklearn classifier.

  • feature_names (Optional[list], optional) – Feature names associated with the model matrix columns. When provided and the model exposes a one-dimensional coef_ array, the returned Series is indexed by ["intercept"] + feature_names.

Returns:

Dictionary containing a coefs key with a pandas.Series of coefficients (which may be empty when the model does not expose linear coefficients).

Return type:

Dict[str, Any]

Examples:

import numpy as np
from sklearn.linear_model import LogisticRegression
from balance.weighting_methods.ipw import model_coefs

X = np.array([[0.0], [1.0]])
y = np.array([0, 1])
model = LogisticRegression().fit(X, y)
"coefs" in model_coefs(model)  # True
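The behavior described above (coefficients for linear models, an empty Series otherwise) can be sketched with plain scikit-learn and pandas. The helper coefs_sketch is hypothetical and only handles the binary, one-dimensional case; it is not the implementation of model_coefs.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier


def coefs_sketch(model, feature_names=None):
    """Hypothetical helper: intercept + coefficients, or an empty Series."""
    if not hasattr(model, "coef_"):
        # Tree-based models expose no linear coefficients.
        return {"coefs": pd.Series(dtype=float)}
    values = np.concatenate([np.ravel(model.intercept_), np.ravel(model.coef_)])
    index = (["intercept"] + list(feature_names)) if feature_names is not None else None
    return {"coefs": pd.Series(values, index=index)}


X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 0.0]])
y = np.array([0, 0, 1, 1])

linear = coefs_sketch(LogisticRegression().fit(X, y), feature_names=["x1", "x2"])
tree = coefs_sketch(DecisionTreeClassifier(random_state=0).fit(X, y))
```

Returning an empty Series instead of raising lets downstream diagnostics treat linear and non-linear classifiers uniformly.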

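balance.weighting_methods.ipw.weights_from_link(link: Any, balance_classes: bool, sample_weights: Series, target_weights: Series, weight_trimming_mean_ratio: None | float | int = None, weight_trimming_percentile: float | None = None, keep_sum_of_weights: bool = True) → Series

Transforms link predictions into weights by exponentiating them, optionally balancing the classes and trimming the weights, and then normalizing the weights so that they sum to the sum of the target weights.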

Parameters:
  • link (Any) – link predictions

  • balance_classes (bool) – whether balance_classes was used

  • sample_weights (pd.Series) – vector of sample weights

  • target_weights (pd.Series) – vector of target weights

  • weight_trimming_mean_ratio (Union[None, float, int], optional) – to be used in trim_weights(). Defaults to None.

  • weight_trimming_percentile (Optional[float], optional) – to be used in trim_weights(). Defaults to None.

  • keep_sum_of_weights (bool, optional) – to be used in trim_weights(). Defaults to True.

Returns:

A vector of normalized weights (summing to the sum of the target weights)

Return type:

pd.Series

Examples:

import numpy as np
import pandas as pd
from balance.weighting_methods.ipw import weights_from_link

link = np.zeros(2)
sample_weights = pd.Series([1.0, 1.0])
target_weights = pd.Series([1.0, 1.0])
weights_from_link(link, False, sample_weights, target_weights).tolist()  # [1.0, 1.0]
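The exponentiate, trim, and normalize steps can be sketched without balance as follows. This is an illustration under stated assumptions: the link is taken to be the log odds of sample membership, so the raw weight is the design weight divided by exp(link); class balancing is omitted; and trimming is a simple cap at mean(weights) * ratio. The helper weights_from_link_sketch is hypothetical, and trim_weights() in balance may behave differently.

```python
import numpy as np
import pandas as pd


def weights_from_link_sketch(link, sample_weights, target_weights, mean_ratio=None):
    """Hypothetical helper: exponentiate, optionally trim, then normalize."""
    # Assumption: link is the log odds of being in the sample.
    raw = sample_weights / np.exp(np.asarray(link, dtype=float))
    if mean_ratio is not None:
        # Assumed trimming rule: cap weights at mean(weights) * ratio.
        raw = raw.clip(upper=raw.mean() * mean_ratio)
    # Rescale so the weights sum to the sum of the target weights.
    return raw * target_weights.sum() / raw.sum()


link = np.array([0.0, np.log(3.0)])
sample_weights = pd.Series([1.0, 1.0])
target_weights = pd.Series([2.0, 2.0])

untrimmed = weights_from_link_sketch(link, sample_weights, target_weights)
trimmed = weights_from_link_sketch(link, sample_weights, target_weights, mean_ratio=1.2)
```

In this sketch the unit that is three times more likely to appear in the sample receives one third of the weight before normalization, and trimming pulls the largest weight down while the total still matches the target.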