balance.weighting_methods.ipw

balance.weighting_methods.ipw.choose_regularization(fit, sample_df: DataFrame, target_df: DataFrame, sample_weights: Series, target_weights: Series, X_matrix_sample, balance_classes: bool, max_de: float = 1.5, trim_options: Tuple[int, int, int, float, float, float, float, float, float, float] = (20, 10, 5, 2.5, 1.25, 0.5, 0.25, 0.125, 0.05, 0.01), n_asmd_candidates: int = 10) → Dict[str, Any]

Searches through the regularization parameters of the model and the weight trimming levels to find the combination with the highest covariate ASMD reduction (in sample_df and target_df, NOT in the model matrix used for modeling the response), subject to the design effect being lower than max_de (defaults to 1.5). The function performs a grid search over the n_asmd_candidates (defaults to 10) models with the highest DE lower than max_de (assuming a higher DE means more bias reduction).

Parameters:
  • fit (_type_) – output of cvglmnet

  • sample_df (pd.DataFrame) – a dataframe representing the sample

  • target_df (pd.DataFrame) – a dataframe representing the target

  • sample_weights (pd.Series) – design weights for sample

  • target_weights (pd.Series) – design weights for target

  • X_matrix_sample (_type_) – the matrix that was used to construct the model

  • balance_classes (bool) – whether balance_classes was used in glmnet

  • max_de (float, optional) – upper bound for the design effect of the computed weights. Used for choosing the model regularization and trimming. If set to None, then it uses ‘lambda_1se’. Defaults to 1.5.

  • trim_options (Tuple[ int, int, int, float, float, float, float, float, float, float ], optional) – options for weight_trimming_mean_ratio. Defaults to (20, 10, 5, 2.5, 1.25, 0.5, 0.25, 0.125, 0.05, 0.01).

  • n_asmd_candidates (int, optional) – number of candidate models for the grid search. Defaults to 10.

Returns:

Dict of the value of the chosen lambda, the value of trimming, and the model description. Shape is:

{
    "best": {"s": best.s.values, "trim": best.trim.values[0]},
    "perf": all_perf,
}

Return type:

Dict[str, Any]
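The design-effect constraint used above can be illustrated with Kish's design effect formula. The sketch below is illustrative only: `kish_design_effect` is a hypothetical helper, not part of the library's API, and it assumes the bound on max_de is the standard Kish quantity n * sum(w²) / (sum w)².

```python
import numpy as np

def kish_design_effect(w):
    """Kish's design effect: n * sum(w^2) / (sum(w))^2.
    Equals 1 for equal weights; grows as weights become more variable."""
    w = np.asarray(w, dtype=float)
    return len(w) * np.sum(w ** 2) / np.sum(w) ** 2

# Equal weights give a design effect of exactly 1 (no variance inflation).
equal = np.ones(100)
print(kish_design_effect(equal))  # 1.0

# Highly variable weights inflate the design effect; a candidate model
# producing weights like these would be rejected under max_de = 1.5.
skewed = np.array([1.0] * 90 + [10.0] * 10)
print(kish_design_effect(skewed) > 1.5)  # True
```

This is why trimming levels are part of the grid: trimming extreme weights reduces the design effect at the cost of some bias reduction.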

balance.weighting_methods.ipw.cv_glmnet_performance(fit, feature_names: list | None = None, s: str | float | None = 'lambda_1se') → Dict[str, Any]

Extract elements from a cvglmnet fit to describe the quality of the fit.

Parameters:
  • fit (_type_) – output of cvglmnet

  • feature_names (Optional[list], optional) – the features whose coefficients should be included. None = all features are included. Defaults to None.

  • s (Union[str, float, None], optional) – lambda value for cvglmnet. Defaults to "lambda_1se".

Raises:

Exception – _description_

Returns:

Dict of the shape:

{
    "prop_dev_explained": fit["glmnet_fit"]["dev"][optimal_lambda_index],
    "mean_cv_error": fit["cvm"][optimal_lambda_index],
    "coefs": coefs,
}

Return type:

Dict[str, Any]
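The extraction mirrors the return shape above. A minimal sketch against a mock fit dict (the real cvglmnet output has many more fields, and `summarize_fit` is a hypothetical simplification, not the library function):

```python
def summarize_fit(fit, optimal_lambda_index):
    # Pull the proportion of deviance explained and the mean CV error
    # at the chosen lambda index, mirroring the documented return shape.
    return {
        "prop_dev_explained": fit["glmnet_fit"]["dev"][optimal_lambda_index],
        "mean_cv_error": fit["cvm"][optimal_lambda_index],
    }

# Mock fit dict with three lambda values along the regularization path.
mock_fit = {
    "glmnet_fit": {"dev": [0.1, 0.25, 0.3]},
    "cvm": [0.9, 0.7, 0.75],
}
print(summarize_fit(mock_fit, 1))
# {'prop_dev_explained': 0.25, 'mean_cv_error': 0.7}
```

Note that "lambda_1se" picks the largest lambda whose CV error is within one standard error of the minimum, which here need not be the index with the lowest raw error.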

balance.weighting_methods.ipw.ipw(sample_df: DataFrame, sample_weights: Series, target_df: DataFrame, target_weights: Series, variables: List[str] | None = None, model: str = 'glmnet', weight_trimming_mean_ratio: float | int | None = 20, weight_trimming_percentile: float | None = None, balance_classes: bool = True, transformations: str = 'default', na_action: str = 'add_indicator', max_de: float | None = None, formula: str | List[str] | None = None, penalty_factor: List[float] | None = None, one_hot_encoding: bool = False, random_seed: int = 2020, *args, **kwargs) → Dict[str, Any]

Fit an ipw (inverse propensity score weighting) for the sample using the target.

Parameters:
  • sample_df (pd.DataFrame) – a dataframe representing the sample

  • sample_weights (pd.Series) – design weights for sample

  • target_df (pd.DataFrame) – a dataframe representing the target

  • target_weights (pd.Series) – design weights for target

  • variables (Optional[List[str]], optional) – list of variables to include in the model. If None all joint variables of sample_df and target_df are used. Defaults to None.

  • model (str, optional) – the model used for modeling the propensity scores. "glmnet" is a logistic regression model. Defaults to "glmnet".

  • weight_trimming_mean_ratio (Optional[Union[int, float]], optional) – the ratio according to which weights are trimmed from above: weights are capped at mean(weights) * ratio. Defaults to 20.

  • weight_trimming_percentile (Optional[float], optional) – if weight_trimming_percentile is not None, winsorization is applied; if None, trimming is applied. Defaults to None.

  • balance_classes (bool, optional) – whether to balance the sample and target size for running the model. True is preferable for imbalanced cases. It is done to make the computation of the glmnet more efficient. It shouldn’t have an effect on the final weights as this is factored into the computation of the weights. TODO: add ref. Defaults to True.

  • transformations (str, optional) – what transformations to apply to data before fitting the model. See apply_transformations function. Defaults to “default”.

  • na_action (str, optional) – what to do with NAs. See add_na_indicator function. Defaults to “add_indicator”.

  • max_de (Optional[float], optional) – upper bound for the design effect of the computed weights. Used for choosing the model regularization and trimming. If set to None, then it uses 'lambda_1se'. Defaults to None.

  • formula (Union[str, List[str], None], optional) – the formula according to which the model is built. In case of a list of formulas, the model matrix is built in steps and concatenated together. Defaults to None.

  • penalty_factor (Optional[List[float]], optional) – the penalty used in the glmnet function in ipw. The penalty should have the same length as the formula list (and applies to each element of formula). A smaller penalty on a formula makes its elements more likely to be adjusted, i.e., to enter the model (and not be zeroed out). A penalty of 0 guarantees the element is included in the model. If not provided, the same penalty (1) is assumed for all variables. Defaults to None.

  • one_hot_encoding (bool, optional) – whether to encode all factor variables in the model matrix with almost_one_hot_encoding. This is recommended when using LASSO on the data. one_hot_encoding_greater_3 creates one-hot encoding for all categorical variables with more than 2 categories (i.e., the number of columns equals the number of categories), and only 1 column for variables with 2 levels (treatment contrast). Defaults to False.

  • random_seed (int, optional) – Random seed to use. Defaults to 2020.

Raises:
  • Exception – f”Sample indicator only has value {_n_unique}. This can happen when your sample or target are empty from unknown reason”

  • NotImplementedError – if model is not “glmnet”

Returns:

A dictionary that includes:

  • "weight" – the weights for the sample

  • "model" – parameters of the model: fit, performance, X_matrix_columns, lambda, weight_trimming_mean_ratio

Shape of the Dict:

{
    "weight": weights,
    "model": {
        "method": "ipw",
        "X_matrix_columns": X_matrix_columns_names,
        "fit": fit,
        "perf": performance,
        "lambda": best_s,
        "weight_trimming_mean_ratio": weight_trimming_mean_ratio,
    },
}

Return type:

Dict[str, Any]
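The core idea behind inverse propensity weighting can be sketched as follows. This is a simplified illustration with hypothetical names (the library estimates the propensities with glmnet and additionally applies transformations, trimming, and class balancing): if p_i = P(unit i is in the sample | x_i), the inverse propensity weight is proportional to the odds of being in the target, (1 - p_i) / p_i.

```python
import numpy as np

def ipw_weights_from_propensity(p_sample, target_total):
    """Convert estimated propensities P(in sample | x) into weights
    proportional to (1 - p) / p, normalized to the target's total weight.
    Hypothetical helper for illustration only."""
    w = (1.0 - p_sample) / p_sample
    return w * target_total / w.sum()

# Hypothetical propensity scores for three sample units.
p = np.array([0.5, 0.25, 0.1])
w = ipw_weights_from_propensity(p, target_total=100.0)

# Units that were unlikely to be sampled receive larger weights,
# and the weights sum to the target's total weight.
print(round(w.sum(), 6))  # 100.0
```

The weight_trimming_mean_ratio and max_de arguments then control how much such extreme weights (like the one for p = 0.1) are allowed to inflate the variance.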

balance.weighting_methods.ipw.weights_from_link(link, balance_classes: bool, sample_weights: Series, target_weights: Series, weight_trimming_mean_ratio: float | int | None = None, weight_trimming_percentile: float | None = None, keep_sum_of_weights: bool = True) → Series

Transform the output of cvglmnetPredict(…, type='link') into weights by exponentiating it, optionally balancing the classes and trimming the weights, then normalizing the weights so their sum equals the sum of the target weights.

Parameters:
  • link (Any) – output of cvglmnetPredict(…, type=’link’)

  • balance_classes (bool) – whether balance_classes was used in glmnet

  • sample_weights (pd.Series) – vector of sample weights

  • target_weights (pd.Series) – vector of target weights

  • weight_trimming_mean_ratio (Union[None, float, int], optional) – to be used in trim_weights(). Defaults to None.

  • weight_trimming_percentile (Optional[float], optional) – to be used in trim_weights(). Defaults to None.

  • keep_sum_of_weights (bool, optional) – to be used in trim_weights(). Defaults to True.

Returns:

A vector of weights normalized to the sum of the target weights

Return type:

pd.Series
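The transformation described above can be sketched as follows. This is a hypothetical simplification (class balancing and trimming omitted, and the helper name is not the library's): exponentiate the link, scale by the design weights, and normalize to the target's total weight.

```python
import numpy as np
import pandas as pd

def weights_from_link_sketch(link, sample_weights, target_weights):
    """Simplified sketch: exp(link) scaled by the sample design weights,
    normalized so the result sums to the total target weight."""
    w = np.exp(np.asarray(link, dtype=float)) * np.asarray(sample_weights, dtype=float)
    w = w * float(target_weights.sum()) / w.sum()
    return pd.Series(w)

link = [-0.5, 0.0, 1.2]                 # hypothetical cvglmnetPredict link output
sample_w = pd.Series([1.0, 1.0, 2.0])   # sample design weights
target_w = pd.Series([3.0, 4.0, 5.0])   # target design weights, summing to 12

w = weights_from_link_sketch(link, sample_w, target_w)
print(round(w.sum(), 6))  # 12.0 -- matches the sum of the target weights
```

In the real function, trim_weights() is applied before the final normalization, and keep_sum_of_weights controls whether trimming preserves the total.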