balance.weighting_methods.cbps

balance.weighting_methods.cbps.alpha_function(alpha: ndarray[Any, dtype[ScalarType]], beta: ndarray[Any, dtype[ScalarType]], X: ndarray[Any, dtype[ScalarType]] | DataFrame, design_weights: ndarray[Any, dtype[ScalarType]] | DataFrame, in_pop: ndarray[Any, dtype[ScalarType]] | DataFrame) float | ndarray[Any, dtype[ScalarType]][source]

This is a helper function for cbps. It computes the gmm loss of alpha*beta.

Parameters:
  • alpha (np.ndarray) – multiplication factor

  • beta (np.ndarray) – vector of coefficients

  • X (Union[np.ndarray, pd.DataFrame]) – covariates matrix

  • design_weights (Union[np.ndarray, pd.DataFrame]) – vector of design weights of sample and target

  • in_pop (Union[np.ndarray, pd.DataFrame]) – indicator vector for target

Returns:

loss (float) – computed GMM loss

Return type:

Union[float, np.ndarray]

balance.weighting_methods.cbps.bal_loss(beta: ndarray[Any, dtype[ScalarType]], X: ndarray[Any, dtype[ScalarType]], design_weights: ndarray[Any, dtype[ScalarType]], in_pop: ndarray[Any, dtype[ScalarType]], XtXinv: ndarray[Any, dtype[ScalarType]]) float64[source]

This is a helper function for cbps. It computes the balance loss.

Parameters:
  • beta (np.ndarray) – vector of coefficients

  • X (np.ndarray) – Covariate matrix

  • design_weights (np.ndarray) – vector of design weights of sample and target

  • in_pop (np.ndarray) – indicator vector for target

  • XtXinv (np.ndarray) – the inverse of X.T @ X, i.e. (X.T @ X)^(-1)

Returns:

computed balance loss

Return type:

np.float64

balance.weighting_methods.cbps.cbps(sample_df: DataFrame, sample_weights: Series, target_df: DataFrame, target_weights: Series, variables: List[str] | None = None, transformations: str = 'default', na_action: str = 'add_indicator', formula: str | List[str] | None = None, balance_classes: bool = True, cbps_method: str = 'over', max_de: float | None = None, opt_method: str = 'COBYLA', opt_opts: Dict | None = None, weight_trimming_mean_ratio: None | float | int = 20, weight_trimming_percentile: float | None = None, random_seed: int = 2020, *args, **kwargs) Dict[str, Series | Dict][source]

Fit a CBPS (covariate balancing propensity score) model for the sample using the target. Final weights are normalized to the target size. We use a two-step GMM estimator (as in the default of the R package) rather than the continuous-updating estimator suggested in the paper, because the two-step estimator runs much faster.

Paper: Imai, K., & Ratkovic, M. (2014). Covariate balancing propensity score. Journal of the Royal Statistical Society: Series B: Statistical Methodology, 243-263. https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/rssb.12027

R code source: https://github.com/kosukeimai/CBPS

Two-step GMM: https://en.wikipedia.org/wiki/Generalized_method_of_moments

Parameters:
  • sample_df (pd.DataFrame) – a dataframe representing the sample

  • sample_weights (pd.Series) – design weights for sample

  • target_df (pd.DataFrame) – a dataframe representing the target

  • target_weights (pd.Series) – design weights for target

  • variables (Optional[List[str]], optional) – list of variables to include in the model. If None all joint variables of sample_df and target_df are used. Defaults to None.

  • transformations (str, optional) – which transformations to apply to the data before fitting the model (see the apply_transformations function). Defaults to “default”.

  • na_action (str, optional) – what to do with NAs. (see add_na_indicator function). Defaults to “add_indicator”.

  • formula (Optional[Union[str, List[str]]], optional) – the formula according to which the model is built. If a list of formulas is given, the model matrix is built in steps and the pieces are concatenated. Defaults to None.

  • balance_classes (bool, optional) – whether to balance the sample and target size for running the model. True is preferable for imbalanced cases. Defaults to True.

  • cbps_method (str, optional) – method used for cbps. “over” fits an over-identified model that combines the propensity score and covariate balancing conditions; “exact” fits a model that contains only the covariate balancing conditions. Defaults to “over”.

  • max_de (Optional[float], optional) – upper bound for the design effect of the computed weights. Default is None.

  • opt_method (str, optional) – type of optimization solver. See scipy.optimize.minimize() for other options. Defaults to “COBYLA”.

  • opt_opts (Optional[Dict], optional) – a dictionary of solver options. See scipy.optimize.minimize() for the available options. Defaults to None.

  • weight_trimming_mean_ratio (Union[None, float, int], optional) – the ratio used for trimming from above: weights are trimmed at mean(weights) * ratio. Defaults to 20.

  • weight_trimming_percentile (Optional[float], optional) – if not None, winsorization is applied. Defaults to None, in which case trimming is applied instead.

  • random_seed (int, optional) – a random seed. Defaults to 2020.

Raises:
  • Exception – _description_

  • Exception – _description_

  • Exception – _description_

  • Exception – _description_

Returns:

A dictionary that includes:

  • “weight” – the weights for the sample.

  • “model” – a dictionary with details about the fitted model: X_matrix_columns, deviance, beta_optimal, balance_optimize_result, gmm_optimize_result_glm_init, gmm_optimize_result_bal_init.

The “model” entry has the following shape:

    “model”: {
        “method”: “cbps”,
        “X_matrix_columns”: X_matrix_columns_names,
        “deviance”: deviance,
        “original_sum_weights”: original_sum_weights,  # This can be used to reconstruct the propensity probabilities
        “beta_optimal”: beta_opt,
        “beta_init_glm”: beta_0,  # The initial estimator by glm
        “gmm_init”: gmm_init,  # The rescaled initial estimator
        # The following are the results of the optimizations
        “rescale_initial_result”: rescale_initial_result,
        “balance_optimize_result”: balance_optimize_result,
        “gmm_optimize_result_glm_init”: gmm_optimize_result_glm_init if cbps_method == “over” else None,
        “gmm_optimize_result_bal_init”: gmm_optimize_result_bal_init if cbps_method == “over” else None,
    },

Return type:

Dict[str, Union[pd.Series, Dict]]
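A minimal usage sketch, assuming the balance package is installed and that cbps can be imported from the module path given in the signature above. The toy data, the column names, and the helper names make_toy_data and fit_cbps_on_toy_data are illustrative and not part of the library. The import sits inside the fitting function so that the data-building step can be run on its own.

```python
import numpy as np
import pandas as pd


def make_toy_data(seed=0):
    """Build a small synthetic sample and target with unit design weights."""
    rng = np.random.default_rng(seed)
    sample_df = pd.DataFrame(
        {
            "age": rng.normal(40, 10, size=200),
            "income": rng.normal(50, 15, size=200),
        }
    )
    target_df = pd.DataFrame(
        {
            "age": rng.normal(45, 10, size=400),
            "income": rng.normal(55, 15, size=400),
        }
    )
    sample_weights = pd.Series(np.ones(len(sample_df)))
    target_weights = pd.Series(np.ones(len(target_df)))
    return sample_df, sample_weights, target_df, target_weights


def fit_cbps_on_toy_data():
    """Fit cbps on the toy data (requires the balance package)."""
    # Imported lazily so make_toy_data works without balance installed.
    from balance.weighting_methods.cbps import cbps

    sample_df, sample_weights, target_df, target_weights = make_toy_data()
    result = cbps(
        sample_df,
        sample_weights,
        target_df,
        target_weights,
        cbps_method="over",  # the documented default, spelled out
    )
    # Per the Returns section: "weight" holds the sample weights and
    # "model" holds details about the fitted model.
    return result["weight"], result["model"]
```

Calling fit_cbps_on_toy_data() should return the fitted sample weights together with the "model" dictionary described above; keys such as "beta_optimal" can then be inspected directly.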

balance.weighting_methods.cbps.compute_deff_from_beta(X: ndarray[Any, dtype[ScalarType]], beta: ndarray[Any, dtype[ScalarType]], design_weights: ndarray[Any, dtype[ScalarType]], in_pop: ndarray[Any, dtype[ScalarType]]) float64[source]

This is a helper function for cbps. It computes the design effect of the estimated weights on the sample given a value of beta. It is used for setting a constraint on max_de.

Parameters:
  • X (np.ndarray) – covariates matrix

  • beta (np.ndarray) – vector of coefficients

  • design_weights (np.ndarray) – vector of design weights of sample and target

  • in_pop (np.ndarray) – indicator vector for target

Returns:

design effect

Return type:

np.float64
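As a rough illustration of what a design effect measures, the sketch below uses Kish's formula, n * sum(w^2) / sum(w)^2. That the library uses exactly this formula is an assumption, and the function name kish_design_effect is hypothetical; the step that derives weights from beta is not shown.

```python
import numpy as np


def kish_design_effect(weights):
    """Kish's design effect: n * sum(w^2) / (sum(w))^2 (assumed formula)."""
    w = np.asarray(weights, dtype=float)
    return len(w) * np.sum(w**2) / np.sum(w) ** 2


# Equal weights carry no variance inflation; unequal weights push the value above 1.
deff_equal = kish_design_effect([1.0, 1.0, 1.0, 1.0])
deff_unequal = kish_design_effect([1.0, 1.0, 1.0, 3.0])
```

Under this definition, a max_de bound limits how unequal the fitted weights are allowed to become.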

balance.weighting_methods.cbps.compute_pseudo_weights_from_logit_probs(probs: ndarray[Any, dtype[ScalarType]], design_weights: ndarray[Any, dtype[ScalarType]] | DataFrame, in_pop: ndarray[Any, dtype[ScalarType]] | DataFrame) ndarray[Any, dtype[ScalarType]][source]

This is a helper function for cbps. Given computed probs, it computes the weights: N/N_t * (in_pop - p_i)/(1 - p_i). (Note that these weights are negative on the sample units, for notational convenience.)

Parameters:
  • probs (np.ndarray) – vector of probabilities

  • design_weights (Union[np.ndarray, pd.DataFrame]) – vector of design weights of sample and target

  • in_pop (Union[np.ndarray, pd.DataFrame]) – indicator vector for target

Returns:

np.ndarray of computed weights

Return type:

np.ndarray
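The formula above can be sketched in a few lines of NumPy. This assumes that N is the sum of all design weights and N_t the sum of the design weights of the target units; both readings are assumptions, as is the function name pseudo_weights.

```python
import numpy as np


def pseudo_weights(probs, design_weights, in_pop):
    """Compute N/N_t * (in_pop - p_i) / (1 - p_i) element-wise (illustrative)."""
    probs = np.asarray(probs, dtype=float)
    design_weights = np.asarray(design_weights, dtype=float)
    in_pop = np.asarray(in_pop, dtype=float)
    n_total = design_weights.sum()  # assumed meaning of N
    n_target = design_weights[in_pop == 1.0].sum()  # assumed meaning of N_t
    return n_total / n_target * (in_pop - probs) / (1.0 - probs)


probs = np.array([0.5, 0.25, 0.5, 0.75])
design_weights = np.ones(4)
in_pop = np.array([0.0, 0.0, 1.0, 1.0])  # first two units are sample, last two are target
w = pseudo_weights(probs, design_weights, in_pop)
```

With in_pop = 1 the fraction reduces to 1, so every target unit receives N/N_t, while sample units (in_pop = 0) receive the negative value -N/N_t * p_i/(1 - p_i).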

balance.weighting_methods.cbps.gmm_function(beta: ndarray[Any, dtype[ScalarType]], X: ndarray[Any, dtype[ScalarType]] | DataFrame, design_weights: ndarray[Any, dtype[ScalarType]] | DataFrame, in_pop: ndarray[Any, dtype[ScalarType]] | DataFrame, invV: ndarray[Any, dtype[ScalarType]] | None = None) Dict[str, float | ndarray[Any, dtype[ScalarType]]][source]

This is a helper function for cbps. It computes the gmm loss.

Parameters:
  • beta (np.ndarray) – vector of coefficients

  • X (Union[np.ndarray, pd.DataFrame]) – covariates matrix

  • design_weights (Union[np.ndarray, pd.DataFrame]) – vector of design weights of sample and target

  • in_pop (Union[np.ndarray, pd.DataFrame]) – indicator vector for target

  • invV (Union[np.ndarray, None], optional) – the inverse weighting matrix for GMM. Default is None.

Returns:

Dict with two items, loss and invV:

  • loss (float) – computed GMM loss

  • invV (np.ndarray) – the weighting matrix for GMM

Return type:

Dict[str, Union[float, np.ndarray]]
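A GMM loss is a quadratic form gbar' * invV * gbar over a vector of averaged moment conditions. The sketch below is a simplified illustration, not the library's implementation: it assumes the moment vector stacks the logistic score condition and the covariate balancing condition (as the “over” method is described above), and it falls back to an identity matrix when invV is None, whereas the library estimates a weighting matrix. The name gmm_function_sketch is hypothetical.

```python
import numpy as np


def gmm_function_sketch(beta, X, design_weights, in_pop, invV=None):
    """Quadratic-form GMM loss over stacked moment conditions (illustrative)."""
    X = np.asarray(X, dtype=float)
    beta = np.asarray(beta, dtype=float)
    design_weights = np.asarray(design_weights, dtype=float)
    in_pop = np.asarray(in_pop, dtype=float)

    # Truncated logistic probabilities of belonging to the target.
    probs = np.clip(1.0 / (1.0 + np.exp(-(X @ beta))), 1e-5, 1.0 - 1e-5)

    n_total = design_weights.sum()
    n_target = design_weights[in_pop == 1.0].sum()
    pseudo_w = n_total / n_target * (in_pop - probs) / (1.0 - probs)

    # Assumed moment conditions: logistic score and covariate balance.
    score = X.T @ (design_weights * (in_pop - probs)) / n_total
    balance = X.T @ (design_weights * pseudo_w) / n_total
    gbar = np.concatenate([score, balance])

    if invV is None:
        # Placeholder: identity weighting, NOT the library's estimate.
        invV = np.eye(gbar.size)

    return {"loss": float(gbar @ invV @ gbar), "invV": invV}


X = np.column_stack([np.ones(6), [0.1, -0.4, 0.9, 1.2, -0.3, 0.5]])
in_pop = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
design_weights = np.ones(6)
out = gmm_function_sketch(np.zeros(2), X, design_weights, in_pop)
```

Returning invV alongside the loss mirrors the documented return value: a matrix computed in a first call can be passed back in on later calls, which is the pattern a two-step estimator relies on.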

balance.weighting_methods.cbps.gmm_loss(beta: ndarray[Any, dtype[ScalarType]], X: ndarray[Any, dtype[ScalarType]] | DataFrame, design_weights: ndarray[Any, dtype[ScalarType]] | DataFrame, in_pop: ndarray[Any, dtype[ScalarType]] | DataFrame, invV: ndarray[Any, dtype[ScalarType]] | None = None) float | ndarray[Any, dtype[ScalarType]][source]

This is a helper function for cbps. It computes the gmm loss. See gmm_function for details.

Parameters:
  • beta (np.ndarray) – vector of coefficients

  • X (Union[np.ndarray, pd.DataFrame]) – covariates matrix

  • design_weights (Union[np.ndarray, pd.DataFrame]) – vector of design weights of sample and target

  • in_pop (Union[np.ndarray, pd.DataFrame]) – indicator vector for target

  • invV (Union[np.ndarray, None], optional) – the inverse weighting matrix for GMM. Default is None.

Returns:

loss (float) – computed GMM loss

Return type:

Union[float, np.ndarray]

balance.weighting_methods.cbps.logit_truncated(X: ndarray[Any, dtype[ScalarType]] | DataFrame, beta: ndarray[Any, dtype[ScalarType]], truncation_value: float = 1e-05) ndarray[Any, dtype[ScalarType]][source]

This is a helper function for cbps. Given an X matrix and a vector of coefficients beta, it computes the truncated version of the logit function.

Parameters:
  • X (Union[np.ndarray, pd.DataFrame]) – Covariate matrix

  • beta (np.ndarray) – vector of coefficients

  • truncation_value (float, optional) – upper and lower bound for the computed probabilities. Defaults to 1e-5.

Returns:

numpy array of computed probabilities

Return type:

np.ndarray
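A minimal sketch of a truncated logistic transform, assuming the probabilities are computed as 1 / (1 + exp(-X @ beta)) and then clipped to [truncation_value, 1 - truncation_value]. The function name truncated_logistic is hypothetical, and the library's exact implementation may differ.

```python
import numpy as np


def truncated_logistic(X, beta, truncation_value=1e-5):
    """Logistic probabilities clipped away from 0 and 1 (illustrative)."""
    linear = np.asarray(X, dtype=float) @ np.asarray(beta, dtype=float)
    probs = 1.0 / (1.0 + np.exp(-linear))
    return np.clip(probs, truncation_value, 1.0 - truncation_value)


X = np.array([[1.0, 0.0], [1.0, 50.0], [1.0, -50.0]])
beta = np.array([0.0, 1.0])
p = truncated_logistic(X, beta)
```

Clipping keeps later divisions by (1 - p_i) finite: the two extreme rows above end up exactly at the upper and lower bounds instead of at 1 and 0.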