balance.weighting_methods.cbps
- balance.weighting_methods.cbps.alpha_function(alpha: ndarray[Any, dtype[ScalarType]], beta: ndarray[Any, dtype[ScalarType]], X: ndarray[Any, dtype[ScalarType]] | DataFrame, design_weights: ndarray[Any, dtype[ScalarType]] | DataFrame, in_pop: ndarray[Any, dtype[ScalarType]] | DataFrame) → float | ndarray[Any, dtype[ScalarType]]
This is a helper function for cbps. It computes the GMM loss evaluated at the rescaled coefficient vector alpha * beta.
- Parameters:
alpha (np.ndarray) – multiplication factor
beta (np.ndarray) – vector of coefficients
X (Union[np.ndarray, pd.DataFrame]) – covariates matrix
design_weights (Union[np.ndarray, pd.DataFrame]) – vector of design weights of sample and target
in_pop (Union[np.ndarray, pd.DataFrame]) – indicator vector for target
- Returns:
loss (float) – computed GMM loss
- Return type:
Union[float, np.ndarray]
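To illustrate the idea of evaluating a loss at alpha * beta, here is a minimal sketch. The names alpha_loss_sketch and toy_loss are hypothetical, the quadratic loss is a stand-in for the GMM loss, and the grid search stands in for whatever optimizer the library actually uses to rescale the initial estimator.

```python
import numpy as np


def alpha_loss_sketch(alpha, beta, loss_fn):
    """Evaluate a loss at the rescaled coefficient vector alpha * beta."""
    return loss_fn(alpha * beta)


def toy_loss(beta):
    """Hypothetical stand-in loss with its minimum at beta = [2, 4]."""
    return float(np.sum((beta - np.array([2.0, 4.0])) ** 2))


beta_init = np.array([1.0, 2.0])
alphas = np.linspace(0.5, 3.0, 26)  # coarse grid over the scale factor
losses = [alpha_loss_sketch(a, beta_init, toy_loss) for a in alphas]
best_alpha = float(alphas[int(np.argmin(losses))])
gmm_init = best_alpha * beta_init  # rescaled initial estimator
```

Only the scalar alpha is searched here, so the direction of beta_init is kept and only its magnitude changes.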
- balance.weighting_methods.cbps.bal_loss(beta: ndarray[Any, dtype[ScalarType]], X: ndarray[Any, dtype[ScalarType]], design_weights: ndarray[Any, dtype[ScalarType]], in_pop: ndarray[Any, dtype[ScalarType]], XtXinv: ndarray[Any, dtype[ScalarType]]) → float64
This is a helper function for cbps. It computes the balance loss.
- Parameters:
beta (np.ndarray) – vector of coefficients
X (np.ndarray) – Covariate matrix
design_weights (np.ndarray) – vector of design weights of sample and target
in_pop (np.ndarray) – indicator vector for target
XtXinv (np.ndarray) – (X.T @ X)^(-1), the inverse of the Gram matrix of X
- Returns:
computed balance loss
- Return type:
np.float64
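One way to picture a balance loss of this kind is as a quadratic form in the weighted covariate imbalance between target and sample. The sketch below is a simplified assumption, not the library's implementation: bal_loss_sketch is a hypothetical name, and the exact way design weights enter the real function may differ.

```python
import numpy as np


def bal_loss_sketch(beta, X, design_weights, in_pop, XtXinv):
    """Quadratic form of the weighted covariate imbalance (simplified)."""
    probs = np.clip(1.0 / (1.0 + np.exp(-(X @ beta))), 1e-5, 1 - 1e-5)
    N = design_weights.sum()
    N_t = design_weights[in_pop == 1.0].sum()
    # Pseudo-weights: positive on target rows, negative on sample rows.
    pseudo = N / N_t * (in_pop - probs) / (1.0 - probs)
    # Zero when the reweighted sample totals match the target totals.
    imbalance = X.T @ (design_weights / N * pseudo)
    return np.abs(imbalance @ XtXinv @ imbalance)


# Two sample rows followed by two target rows with identical covariates.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
in_pop = np.array([0.0, 0.0, 1.0, 1.0])
design_weights = np.ones(4)
XtXinv = np.linalg.inv(X.T @ X)

loss_balanced = bal_loss_sketch(np.zeros(2), X, design_weights, in_pop, XtXinv)
loss_unbalanced = bal_loss_sketch(np.array([0.0, 1.0]), X, design_weights, in_pop, XtXinv)
```

With beta set to zero every row gets probability 0.5, the sample already matches the target, and the loss is zero. A non-zero slope distorts the sample weights and the loss becomes positive.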
- balance.weighting_methods.cbps.cbps(sample_df: DataFrame, sample_weights: Series, target_df: DataFrame, target_weights: Series, variables: List[str] | None = None, transformations: str = 'default', na_action: str = 'add_indicator', formula: str | List[str] | None = None, balance_classes: bool = True, cbps_method: str = 'over', max_de: float | None = None, opt_method: str = 'COBYLA', opt_opts: Dict | None = None, weight_trimming_mean_ratio: None | float | int = 20, weight_trimming_percentile: float | None = None, random_seed: int = 2020, *args, **kwargs) → Dict[str, Series | Dict]
Fit a CBPS (covariate balancing propensity score) model for the sample using the target. Final weights are normalized to the target size. We use a two-step GMM estimator (the default in the R package) rather than the continuous-updating estimator suggested in the paper, because the two-step estimator runs much faster.
Paper: Imai, K., & Ratkovic, M. (2014). Covariate balancing propensity score. Journal of the Royal Statistical Society: Series B: Statistical Methodology, 243-263. https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/rssb.12027
R code source: https://github.com/kosukeimai/CBPS
Two-step GMM: https://en.wikipedia.org/wiki/Generalized_method_of_moments
- Parameters:
sample_df (pd.DataFrame) – a dataframe representing the sample
sample_weights (pd.Series) – design weights for sample
target_df (pd.DataFrame) – a dataframe representing the target
target_weights (pd.Series) – design weights for target
variables (Optional[List[str]], optional) – list of variables to include in the model. If None, all variables shared by sample_df and target_df are used. Defaults to None.
transformations (str, optional) – which transformations to apply to the data before fitting the model (see the apply_transformations function). Defaults to “default”.
na_action (str, optional) – what to do with NAs (see the add_na_indicator function). Defaults to “add_indicator”.
formula (Optional[Union[str, List[str]]], optional) – the formula used to build the model. If a list of formulas is given, the model matrix is built in steps and the results are concatenated. Defaults to None.
balance_classes (bool, optional) – whether to balance the sample and target size for running the model. True is preferable for imbalanced cases. Defaults to True.
cbps_method (str, optional) – method used for cbps. “over” fits an over-identified model that combines the propensity score and covariate balancing conditions; “exact” fits a model that contains only the covariate balancing conditions. Defaults to “over”.
max_de (Optional[float], optional) – upper bound for the design effect of the computed weights. Default is None.
opt_method (str, optional) – type of optimization solver. See scipy.optimize.minimize() for other options. Defaults to “COBYLA”.
opt_opts (Optional[Dict], optional) – a dictionary of solver options. See scipy.optimize.minimize() for other options. Defaults to None.
weight_trimming_mean_ratio (Union[None, float, int], optional) – upper trimming ratio: weights are trimmed from above at mean(weights) * ratio. Defaults to 20.
weight_trimming_percentile (Optional[float], optional) – if not None, winsorization is applied. Defaults to None, in which case trimming is applied instead.
random_seed (int, optional) – a random seed. Defaults to 2020.
- Raises:
Exception – _description_
- Returns:
A dictionary that includes:
“weight” – the weights for the sample.
“model” – a dictionary with details about the fitted model (X_matrix_columns, deviance, beta_optimal, balance_optimize_result, gmm_optimize_result_glm_init, gmm_optimize_result_bal_init). It has the following shape:
"model": {
    "method": "cbps",
    "X_matrix_columns": X_matrix_columns_names,
    "deviance": deviance,
    "original_sum_weights": original_sum_weights,  # This can be used to reconstruct the propensity probabilities
    "beta_optimal": beta_opt,
    "beta_init_glm": beta_0,  # The initial estimator by glm
    "gmm_init": gmm_init,  # The rescaled initial estimator
    # The following are the results of the optimizations
    "rescale_initial_result": rescale_initial_result,
    "balance_optimize_result": balance_optimize_result,
    "gmm_optimize_result_glm_init": gmm_optimize_result_glm_init if cbps_method == "over" else None,
    "gmm_optimize_result_bal_init": gmm_optimize_result_bal_init if cbps_method == "over" else None,
},
- Return type:
Dict[str, Union[pd.Series, Dict]]
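The mean-ratio trimming and the normalization to target size described for cbps can be sketched in a few lines. This is only an illustration: trim_by_mean_ratio and normalize_to_total are hypothetical helper names, and the order of the two steps (and whether the library rescales again after trimming) is an assumption. A ratio of 2 is used instead of the default 20 so that the cap is visible on a tiny vector.

```python
import numpy as np


def trim_by_mean_ratio(weights, ratio):
    """Cap every weight at mean(weights) * ratio (trimming from above)."""
    w = np.asarray(weights, dtype=float)
    return np.minimum(w, w.mean() * ratio)


def normalize_to_total(weights, total):
    """Rescale weights so that they sum to `total`."""
    w = np.asarray(weights, dtype=float)
    return w * total / w.sum()


raw = np.array([1.0, 1.0, 1.0, 1.0, 96.0])  # mean is 20
trimmed = trim_by_mean_ratio(raw, ratio=2)  # cap at 2 * 20 = 40
final = normalize_to_total(trimmed, total=100.0)  # e.g. target size of 100
```

Only the single extreme weight is affected by the cap; the normalization then rescales all weights by the same factor.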
- balance.weighting_methods.cbps.compute_deff_from_beta(X: ndarray[Any, dtype[ScalarType]], beta: ndarray[Any, dtype[ScalarType]], design_weights: ndarray[Any, dtype[ScalarType]], in_pop: ndarray[Any, dtype[ScalarType]]) → float64
This is a helper function for cbps. It computes the design effect of the estimated weights on the sample given a value of beta. It is used for setting a constraint on max_de.
- Parameters:
X (np.ndarray) – covariates matrix
beta (np.ndarray) – vector of coefficients
design_weights (np.ndarray) – vector of design weights of sample and target
in_pop (np.ndarray) – indicator vector for target
- Returns:
design effect
- Return type:
np.float64
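As a rough illustration of how a design effect can be derived from beta, the sketch below assumes that "design effect" means Kish's design effect and that the sample weights are proportional to the design weight times the odds p / (1 - p). Both function names are hypothetical, and the clipping bounds mirror the default documented for logit_truncated.

```python
import numpy as np


def kish_design_effect(weights):
    """Kish's design effect: n * sum(w^2) / (sum(w))^2."""
    w = np.asarray(weights, dtype=float)
    return len(w) * np.sum(w**2) / np.sum(w) ** 2


def deff_from_beta_sketch(X, beta, design_weights, in_pop):
    """Design effect of the implied sample weights for a given beta."""
    probs = np.clip(1.0 / (1.0 + np.exp(-(X @ beta))), 1e-5, 1 - 1e-5)
    is_sample = in_pop == 0.0
    # Sample weights are proportional to the odds; any constant factor
    # cancels in the design effect.
    odds = probs[is_sample] / (1.0 - probs[is_sample])
    return kish_design_effect(design_weights[is_sample] * odds)


X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 0.0], [1.0, 1.0]])
in_pop = np.array([0.0, 0.0, 0.0, 1.0, 1.0])
design_weights = np.ones(5)

deff_flat = deff_from_beta_sketch(X, np.zeros(2), design_weights, in_pop)
deff_steep = deff_from_beta_sketch(X, np.array([0.0, 1.0]), design_weights, in_pop)
```

Equal weights give a design effect of 1, and more variable weights give a larger value, which is why an upper bound such as max_de can be expressed as a constraint on beta.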
- balance.weighting_methods.cbps.compute_pseudo_weights_from_logit_probs(probs: ndarray[Any, dtype[ScalarType]], design_weights: ndarray[Any, dtype[ScalarType]] | DataFrame, in_pop: ndarray[Any, dtype[ScalarType]] | DataFrame) → ndarray[Any, dtype[ScalarType]]
This is a helper function for cbps. Given computed probs, it computes the weights N/N_t * (in_pop - p_i)/(1 - p_i). (Note that, for notational convenience, the weights on the sample are negative.)
- Parameters:
probs (np.ndarray) – vector of probabilities
design_weights (Union[np.ndarray, pd.DataFrame]) – vector of design weights of sample and target
in_pop (Union[np.ndarray, pd.DataFrame]) – indicator vector for target
- Returns:
np.ndarray of computed weights
- Return type:
np.ndarray
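The formula N/N_t * (in_pop - p_i)/(1 - p_i) can be written out directly. In this sketch, pseudo_weights_sketch is a hypothetical name, and it is assumed that N is the sum of all design weights and N_t is the sum of the design weights over target rows.

```python
import numpy as np


def pseudo_weights_sketch(probs, design_weights, in_pop):
    """Compute N / N_t * (in_pop - p) / (1 - p) for every row."""
    probs = np.asarray(probs, dtype=float)
    design_weights = np.asarray(design_weights, dtype=float)
    in_pop = np.asarray(in_pop, dtype=float)
    N = design_weights.sum()  # assumed: total over sample and target
    N_t = design_weights[in_pop == 1.0].sum()  # assumed: total over target rows
    return N / N_t * (in_pop - probs) / (1.0 - probs)


probs = np.array([0.5, 0.8, 0.5, 0.2])
design_weights = np.ones(4)
in_pop = np.array([0.0, 0.0, 1.0, 1.0])  # two sample rows, two target rows
weights = pseudo_weights_sketch(probs, design_weights, in_pop)
```

For target rows (in_pop = 1) the expression reduces to the constant N/N_t; for sample rows (in_pop = 0) it equals minus N/N_t times the odds p/(1 - p), which is why the sample weights come out negative.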
- balance.weighting_methods.cbps.gmm_function(beta: ndarray[Any, dtype[ScalarType]], X: ndarray[Any, dtype[ScalarType]] | DataFrame, design_weights: ndarray[Any, dtype[ScalarType]] | DataFrame, in_pop: ndarray[Any, dtype[ScalarType]] | DataFrame, invV: ndarray[Any, dtype[ScalarType]] | None = None) → Dict[str, float | ndarray[Any, dtype[ScalarType]]]
This is a helper function for cbps. It computes the gmm loss.
- Parameters:
beta (np.ndarray) – vector of coefficients
X (Union[np.ndarray, pd.DataFrame]) – covariates matrix
design_weights (Union[np.ndarray, pd.DataFrame]) – vector of design weights of sample and target
in_pop (Union[np.ndarray, pd.DataFrame]) – indicator vector for target
invV (Union[np.ndarray, None], optional) – the inverse weighting matrix for GMM. Default is None.
- Returns:
- Dict with two items for loss and invV:
loss (float) – computed GMM loss
invV (np.ndarray) – the weighting matrix for GMM
- Return type:
Dict[str, Union[float, np.ndarray]]
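The interplay between the loss and invV in a two-step GMM fit can be sketched generically. The code below is not the library's implementation: gmm_function_sketch is a hypothetical name, the moment conditions are the plain logistic score rather than the CBPS conditions, design weights are left out for brevity, and the dictionary keys "loss" and "invV" are assumed from the Returns description.

```python
import numpy as np


def gmm_function_sketch(beta, X, in_pop, invV=None):
    """Return {'loss': gbar' invV gbar, 'invV': weighting matrix}."""
    probs = 1.0 / (1.0 + np.exp(-(X @ beta)))
    # Hypothetical per-row moment contributions: the logistic score.
    g = X * (in_pop - probs)[:, None]
    gbar = g.mean(axis=0)
    if invV is None:
        # Estimate the weighting matrix from the second moments of g.
        invV = np.linalg.pinv(g.T @ g / len(g))
    return {"loss": float(gbar @ invV @ gbar), "invV": invV}


X = np.column_stack([np.ones(6), np.array([0.0, 1.0, 2.0, 0.0, 1.0, 2.0])])
in_pop = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])

# Step 1: evaluate at an initial beta and keep the weighting matrix.
first = gmm_function_sketch(np.zeros(2), X, in_pop)
# Step 2: reuse that matrix, held fixed, when evaluating other betas.
second = gmm_function_sketch(np.array([-1.0, 1.0]), X, in_pop, invV=first["invV"])
```

Holding invV fixed in the second step is what distinguishes a two-step estimator from a continuous-updating one, where the weighting matrix would be recomputed at every candidate beta.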
- balance.weighting_methods.cbps.gmm_loss(beta: ndarray[Any, dtype[ScalarType]], X: ndarray[Any, dtype[ScalarType]] | DataFrame, design_weights: ndarray[Any, dtype[ScalarType]] | DataFrame, in_pop: ndarray[Any, dtype[ScalarType]] | DataFrame, invV: ndarray[Any, dtype[ScalarType]] | None = None) → float | ndarray[Any, dtype[ScalarType]]
This is a helper function for cbps. It computes the GMM loss. See gmm_function for details.
- Parameters:
beta (np.ndarray) – vector of coefficients
X (Union[np.ndarray, pd.DataFrame]) – covariates matrix
design_weights (Union[np.ndarray, pd.DataFrame]) – vector of design weights of sample and target
in_pop (Union[np.ndarray, pd.DataFrame]) – indicator vector for target
invV (Union[np.ndarray, None], optional) – the inverse weighting matrix for GMM. Default is None.
- Returns:
loss (float) – computed GMM loss
- Return type:
Union[float, np.ndarray]
- balance.weighting_methods.cbps.logit_truncated(X: ndarray[Any, dtype[ScalarType]] | DataFrame, beta: ndarray[Any, dtype[ScalarType]], truncation_value: float = 1e-05) → ndarray[Any, dtype[ScalarType]]
This is a helper function for cbps. Given a matrix X and a vector of coefficients beta, it computes the truncated version of the logit function.
- Parameters:
X (Union[np.ndarray, pd.DataFrame]) – Covariate matrix
beta (np.ndarray) – vector of coefficients
truncation_value (float, optional) – bound for the computed probabilities, which are kept within [truncation_value, 1 - truncation_value]. Defaults to 1e-5.
- Returns:
numpy array of computed probabilities
- Return type:
np.ndarray
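A minimal sketch of such a truncated logistic transform is shown below. The name logit_truncated_sketch is hypothetical, and it is assumed that truncation means clipping the probabilities to [truncation_value, 1 - truncation_value].

```python
import numpy as np


def logit_truncated_sketch(X, beta, truncation_value=1e-5):
    """Logistic probabilities for X @ beta, clipped away from 0 and 1."""
    X = np.asarray(X, dtype=float)
    beta = np.asarray(beta, dtype=float)
    probs = 1.0 / (1.0 + np.exp(-(X @ beta)))
    # Clipping keeps later divisions by (1 - p) numerically stable.
    return np.clip(probs, truncation_value, 1.0 - truncation_value)


X = np.array([[1.0, 0.0], [1.0, 50.0], [1.0, -50.0]])
beta = np.array([0.0, 1.0])
probs = logit_truncated_sketch(X, beta)
```

The first row has a linear predictor of zero and maps to 0.5, while the two extreme rows would be numerically indistinguishable from 1 and 0 without clipping and are pulled back to the bounds.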