To produce the balancing weights, use the Sample.adjust() method to adjust a sample to the population:
adjusted = sample.adjust()
The output of this method is an adjusted Sample class object of the form:
Adjusted balance Sample object with target set using ipw
1000 observations x 3 variables: gender,age_group,income
id_column: id, weight_column: weight,
balance Sample object
10000 observations x 3 variables: gender,age_group,income
id_column: id, weight_column: weight,
3 common variables: income,gender,age_group
Note that the adjust method in balance performs three main steps:
- Pre-processing of the data - getting data ready for adjustment using best practices in the field:
  - Handling missing values - balance handles missing values automatically by adding a column '_is_na' to any variable that contains missing values. The advantage of this is that missing values are then treated as a separate category in the adjustment.
  - Feature engineering - by default, balance applies feature engineering so that the adjustment can fit the full covariate distribution, and not only its first moment. Specifically, each continuous variable is bucketed into 10 quantile buckets. Furthermore, rare categories in categorical variables are grouped together to avoid overfitting to rare events.
- Fitting the model and calculating the weights: the model fitted depends on the method chosen by the user. Current options are inverse propensity score weighting using regularized logistic regression (ipw), covariate balancing propensity score (cbps), post-stratification (poststratify), and raking (rake).
- Post-processing of the weights:
  - Trimming weights - balance trims the weights to avoid overfitting of the model and unnecessary variance inflation.
  - Normalizing weights to population size. The resulting weights can be interpreted as approximating the number of units in the population that each sample unit represents.
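The normalization step can be illustrated with a minimal sketch. It is not balance's implementation; the function name normalize_to_population and the example numbers are hypothetical. It only shows the idea of rescaling weights so that they sum to the population size.

```python
def normalize_to_population(weights, population_size):
    """Rescale weights so that they sum to population_size.

    Hypothetical helper for illustration; not part of the balance API.
    """
    scale = population_size / sum(weights)
    return [w * scale for w in weights]


# Three sample units with raw weights, representing a population of 10,000.
raw_weights = [0.5, 1.0, 2.5]
normalized = normalize_to_population(raw_weights, population_size=10_000)
```

After this rescaling, each weight reads as the approximate number of population units that the sample unit stands for.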
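method: the weighting method to use; the options described below are ipw, cbps, poststratify, and rake. Default is ipw.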
ipw: stands for Inverse Propensity Weighting. The propensity scores are calculated with LASSO logistic regression. Details about the implementation can be found here. For a quick-start tutorial, see here.
cbps: stands for Covariate Balancing Propensity Score. The CBPS algorithm estimates the propensity score in a way that optimizes both the prediction of the probability of sample inclusion and the covariate balance. Its main advantage arises when the researcher wants better balance on the covariates than traditional propensity score methods provide, because the assignment model might be misspecified and one would like to avoid an iterative procedure of balancing the covariates. Details about the implementation can be found here. For a quick-start tutorial, see here.
poststratify: stands for post-stratification. Details about the implementation can be found here.
rake: Details about the implementation can be found here. For a quick-start tutorial, see here.
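Of the methods above, post-stratification is the simplest to illustrate. The following is a minimal sketch of the textbook idea, not balance's implementation: each sample unit gets the weight (target count in its cell) / (sample count in its cell). The function name poststratify_weights and the toy data are hypothetical, and the sketch assumes unweighted inputs and that every sample cell also appears in the target.

```python
from collections import Counter


def poststratify_weights(sample_cells, target_cells):
    """Weight each sample unit by target cell count / sample cell count.

    Hypothetical helper for illustration; not part of the balance API.
    """
    sample_counts = Counter(sample_cells)
    target_counts = Counter(target_cells)
    return [target_counts[cell] / sample_counts[cell] for cell in sample_cells]


# Toy data: the sample over-represents "F" relative to the target.
sample_cells = ["F", "F", "M"]
target_cells = ["F"] * 50 + ["M"] * 50
weights = poststratify_weights(sample_cells, target_cells)
```

The weights sum to the target size, and the weighted share of each cell in the sample matches its share in the target.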
variables: allows the user to pass a list of the covariates to adjust for; if the variables argument is not specified, all variables common to the sample and the target are used.
transformations: which transformations to apply to the data before fitting the model. The default is to cut numeric variables into 10 quantile buckets and to lump infrequent levels with less than 5% prevalence into a lumped_other category. The transformations are applied to the sample dataframe and the target dataframe together. The user can also specify particular transformations in dictionary format. For a quick-start tutorial on transformations and formulas, see here.
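The two default transformations can be sketched in plain Python. This is an illustration under stated assumptions, not balance's code: the function names are hypothetical, the bucketing is rank-based (so tied values may land in different buckets), and in practice the transformations would be computed on the sample and target data combined.

```python
from collections import Counter


def lump_rare_levels(values, min_share=0.05, other_label="lumped_other"):
    """Replace levels whose share is below min_share with other_label."""
    counts = Counter(values)
    n = len(values)
    rare = {level for level, count in counts.items() if count / n < min_share}
    return [other_label if v in rare else v for v in values]


def quantile_bucket(values, n_buckets=10):
    """Assign each value a rank-based bucket index in 0..n_buckets-1."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    buckets = [0] * len(values)
    for rank, i in enumerate(order):
        buckets[i] = rank * n_buckets // len(values)
    return buckets


# "c" makes up 3% of the observations, so it falls below the 5% threshold.
levels = ["a"] * 60 + ["b"] * 37 + ["c"] * 3
lumped = lump_rare_levels(levels)

# 100 distinct numeric values in shuffled order: 10 values per bucket.
numeric = [(i * 37) % 100 for i in range(100)]
bucketed = quantile_bucket(numeric)
```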
max_de (for ipw and cbps methods): The default value is 1.5. It caps the design effect of the weights at 1.5. If set to None, the optimization is performed by cross-validation of the logistic model for ipw (see the choose_regularization function for more details) or without constrained optimization for cbps. Setting it to None can sometimes significantly improve the running time of the code.
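For intuition about the quantity that max_de bounds, here is a minimal sketch of Kish's design effect, computed as mean(w^2) / mean(w)^2. The function name is hypothetical, and the sketch assumes this is the design-effect definition in use.

```python
def design_effect(weights):
    """Kish's design effect: mean of squared weights over squared mean weight."""
    n = len(weights)
    mean_w = sum(weights) / n
    mean_w_squared = sum(w * w for w in weights) / n
    return mean_w_squared / mean_w ** 2


equal = design_effect([1, 1, 1, 1])   # equal weights give no variance inflation
skewed = design_effect([1, 1, 1, 5])  # one dominant weight inflates the design effect
```

Equal weights give a design effect of 1, while the skewed example gives 1.75, which would exceed the default limit of 1.5.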
weight_trimming_mean_ratio or weight_trimming_percentile (only one of these arguments can be specified): weight_trimming_mean_ratio is the ratio used to trim weights from above, at mean(weights) * ratio. Default is 20. If weight_trimming_percentile is not None, winsorization is applied. Default is None, i.e. trimming from above is applied. However, note that when max_de is not None (and its default is 1.5), the trimming ratio is optimized by ipw and these arguments are ignored.
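Trimming from above by a mean ratio can be sketched as follows. This is an illustration only: the function name trim_by_mean_ratio is hypothetical, and the sketch simply caps weights at mean(weights) * ratio without any further rescaling that a real implementation might apply.

```python
def trim_by_mean_ratio(weights, ratio=20):
    """Cap every weight at ratio * mean(weights)."""
    cap = ratio * sum(weights) / len(weights)
    return [min(w, cap) for w in weights]


# With the default ratio of 20, only the extreme weight is trimmed.
weights = [1] * 99 + [1000]
trimmed = trim_by_mean_ratio(weights)

# A small ratio trims more aggressively: the mean is 2, so the cap is 4.
trimmed_small = trim_by_mean_ratio([1, 1, 1, 5], ratio=2)
```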
(for the ipw method): how to handle missing values in the data (sample and target). The default is to replace NAs with 0s and add an indicator for which observations were NA (this is done after applying the transformations). Another option is drop, which drops all observations with NA values.
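The default missing-value handling can be sketched for a single column as below. The function name add_na_indicator and the "_is_na_" prefix used for the indicator column name are assumptions made for illustration; they are not taken from balance's code.

```python
def add_na_indicator(name, values):
    """Replace missing values (None) with 0 and add a boolean indicator column.

    Returns a dict of column name -> list. The indicator column is added only
    when the input contains at least one missing value.
    """
    columns = {name: [0 if v is None else v for v in values]}
    if any(v is None for v in values):
        columns["_is_na_" + name] = [v is None for v in values]
    return columns


with_na = add_na_indicator("age", [25, None, 40])
without_na = add_na_indicator("income", [10, 20, 30])
```

Because the indicator enters the model as its own covariate, missingness is adjusted for as a separate category rather than being dropped.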
(for the ipw and cbps methods): the formula used to build the model matrix for the logistic regression. The default is a linear additive formula of all covariates. For a quick-start tutorial on transformations and formulas, see here.
(for the ipw method): the penalty used in the regularized logistic regression.