# Adjusting Sample to Population

To produce the balancing weights, use the `Sample.adjust()` method to adjust a sample to the population:

`adjusted = sample.adjust()`

The output of this method is an adjusted `Sample` class object of the form:

    Adjusted balance Sample object with target set using ipw
    1000 observations x 3 variables: gender,age_group,income
    id_column: id, weight_column: weight,
    outcome_columns: happiness

    target:

        balance Sample object
        10000 observations x 3 variables: gender,age_group,income
        id_column: id, weight_column: weight,
        outcome_columns: None

        3 common variables: income,gender,age_group

Note that the `adjust` method in balance performs three main steps:

- **Pre-processing** of the data - getting the data ready for adjustment using best practices in the field:
  - Handling missing values - balance handles missing values automatically by adding a column '_is_na' to any variable that contains missing values. The advantage is that missing values are then treated as a separate category in the adjustment.
  - Feature engineering - by default, balance applies feature engineering so that the fit captures the covariate distribution, not only its first moment. Specifically, each continuous variable is bucketed into 10 quantile buckets. Furthermore, rare categories in categorical variables are grouped together to avoid overfitting to rare events.
- **Fitting the model** and calculating the weights: the model fitted depends on the `method` chosen by the user. Current options are inverse propensity score weighting using regularized logistic regression (`ipw`), covariate balancing propensity score (`cbps`), post-stratification (`poststratify`), and raking (`rake`).
- **Post-processing** of the weights:
  - Trimming weights - balance trims the weights to avoid overfitting the model and unnecessary variance inflation.
  - Normalizing weights to population size. The resulting weights can be described as approximating the number of units in the population that each unit of the sample represents.
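The three steps can be illustrated with a toy, standard-library-only sketch. It is not balance's implementation: the helper names (`preprocess`, `poststratify`, `normalize`) and the `NA_LEVEL` label are hypothetical, the fitting step uses plain post-stratification on a single variable because it is the simplest of the listed methods, and trimming is left out for brevity.

```python
from collections import Counter

NA_LEVEL = "_is_na"  # hypothetical label for the missing-value category


def preprocess(values):
    """Step 1: keep missing values (None) as their own category."""
    return [NA_LEVEL if v is None else v for v in values]


def poststratify(sample_cells, target_cells):
    """Step 2: weight each sample unit by target share / sample share of its cell."""
    sample_counts = Counter(sample_cells)
    target_counts = Counter(target_cells)
    n_sample, n_target = len(sample_cells), len(target_cells)
    return [
        (target_counts[c] / n_target) / (sample_counts[c] / n_sample)
        for c in sample_cells
    ]


def normalize(weights, population_size):
    """Step 3: rescale the weights so that they sum to the population size."""
    total = sum(weights)
    return [w * population_size / total for w in weights]


# Toy data: one categorical covariate with a missing value in each dataset.
sample_gender = ["F"] * 6 + ["M"] * 3 + [None]          # 10 sample units
target_gender = ["F"] * 50 + ["M"] * 40 + [None] * 10   # 100 target units

raw = poststratify(preprocess(sample_gender), preprocess(target_gender))
weights = normalize(raw, population_size=len(target_gender))

for label, w in zip(sample_gender, weights):
    print(label, round(w, 2))
```

In this toy data, women are over-represented in the sample (60% versus 50% in the target), so each of them receives a smaller weight than each man. After normalization, every weight reads as the approximate number of target units that the sample unit stands for.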

## Optional arguments

- `method`: `ipw`, `poststratify`, `rake`, or `cbps`. Default is `ipw`.
  - `ipw`: stands for Inverse Propensity Weighting. The propensity scores are calculated with LASSO logistic regression. Details about the implementation can be found here. For a quick-start tutorial, see here.
  - `cbps`: stands for Covariate Balancing Propensity Score. The CBPS algorithm estimates the propensity score in a way that optimizes both the prediction of the probability of sample inclusion and the covariate balance. Its main advantage arises when the researcher wants better balance on the covariates than traditional propensity score methods provide, because the assignment model might be misspecified and the researcher would like to avoid an iterative procedure of balancing the covariates. Details about the implementation can be found here. For a quick-start tutorial, see here.
  - `poststratify`: stands for post-stratification. Details about the implementation can be found here.
  - `rake`: stands for raking. Details about the implementation can be found here. For a quick-start tutorial, see here.
- `variables`: allows the user to pass a list of the covariates to adjust for; if the `variables` argument is not specified, all variables common to the sample and the target are used.
- `transformations`: which transformations to apply to the data before fitting the model. The default is to cut numeric variables into 10 quantile buckets and to lump infrequent levels with less than 5% prevalence into a `lumped_other` category. The transformations are applied to the sample dataframe and the target dataframe together. The user can also specify particular transformations in a dictionary format. For a quick-start tutorial on transformations and formulas, see here.
- `max_de` (for `ipw` and `cbps` methods): The default value is 1.5. It limits the **design effect** to be at most 1.5. If set to `None`, the optimization is performed by cross-validation of the logistic model for `ipw` (see the `choose_regularization` function for more details) or without constrained optimization for `cbps`. Setting `max_de` to `None` can sometimes significantly improve the running time of the code.
- `weight_trimming_mean_ratio` **or** `weight_trimming_percentile` (only one of these arguments can be specified): `weight_trimming_mean_ratio` is the ratio used to trim the weights from above, at mean(weights) * ratio. Default is 20. If `weight_trimming_percentile` is not `None`, winsorization is applied instead. Its default is `None`, i.e. trimming from above is applied. However, note that when `max_de` is not `None` (and its default is 1.5), the trimming ratio is optimized by `ipw` and these arguments are ignored.
- `na_action` (for `ipw` method): how to handle missing values in the data (sample and target). The default is to replace NAs with 0's and add an indicator for which observations were NA (this is done after applying the transformations). Another option is `drop`, which drops all observations with NA values.
- `formula` (for `ipw` and `cbps` methods): The formula used to build the model matrix for the logistic regression. Default is a linear additive formula of all covariates. For a quick-start tutorial on transformations and formulas, see here.
- `penalty_factor` (for `ipw` method): the penalty used in the regularized logistic regression.
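To make the interplay between `max_de` and `weight_trimming_mean_ratio` more concrete, the following standard-library sketch computes a design effect and trims weights from above. It assumes the design effect is Kish's formula, n * sum(w^2) / sum(w)^2, and the helper names `design_effect` and `trim_by_mean_ratio` are hypothetical illustrations, not functions of balance.

```python
def design_effect(weights):
    """Kish's design effect: n * sum(w^2) / (sum(w))^2 (assumed definition)."""
    n = len(weights)
    return n * sum(w * w for w in weights) / sum(weights) ** 2


def trim_by_mean_ratio(weights, ratio):
    """Cap every weight from above at mean(weights) * ratio."""
    cap = sum(weights) / len(weights) * ratio
    return [min(w, cap) for w in weights]


# One extreme weight inflates the design effect.
weights = [1.0, 1.0, 1.0, 1.0, 16.0]
print(design_effect(weights))

# Trimming at 2 * mean(weights) caps the extreme weight and lowers the design effect.
trimmed = trim_by_mean_ratio(weights, ratio=2)
print(trimmed, design_effect(trimmed))

# Equal weights have a design effect of exactly 1.
print(design_effect([2.0, 2.0, 2.0, 2.0]))
```

A smaller trimming ratio pulls extreme weights closer to the mean, which lowers the design effect at the cost of a weaker adjustment. This is the trade-off that a cap such as `max_de=1.5` constrains.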