balance.weighting_methods.poststratify

balance.weighting_methods.poststratify.poststratify(sample_df: DataFrame, sample_weights: Series, target_df: DataFrame, target_weights: Series, variables: List[str] | None = None, transformations: Dict[str, Callable[[...], Any]] | str | None = 'default', transformations_drop: bool = True, strict_matching: bool = True, na_action: str = 'add_indicator', weight_trimming_mean_ratio: float | int | None = None, weight_trimming_percentile: float | None = None, keep_sum_of_weights: bool = True, *args: Any, formula: str | List[str] | None = None, store_fit_metadata: bool = False, **kwargs: Any) → Dict[str, Any]

Perform cell-based post-stratification: adjust sample weights so that the weighted sample matches the joint distribution of one or more specified variables in the target population.

This method computes one weight per cell (a unique combination of the supplied variables) so that the weighted sample reproduces the cell distribution observed in the target population. When more than one variable is supplied, the function operates on cells of the joint distribution, as opposed to raking, which operates on the marginal distributions.
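The per-cell adjustment described above can be sketched in plain Python. This is an illustrative sketch only, not the library's implementation: the helper name cell_weights and the toy data are hypothetical, and it assumes every sample cell also appears in the target.

```python
from collections import defaultdict


def cell_weights(sample_cells, sample_w, target_cells, target_w):
    """Scale each sample weight by (target cell total / sample cell total).

    Hypothetical helper for illustration; assumes every sample cell
    is also present in the target.
    """
    sample_tot = defaultdict(float)
    target_tot = defaultdict(float)
    for cell, w in zip(sample_cells, sample_w):
        sample_tot[cell] += w
    for cell, w in zip(target_cells, target_w):
        target_tot[cell] += w
    return [
        w * target_tot[cell] / sample_tot[cell]
        for cell, w in zip(sample_cells, sample_w)
    ]


# Each cell is a tuple: one entry per post-stratification variable.
sample_cells = [("F", "young"), ("F", "young"), ("M", "old")]
sample_w = [1.0, 3.0, 2.0]
target_cells = [("F", "young"), ("M", "old")]
target_w = [8.0, 4.0]

weights = cell_weights(sample_cells, sample_w, target_cells, target_w)
print(weights)  # [2.0, 6.0, 4.0]
```

Within each cell the relative sizes of the design weights are preserved, and the weighted cell totals (8 and 4) equal the target's cell totals.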

Reference: https://docs.wfp.org/api/documents/WFP-0000121326/download/

Parameters:
  • sample_df (pd.DataFrame) – DataFrame representing the sample.

  • sample_weights (pd.Series) – Design weights for the sample.

  • target_df (pd.DataFrame) – DataFrame representing the target population.

  • target_weights (pd.Series) – Design weights for the target.

  • variables (Optional[List[str]], optional) – List of variables to define post-stratification cells. If None, uses the intersection of columns in sample_df and target_df.

  • transformations (Dict[str, Callable[..., Any]] | str | None, optional) – Transformations to apply to data before fitting the model. Accepts the same forms as balance.adjustment.apply_transformations(). Defaults to "default".

  • transformations_drop (bool, optional) – If True, drops variables not affected by transformations. Default is True.

  • strict_matching (bool, optional) – If True, requires all sample cells to be present in the target. If False, sample rows whose cell is missing in the target are assigned weight 0 (and a warning is issued). Default is True.

  • na_action (str, optional) – How to handle missing values. Use "add_indicator" to treat missing values as their own category, or "drop" to remove rows with missing values from both sample and target. Defaults to "add_indicator".

  • weight_trimming_mean_ratio (Union[float, int, None], optional) – Forwarded to balance.adjustment.trim_weights() to clip weights at a multiple of the mean.

  • weight_trimming_percentile (Union[float, None], optional) – Percentile limit(s) for winsorisation, passed to balance.adjustment.trim_weights().

  • keep_sum_of_weights (bool, optional) – Preserve the sum of weights during trimming before the final normalisation to the target total. Defaults to True.

  • formula (Optional[Union[str, List[str]]], optional) –

    Formula-like specification to select post-stratification variables, as an alternative to variables. Supported operators are : (interaction), . (all common columns of sample and target), - (exclude a variable), and an optional leading ~ (the LHS is ignored). Examples: "a:b:c", ".", ". - c", "y ~ a:b", ["a", "b"] (in list form, all items are combined into joint cells).

    Additive operators + and * are not supported and will raise ValueError. Post-stratification defines cells by the joint distribution of the selected variables — every variable added only refines the cell grid — so a + b, a * b and a:b would all produce identical cells. Rejecting +/* prevents users from silently writing a formula that looks like a main-effects model but is treated as a joint interaction. (Note: raking, unlike post-stratification, operates on marginals and will support additive formulas when it gains a formula= argument.)

    Parsing uses patsy operators for variable extraction only; general patsy transforms/functions (e.g., np.log(a)) are not supported. Mutually exclusive with non-empty variables.

  • *args – Additional positional arguments (currently unused).

  • store_fit_metadata (bool, optional) – Whether to include fit-time artifacts in the returned model dictionary so BalanceFrame.predict_weights() can reconstruct poststratification weights. Defaults to False.

  • **kwargs – Reserved for backward compatibility. Unknown keys raise TypeError to avoid silently ignoring typos.
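To make the interaction between weight_trimming_mean_ratio and keep_sum_of_weights more concrete, here is a minimal sketch of mean-ratio trimming. It is an assumption-laden illustration, not the code of balance.adjustment.trim_weights(): the helper trim_by_mean_ratio is hypothetical, and the library's actual clipping and rescaling rules may differ in detail.

```python
def trim_by_mean_ratio(weights, ratio, keep_sum=True):
    """Clip weights at ratio * mean, optionally rescaling to the original sum.

    Hypothetical helper; NOT the implementation of
    balance.adjustment.trim_weights().
    """
    total = sum(weights)
    cap = ratio * total / len(weights)  # ratio times the mean weight
    clipped = [min(w, cap) for w in weights]
    if keep_sum:
        # Rescale so the trimmed weights add up to the original total.
        scale = total / sum(clipped)
        clipped = [w * scale for w in clipped]
    return clipped


raw = [1.0, 1.0, 1.0, 9.0]  # mean is 3.0, so ratio=2 caps at 6.0
clipped_only = trim_by_mean_ratio(raw, ratio=2, keep_sum=False)
trimmed = trim_by_mean_ratio(raw, ratio=2, keep_sum=True)
print(clipped_only)  # [1.0, 1.0, 1.0, 6.0]
```

In this sketch, rescaling after clipping restores the total at the cost of pushing the largest weight somewhat above the cap again.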

Returns:

weight (pd.Series): Final weights for the sample. With strict_matching=True (the default), every sample cell is also present in the target and the weights sum to the target’s total weight. With strict_matching=False, sample rows whose cell is absent from the target are assigned weight 0, so the weights sum to the total target weight restricted to cells that are present in the sample (i.e. target-only cells are effectively excluded).

model (dict): Description of the adjustment method used, with optional fit metadata when store_fit_metadata=True.

Return type:

dict

Raises:

ValueError – If strict_matching is True and some sample cells are missing in the target.

Notes

  • The function expects that every combination of variables present in sample_df is also present in target_df. Set strict_matching=False to keep rows whose cell is missing in the target and assign them weight 0.

  • When no variables are provided, the intersection of columns in sample_df and target_df is used. In practice you will usually provide a small number of categorical variables (often one or two) describing the post-stratification cells.
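The two strict_matching behaviours in the first note can be sketched as a simple set comparison. This is a hypothetical stand-alone helper (check_cells is not part of the library) that only mirrors the documented outcomes: a ValueError in strict mode, a warning otherwise.

```python
import warnings


def check_cells(sample_cells, target_cells, strict_matching=True):
    """Return sample cells absent from the target, raising or warning.

    Hypothetical helper mirroring the documented behaviour only.
    """
    missing = set(sample_cells) - set(target_cells)
    if missing and strict_matching:
        raise ValueError(f"sample cells missing from target: {sorted(missing)}")
    if missing:
        warnings.warn(f"weight 0 assigned to cells missing from target: {sorted(missing)}")
    return missing


sample_cells = ["Female", "Male", "Other"]
target_cells = ["Female", "Male"]

strict_error = None
try:
    check_cells(sample_cells, target_cells)  # strict mode raises
except ValueError as err:
    strict_error = str(err)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    missing = check_cells(sample_cells, target_cells, strict_matching=False)
```

Rows in the returned missing cells are the ones that would receive weight 0 in non-strict mode.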

Examples

Post-stratifying on a single categorical variable:

>>> import pandas as pd
>>> from balance.weighting_methods.poststratify import poststratify
>>> sample_df = pd.DataFrame({"gender": ["Female", "Male", "Female"]})
>>> target_df = pd.DataFrame({"gender": ["Female", "Female", "Male", "Male"]})
>>> design = pd.Series(1, index=sample_df.index)
>>> target_design = pd.Series(1, index=target_df.index)
>>> weights = poststratify(
...     sample_df=sample_df,
...     sample_weights=design,
...     target_df=target_df,
...     target_weights=target_design,
...     variables=["gender"],
... )["weight"]
>>> weights.tolist()
[1.0, 2.0, 1.0]

Post-stratifying on the joint distribution of two variables (the resulting weights depend on the combination of both columns rather than their marginals):

>>> sample_df = pd.DataFrame(
...     {
...         "gender": ["Female", "Female", "Male", "Male"],
...         "age_group": ["18-34", "35+", "18-34", "35+"],
...     }
... )
>>> target_df = pd.DataFrame(
...     {
...         "gender": ["Female", "Female", "Female", "Male", "Male", "Male"],
...         "age_group": ["18-34", "18-34", "35+", "18-34", "35+", "35+"],
...     }
... )
>>> design = pd.Series(1, index=sample_df.index)
>>> target_design = pd.Series(1, index=target_df.index)
>>> weights = poststratify(
...     sample_df=sample_df,
...     sample_weights=design,
...     target_df=target_df,
...     target_weights=target_design,
...     variables=["gender", "age_group"],
... )["weight"]
>>> weights.tolist()
[2.0, 1.0, 1.0, 2.0]
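The two-variable example was chosen so that the sample already matches the target on each marginal, yet differs on the joint cells. The following plain-Python check (a sketch using only the standard library; the shares helper is hypothetical) makes that visible:

```python
from collections import Counter

# Same rows as the two-variable example, written as (gender, age_group) cells.
sample = [("Female", "18-34"), ("Female", "35+"), ("Male", "18-34"), ("Male", "35+")]
target = (
    [("Female", "18-34")] * 2
    + [("Female", "35+")]
    + [("Male", "18-34")]
    + [("Male", "35+")] * 2
)


def shares(values):
    """Relative frequency of each distinct value."""
    counts = Counter(values)
    n = sum(counts.values())
    return {key: count / n for key, count in counts.items()}


gender_match = shares(g for g, _ in sample) == shares(g for g, _ in target)
age_match = shares(a for _, a in sample) == shares(a for _, a in target)
joint_match = shares(sample) == shares(target)
print(gender_match, age_match, joint_match)  # True True False
```

Because both marginals already agree, an adjustment that looks only at marginals would have nothing to correct here, whereas the joint cells still call for the non-uniform weights shown above.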