balance.weighting_methods.rake

balance.weighting_methods.rake.prepare_marginal_dist_for_raking(dict_of_dicts: Dict[str, Dict[str, float]], max_length: int = 10000) → DataFrame[source]

Realizes a nested dictionary of proportions into a DataFrame.

Parameters:
  • dict_of_dicts – A nested dictionary where the outer keys are column names and the inner dictionaries have keys as category labels and values as their proportions (real numbers).

  • max_length – Maximum number of rows in the resulting DataFrame. Must be an integer. When the natural least common multiple (LCM) based row count would exceed this value the output is capped using Hare-Niemeyer (largest remainder) allocation. Default is 10000.

Returns:

A DataFrame with columns specified by the outer keys of the input dictionary and rows containing the category labels according to their proportions. An additional “id” column is added with integer values as row identifiers.

Examples:

.. code-block:: python

    from balance.weighting_methods.rake import prepare_marginal_dist_for_raking

    df = prepare_marginal_dist_for_raking(
        {"A": {"a": 0.5, "b": 0.5}, "B": {"x": 0.2, "y": 0.8}}
    )
    df.columns.tolist()  # ['A', 'B', 'id']
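The max_length cap described above relies on Hare-Niemeyer (largest remainder) allocation to turn proportions into an integer row count per category. A minimal, self-contained sketch of that allocation step, independent of balance (the helper name is hypothetical, not part of the library):

```python
import math

def largest_remainder_counts(proportions, total):
    """Allocate `total` integer rows across categories by largest-remainder
    (Hare-Niemeyer) apportionment. Illustrative sketch only."""
    quotas = {k: p * total for k, p in proportions.items()}
    counts = {k: math.floor(q) for k, q in quotas.items()}
    # Hand the remaining rows to the categories with the largest
    # fractional remainders.
    leftover = total - sum(counts.values())
    by_remainder = sorted(quotas, key=lambda k: quotas[k] - counts[k], reverse=True)
    for k in by_remainder[:leftover]:
        counts[k] += 1
    return counts

largest_remainder_counts({"x": 0.2, "y": 0.8}, 9)  # {'x': 2, 'y': 7}
```

The allocation always sums exactly to `total`, which is why capping at max_length keeps the realized proportions as close as integer counts allow.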

balance.weighting_methods.rake.rake(sample_df: DataFrame, sample_weights: Series, target_df: DataFrame, target_weights: Series, variables: List[str] | None = None, transformations: Dict[str, Callable[[...], Any]] | str | None = 'default', na_action: str = 'add_indicator', max_iteration: int = 1000, convergence_rate: float = 0.0005, rate_tolerance: float = 1e-08, weight_trimming_mean_ratio: float | int | None = None, weight_trimming_percentile: float | None = None, keep_sum_of_weights: bool = True, *args: Any, store_fit_metadata: bool = False, **kwargs: Any) → Dict[str, Any][source]

Perform raking (using the iterative proportional fitting algorithm). See: https://en.wikipedia.org/wiki/Iterative_proportional_fitting

Returns weights normalised to the sum of the target weights.
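The core of the algorithm can be sketched without the library: cycle over the weighting variables, and on each pass multiply every row's weight by the ratio of the target marginal proportion to the current weighted sample proportion for that row's level, stopping when no marginal differs by more than convergence_rate. This is an illustrative sketch only (the function name is hypothetical); balance's rake() additionally handles design weights, NAs, trimming, and rate_tolerance:

```python
import pandas as pd

def ipf_weights(sample_df, target_marginals, max_iteration=1000, convergence_rate=0.0005):
    """Iterative proportional fitting over categorical columns.

    target_marginals: {column: {level: target proportion}}.
    Returns weights normalised to sum to 1. Sketch only.
    """
    w = pd.Series(1.0, index=sample_df.index)
    for _ in range(max_iteration):
        max_diff = 0.0
        for col, target in target_marginals.items():
            # Current weighted marginal distribution of this column.
            props = w.groupby(sample_df[col]).sum() / w.sum()
            max_diff = max(max_diff, (props - pd.Series(target)).abs().max())
            # Rescale each row's weight by target/current for its level.
            ratio = pd.Series(target) / props
            w = w * sample_df[col].map(ratio)
        if max_diff < convergence_rate:
            break
    return w / w.sum()
```

For a single variable this converges in one pass; with several variables each pass can disturb the previously matched marginals, which is why iteration (and the convergence_rate / max_iteration controls) is needed.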

Arguments:
    sample_df — (pandas dataframe) a dataframe representing the sample.
    sample_weights — (pandas series) design weights for sample.
    target_df — (pandas dataframe) a dataframe representing the target.
    target_weights — (pandas series) design weights for target.
    variables — (list of strings) list of variables to include in the model.
        If None, all joint variables of sample_df and target_df are used.
    transformations — (dict) what transformations to apply to data before fitting the model.
        Default is "default" (see apply_transformations function).
    na_action — (string) what to do with NAs. Default is "add_indicator", which adds NaN as a
        group (called "__NaN__") for each weighting variable (post-transformation); "drop"
        removes rows with any missing values on any variable from both sample and target.
    max_iteration — (int) maximum number of iterations for the iterative proportional
        fitting algorithm.
    convergence_rate — (float) convergence criterion; the maximum difference in proportions
        between the sample and target marginal distribution on any covariate for the
        algorithm to converge.
    rate_tolerance — (float) convergence criterion; if the convergence rate does not move
        by more than this amount, the algorithm is also considered to have converged.
    weight_trimming_mean_ratio — (float, int, optional) upper bound for weights expressed
        as a multiple of the mean weight. Delegated to balance.adjustment.trim_weights().
    weight_trimming_percentile — (float, optional) percentile limit(s) for winsorisation.
        Delegated to balance.adjustment.trim_weights().
    keep_sum_of_weights — (bool, optional) preserve the sum of weights during trimming
        before rescaling to the target total. Defaults to True.
    store_fit_metadata — (bool, optional, keyword-only) when True, persist fit-time
        artifacts in "model" for BalanceFrame.predict_weights() replay/transfer workflows.
        Defaults to False.
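The mean-ratio trimming option can be pictured as clipping every weight at weight_trimming_mean_ratio times the mean weight, then (when keep_sum_of_weights is True) rescaling so the total is unchanged. A conceptual sketch under those assumptions; the exact behaviour lives in balance.adjustment.trim_weights() and may differ in detail:

```python
import pandas as pd

def trim_by_mean_ratio(weights, mean_ratio, keep_sum_of_weights=True):
    """Clip weights at mean_ratio * mean(weights), optionally preserving
    the total. Conceptual sketch, not trim_weights() itself."""
    cap = mean_ratio * weights.mean()
    trimmed = weights.clip(upper=cap)
    if keep_sum_of_weights:
        # Redistribute the clipped mass proportionally over all rows.
        trimmed = trimmed * (weights.sum() / trimmed.sum())
    return trimmed
```

Note that rescaling after clipping can push the largest weights slightly back above the cap; trimming bounds the influence of extreme rows rather than enforcing a hard ceiling on the final weights.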

Returns:
    A dictionary including:
        "weight" — the weights for the sample.
        "model" — parameters of the model: iterations (a dataframe with iteration numbers
            and convergence rate information at all steps) and converged (a flag with the
            output status: 0 for failure and 1 for success). When store_fit_metadata=True
            it also includes fit-time artifacts for BalanceFrame.predict_weights()
            reconstruction.

Notes: When exactly one adjustment variable is selected (either explicitly via variables=[...] or implicitly because only one common variable exists), this function delegates to balance.weighting_methods.poststratify.poststratify(). In that fallback path, the returned model metadata records method='poststratify' and the returned weight series is renamed to rake_weight for API consistency. Because BalanceFrame.predict_weights(data=...) dispatches by model['method'], delegated fits follow poststratify’s transfer-scoring capabilities/limitations rather than rake’s.
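The single-variable fallback amounts to classical poststratification: weight each sample cell by the ratio of target to sample weighted proportions for that cell. A self-contained sketch of that idea (the function name is hypothetical; the real delegation target is balance.weighting_methods.poststratify.poststratify()):

```python
import pandas as pd

def poststratify_single(sample_df, sample_weights, target_df, target_weights, variable):
    """Weight each level of `variable` by the target/sample proportion ratio.
    Illustrative sketch of the one-variable case only."""
    sample_dist = sample_weights.groupby(sample_df[variable]).sum() / sample_weights.sum()
    target_dist = target_weights.groupby(target_df[variable]).sum() / target_weights.sum()
    ratio = target_dist / sample_dist
    w = sample_weights * sample_df[variable].map(ratio)
    # Normalise to the sum of target weights, matching rake()'s convention.
    return w * (target_weights.sum() / w.sum())
```

With one variable there are no cross-marginal interactions to reconcile, so the closed-form ratio above reproduces what iterative fitting would converge to, which is why the delegation is lossless for the weights themselves.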

BalanceFrame.predict_weights() for rake reuses the fitted cell-ratio surface from this function (effectively m_fit / m_sample per joint cell) and applies it to design weights in the scoring sample. This is exact in-place replay (same sample rows as fit), but for data=... it is a transfer operation whose validity depends on the new sample having a similar joint distribution over rake variables as the training sample. If the joint distribution diverges, transferred rake weights can fail to recover target marginals even though the same fitted model artifacts are used. In that case, re-fit rake on the new sample against the same target. For this reason, balance emits an unconditional warning on transferred scoring (predict_weights(data=...)) when the fit is otherwise replayable, and raises for known unreplayable cases — currently transformations='default' and explicit dicts containing the known data-dependent helpers (quantize, fct_lump).
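The transfer step described above can be sketched as: compute, per joint cell of the rake variables, the ratio of fitted weight mass to design weight mass on the training sample (m_fit / m_sample), then multiply the new sample's design weights by the ratio of their cell. This is an illustrative reconstruction under stated assumptions, not balance's stored-artifact format (the function name and string cell key are hypothetical):

```python
import pandas as pd

def apply_cell_ratios(fit_df, fitted_weights, fit_design_weights,
                      new_df, new_design_weights, variables):
    """Reapply a fitted cell-ratio surface to a new sample's design weights.
    Sketch of the transfer operation; validity depends on the new sample's
    joint distribution over `variables` resembling the training sample's."""
    # Joint cell key over the rake variables (string join keeps the sketch simple).
    fit_cells = fit_df[variables].astype(str).apply("|".join, axis=1)
    # Per-cell ratio: fitted weight mass over design weight mass (m_fit / m_sample).
    ratio = fitted_weights.groupby(fit_cells).sum() / fit_design_weights.groupby(fit_cells).sum()
    new_cells = new_df[variables].astype(str).apply("|".join, axis=1)
    # Cells unseen at fit time map to NaN; a real implementation must decide
    # how to handle them.
    return new_design_weights * new_cells.map(ratio)
```

Because the ratios are fixed at fit time, the transferred weights only recover the target marginals if the new sample's joint cell distribution matches the training sample's, which is exactly the failure mode the warning on predict_weights(data=...) is guarding against.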

Examples:

.. code-block:: python

    import pandas as pd
    from balance.weighting_methods.rake import rake

    sample_df = pd.DataFrame({"x": ["a", "b"]})
    target_df = pd.DataFrame({"x": ["a", "b"]})
    sample_weights = pd.Series([1.0, 1.0])
    target_weights = pd.Series([1.0, 1.0])
    result = rake(sample_df, sample_weights, target_df, target_weights, variables=["x"])
    result["weight"].tolist()  # [1.0, 1.0]