balance.adjustment

balance.adjustment.apply_transformations(dfs: Tuple[DataFrame, ...], transformations: Dict[str, Callable] | str | None, drop: bool = True) → Tuple[DataFrame, ...]
Apply the transformations specified in transformations to all of the dfs.
  • If a column specified in transformations does not exist in the dataframes, it is added.

  • If a column is not specified in transformations, it is dropped, unless drop==False.

  • The dfs are concatenated together before transformations are applied, so functions like max are relative to the column in all dfs.

  • The same variable cannot be transformed twice, nor can a variable be added and then transformed (i.e. the definition of the added variable should include the transformation).

  • If you get a cryptic error about mismatched data types, make sure your transformations are not being treated as additions because of missing columns (use _set_warnings("DEBUG") to check).
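The concatenation rule is easiest to see with a function that depends on the whole column. The sketch below mimics that behaviour with plain pandas instead of calling apply_transformations; the frames sample and target and the helper scale_by_max are hypothetical names used only for illustration, and the actual implementation may differ in detail.

```python
import pandas as pd

# Hypothetical inputs: two frames sharing a column "d".
sample = pd.DataFrame({"d": [1, 2, 3]})
target = pd.DataFrame({"d": [10, 20]})


def scale_by_max(col: pd.Series) -> pd.Series:
    # Depends on the whole column, so the result changes with what is stacked.
    return col / col.max()


# Stack the frames, transform once, then split back by the outer key.
stacked = pd.concat([sample, target], keys=["sample", "target"])
stacked["d"] = scale_by_max(stacked["d"])
sample_out = stacked.loc["sample"]
target_out = stacked.loc["target"]

print(sample_out["d"].tolist())  # [0.05, 0.1, 0.15]
print(target_out["d"].tolist())  # [0.5, 1.0]
```

Because the maximum (20) comes from the stacked column, the sample values are scaled relative to the target frame as well, not only to their own maximum of 3.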

Parameters:
  • dfs (Tuple[pd.DataFrame, ...]) – The DataFrames on which to operate

  • transformations (Union[Dict[str, Callable], str, None]) – Mapping from column name to function to apply. Transformations of existing columns should be specified as functions of those columns (e.g. lambda x: x*2), whereas additions of new columns should be specified as functions of the DataFrame (e.g. lambda x: x.column_a + x.column_b).

  • drop (bool, optional) – Whether to drop columns which are not specified in transformations. Defaults to True.

Raises:

NotImplementedError – When passing an unknown “transformations” argument.

Returns:

tuple of pd.DataFrames

Return type:

Tuple[pd.DataFrame, ...]

Examples

from balance.adjustment import apply_transformations
import pandas as pd
import numpy as np

apply_transformations(
    (pd.DataFrame({'d': [1, 2, 3], 'e': [4, 5, 6]}),),
    {'d': lambda x: x*2, 'f': lambda x: x.d+x.e}
)

    # (   f  d
    #  0  5  2
    #  1  7  4
    #  2  9  6,)
balance.adjustment.default_transformations(dfs: Tuple[DataFrame, ...] | List[DataFrame]) → Dict[str, Callable]

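Build the default transformations for dfs, i.e. quantize for numeric columns and fct_lump for non-numeric and boolean columns. The transformations are returned as a dict; they are not applied to dfs by this function.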

Parameters:

dfs (Union[Tuple[pd.DataFrame, ...], List[pd.DataFrame]]) – A list or tuple of dataframes

Returns:

Dict of transformations

Return type:

Dict[str, Callable]

balance.adjustment.trim_weights(weights: Series | ndarray[Any, dtype[_ScalarType_co]], weight_trimming_mean_ratio: float | int | None = None, weight_trimming_percentile: float | None = None, verbose: bool = False, keep_sum_of_weights: bool = True, target_sum_weights: float | int | floating | None = None) → Series

Trim extreme weights using mean ratio clipping or percentile-based winsorization.

The user cannot supply both weight_trimming_mean_ratio and weight_trimming_percentile. If neither is supplied, the original weights are returned unchanged.

Mean Ratio Trimming (weight_trimming_mean_ratio): When specified, weights are clipped from above at mean(weights) * ratio, then renormalized to preserve the original mean. The clip is a hard upper bound on the weights before renormalization. Note: Final weights may slightly exceed mean(weights) * ratio, because renormalization redistributes the clipped weight mass across all observations.
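The clip-then-renormalize step can be sketched with NumPy alone. This is a minimal illustration of the description above, not the library's code; the variable names are illustrative, and the inputs are the same weights (1 to 100) and ratio (1.5) used in the Examples section.

```python
import numpy as np

weights = np.arange(1, 101, dtype=float)  # same values as range(1, 101)
ratio = 1.5  # stands in for weight_trimming_mean_ratio

original_mean = weights.mean()        # 50.5
upper_bound = original_mean * ratio   # 75.75
clipped = np.minimum(weights, upper_bound)

# Renormalize so the mean matches the original mean again.
trimmed = clipped * (original_mean / clipped.mean())

print(round(trimmed.max(), 6))                    # 80.640316
print(round(trimmed.max() / trimmed.mean(), 4))   # 1.5968
```

The largest weight ends up at about 1.6 times the mean rather than exactly 1.5, which is the renormalization effect described in the note.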

Percentile-Based Winsorization (weight_trimming_percentile): When specified, extreme weights are replaced with less extreme values using scipy.stats.mstats.winsorize. By default, winsorization affects both tails of the distribution symmetrically, unlike mean ratio trimming which only clips from above.

Behavior:
  • Single value (e.g., 0.1): Winsorizes below the 10th AND above the 90th percentile

  • Tuple (lower, upper): Winsorizes independently on each side

    • (0.1, 0): Only winsorizes below the 10th percentile

    • (0, 0.1): Only winsorizes above the 90th percentile

    • (0.01, 0.05): Winsorizes below the 1st AND above the 95th percentile

Important implementation detail: Percentile limits are automatically adjusted upward slightly (via _validate_limit) to ensure at least one value gets winsorized at boundary percentiles. This prevents edge cases where discrete distributions or floating-point precision might prevent winsorization at the exact percentile value. The adjustment is min(2/n_weights, limit/10), capped at 1.0.
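The arithmetic of the adjustment can be sketched as follows. _validate_limit is a private helper whose code is not shown here, so adjusted_limit below is a hypothetical stand-in: it assumes the adjustment is added to the requested limit and that the resulting limit is what gets capped at 1.0.

```python
def adjusted_limit(limit: float, n_weights: int) -> float:
    """Hypothetical stand-in for the adjustment described above."""
    # Assumption: the extra amount is added to the requested limit.
    extra = min(2 / n_weights, limit / 10)
    # Assumption: the cap of 1.0 applies to the adjusted limit.
    return min(limit + extra, 1.0)


for limit in (0.01, 0.05, 0.5):
    print(limit, round(adjusted_limit(limit, n_weights=100), 4))
# 0.01 0.011
# 0.05 0.055
# 0.5 0.52
```

Under these assumptions a limit of 0 stays at 0, since limit/10 is then 0, which is consistent with (0.1, 0) leaving the upper tail untouched.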

After trimming/winsorization, if keep_sum_of_weights=True (default), weights are rescaled to preserve the original sum of weights. Alternatively, pass a target_sum_weights to rescale the trimmed weights so their sum matches a desired total.
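The two rescaling options amount to multiplying by a single factor. The sketch below uses NumPy only and is not the library's code; trimmed is a stand-in for a winsorized result (the five largest of the weights 1 to 100 pulled down to 95), and the target total of 100 is an arbitrary illustrative value.

```python
import numpy as np

original = np.arange(1, 101, dtype=float)
# Stand-in for a trimmed result: the five largest weights pulled down to 95.
trimmed = np.minimum(original, 95.0)

# keep_sum_of_weights=True: rescale to match the original total.
kept = trimmed * (original.sum() / trimmed.sum())

# target_sum_weights: rescale to match a chosen total instead.
target_sum = 100.0  # illustrative value
targeted = trimmed * (target_sum / trimmed.sum())

print(round(kept.max(), 6))                             # 95.283019
print(round(kept.sum(), 6), round(targeted.sum(), 6))   # 5050.0 100.0
```

The rescaled maximum of 95.283019 agrees with the last rows of the weight_trimming_percentile=(0., .05) output in the Examples section.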

Parameters:
  • weights (Union[pd.Series, np.ndarray]) – Weights to trim. np.ndarray will be converted to pd.Series internally.

  • weight_trimming_mean_ratio (Union[float, int], optional) – Ratio for upper bound clipping as mean(weights) * ratio. Mutually exclusive with weight_trimming_percentile. Defaults to None.

  • weight_trimming_percentile (Union[float, Tuple[float, float]], optional) –

    Percentile limits for winsorization. Value(s) must be between 0 and 1.

    • Single float: Symmetric winsorization on both tails

    • Tuple[float, float]: (lower_percentile, upper_percentile) for independent control of each tail

    Mutually exclusive with weight_trimming_mean_ratio. Defaults to None.

  • verbose (bool, optional) – Whether to log details about the trimming process. Defaults to False.

  • keep_sum_of_weights (bool, optional) – Whether to rescale weights after trimming to preserve the original sum of weights. Defaults to True.

  • target_sum_weights (Union[float, int, np.floating, None], optional) – If provided, rescale the trimmed weights so their sum equals this target. None (default) leaves the post-trimming sum unchanged.

Raises:
  • TypeError – If weights is not an np.ndarray or a pd.Series.

  • ValueError – If both weight_trimming_mean_ratio and weight_trimming_percentile are specified, or if weight_trimming_percentile tuple has length != 2.

Returns:

Trimmed weights with the same index as input

Return type:

pd.Series (of type float64)

Examples

import pandas as pd
from balance.adjustment import trim_weights
print(trim_weights(pd.Series(range(1, 101)), weight_trimming_mean_ratio = None))
    # 0       1.0
    # 1       2.0
    # 2       3.0
    # 3       4.0
    # 4       5.0
    #     ...
    # 95     96.0
    # 96     97.0
    # 97     98.0
    # 98     99.0
    # 99    100.0
    # Length: 100, dtype: float64

print(trim_weights(pd.Series(range(1, 101)), weight_trimming_mean_ratio = 1.5))
    # 0      1.064559
    # 1      2.129117
    # 2      3.193676
    # 3      4.258235
    # 4      5.322793
    #         ...
    # 95    80.640316
    # 96    80.640316
    # 97    80.640316
    # 98    80.640316
    # 99    80.640316
    # Length: 100, dtype: float64

print(pd.DataFrame(trim_weights(pd.Series(range(1, 101)), weight_trimming_percentile=.01)))
    # 0    2.0
    # 1    2.0
    # 2    3.0
    # 3    4.0
    # 4    5.0
    # ..   ...
    # 95  96.0
    # 96  97.0
    # 97  98.0
    # 98  99.0
    # 99  99.0
    # [100 rows x 1 columns]

print(pd.DataFrame(trim_weights(pd.Series(range(1, 101)), weight_trimming_percentile=(0., .05))))
    # 0    1.002979
    # 1    2.005958
    # 2    3.008937
    # 3    4.011917
    # 4    5.014896
    # ..        ...
    # 95  95.283019
    # 96  95.283019
    # 97  95.283019
    # 98  95.283019
    # 99  95.283019