balance.adjustment

balance.adjustment.apply_transformations(dfs: Tuple[DataFrame, ...], transformations: Dict[str, Callable] | str | None, drop: bool = True) -> Tuple[DataFrame, ...]
Apply the transformations specified in transformations to all of the dfs.
  • If a column specified in transformations does not exist in the dataframes, it is added.

  • If a column is not specified in transformations, it is dropped, unless drop==False.

  • The dfs are concatenated together before transformations are applied, so functions like max are relative to the column in all dfs.

  • You cannot transform the same variable twice, or add a variable and then transform it (i.e. the definition of the added variable should include the transformation).

  • If you get a cryptic error about mismatched data types, make sure your transformations are not being treated as additions because of missing columns (use _set_warnings(“DEBUG”) to check).

Parameters:
  • dfs (Tuple[pd.DataFrame, ...]) – The DataFrames on which to operate

  • transformations (Union[Dict[str, Callable], str, None]) – Mapping from column name to function to apply. Transformations of existing columns should be specified as functions of those columns (e.g. lambda x: x*2), whereas additions of new columns should be specified as functions of the DataFrame (e.g. lambda x: x.column_a + x.column_b).

  • drop (bool, optional) – Whether to drop columns which are not specified in transformations. Defaults to True.

Raises:

NotImplementedError – When passing an unknown “transformations” argument.

Returns:

tuple of pd.DataFrames

Return type:

Tuple[pd.DataFrame, …]

Examples

from balance.adjustment import apply_transformations
import pandas as pd
import numpy as np

apply_transformations(
    (pd.DataFrame({'d': [1, 2, 3], 'e': [4, 5, 6]}),),
    {'d': lambda x: x*2, 'f': lambda x: x.d+x.e}
)

    # (   f  d
    #  0  5  2
    #  1  7  4
    #  2  9  6,)
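
The note that the dfs are concatenated before transformations are applied can be illustrated with plain pandas. This is only a sketch of the idea, not the library's internal code; the frame names sample and target and the scaling function are made up for illustration.

```python
import pandas as pd

# Two hypothetical frames that share the column "d".
sample = pd.DataFrame({"d": [1, 2, 3]})
target = pd.DataFrame({"d": [4, 8]})

# Stack the frames, using keys to remember where each row came from.
stacked = pd.concat([sample, target], keys=["sample", "target"])

# The maximum is taken over both frames (8), not over each frame separately.
stacked["d"] = stacked["d"] / stacked["d"].max()

# Split the result back into one frame per original input.
sample_out = stacked.loc["sample"]
target_out = stacked.loc["target"]

print(sample_out["d"].tolist())
print(target_out["d"].tolist())
```

Had the scaling been applied per frame, the largest value in sample would have become 1.0; here it is 3 / 8 because the maximum comes from target.
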

balance.adjustment.default_transformations(dfs: Tuple[DataFrame, ...] | List[DataFrame]) -> Dict[str, Callable]

Return the default transformations for dfs: quantize for numeric columns and fct_lump for non-numeric and boolean columns.

Parameters:

dfs (Union[Tuple[pd.DataFrame, ...], List[pd.DataFrame]]) – A list or tuple of dataframes

Returns:

Dict of transformations

Return type:

Dict[str, Callable]
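
A minimal sketch of the dtype rule described above, written with plain pandas. It is an assumption about how such a choice could be made, not the library's implementation: the helper name sketch_default_choice is hypothetical, and it returns the transformation names as strings rather than callables.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype


def sketch_default_choice(dfs):
    """Map each column name to the name of the transformation it would get."""
    choice = {}
    for df in dfs:
        for column, dtype in df.dtypes.items():
            # Check booleans first: pandas counts bool as a numeric dtype.
            if is_bool_dtype(dtype):
                choice[column] = "fct_lump"
            elif is_numeric_dtype(dtype):
                choice[column] = "quantize"
            else:
                choice[column] = "fct_lump"
    return choice


df = pd.DataFrame(
    {"age": [25, 40, 61], "member": [True, False, True], "city": ["a", "b", "a"]}
)
print(sketch_default_choice([df]))
```

The order of the checks matters in this sketch, because a boolean column would otherwise fall into the numeric branch.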

balance.adjustment.trim_weights(weights: Series | ndarray[Any, dtype[ScalarType]], weight_trimming_mean_ratio: float | int | None = None, weight_trimming_percentile: float | None = None, verbose: bool = False, keep_sum_of_weights: bool = True) -> Series

Trim extreme weights.

The user cannot supply both weight_trimming_mean_ratio and weight_trimming_percentile. If neither is supplied, the original weights are returned.

If weight_trimming_mean_ratio is not None, the weights are trimmed from above at mean(weights) * ratio. The weights are then normalized to have the original mean. Note that the trimmed weights are not actually bounded by mean(weights) * ratio, because the weight removed by trimming is redistributed to restore the original mean.
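
The arithmetic of mean-ratio trimming can be sketched in plain pandas. This follows the description above (clip at mean(weights) * ratio, then rescale to the original mean) and is not the library's code; the helper name trim_by_mean_ratio_sketch is hypothetical.

```python
import pandas as pd


def trim_by_mean_ratio_sketch(weights, ratio):
    """Clip weights from above at mean * ratio, then restore the original mean."""
    w = pd.Series(weights, dtype="float64")
    cap = w.mean() * ratio
    clipped = w.clip(upper=cap)
    # Rescale so the mean (and therefore the sum) matches the input.
    return clipped * (w.mean() / clipped.mean())


weights = pd.Series(range(1, 101))
trimmed = trim_by_mean_ratio_sketch(weights, 1.5)

print(weights.mean() * 1.5)  # the cap applied before rescaling
print(trimmed.max())         # the largest weight after rescaling
```

For the weights 1 to 100 the cap is 50.5 * 1.5 = 75.75, but after rescaling the largest weight is about 80.64, which is why the result is not bounded by the cap.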

If weight_trimming_percentile is not None, the weights are trimmed according to the percentiles of the weight distribution. By default, weight_trimming_percentile clips both sides of the distribution, unlike weight_trimming_mean_ratio, which trims the weights only from above. For example, weight_trimming_percentile=0.1 trims below the 10th percentile and above the 90th. To trim only the upper side, specify weight_trimming_percentile=(0, 0.1). To trim only the lower side, specify weight_trimming_percentile=(0.1, 0).
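
The one-sided and two-sided behavior can be illustrated with a simplified winsorization in plain pandas. This is a sketch under stated assumptions, not the library's code (which, per the parameter description, relies on scipy.stats.mstats.winsorize()): the helper name winsorize_sketch is hypothetical, and it counts the trimmed observations by simple truncation.

```python
import pandas as pd


def winsorize_sketch(weights, percentile, keep_sum=True):
    """Replace the extreme tails with the nearest value that is kept."""
    w = pd.Series(weights, dtype="float64")
    # A single value trims both tails by the same share.
    if isinstance(percentile, tuple):
        lower, upper = percentile
    else:
        lower, upper = percentile, percentile
    ordered = w.sort_values().to_numpy()
    n = len(ordered)
    low_value = ordered[int(lower * n)]           # smallest value kept
    high_value = ordered[n - int(upper * n) - 1]  # largest value kept
    clipped = w.clip(lower=low_value, upper=high_value)
    if keep_sum:
        # Rescale so the total weight matches the input.
        clipped = clipped * (w.sum() / clipped.sum())
    return clipped


both_sides = winsorize_sketch(range(1, 101), 0.01)
upper_only = winsorize_sketch(range(1, 101), (0.0, 0.05))

print(both_sides.iloc[0], both_sides.iloc[-1])
print(upper_only.iloc[0], upper_only.iloc[-1])
```

With (0.0, 0.05) only the five largest weights are pulled down to 95 before rescaling, while the single value 0.01 moves one weight at each end.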

Parameters:
  • weights (Union[pd.Series, np.ndarray]) – Weights to trim. An np.ndarray is converted to a pd.Series.

  • weight_trimming_mean_ratio (Union[float, int], optional) – The ratio used to trim weights from above: weights larger than mean(weights) * ratio are trimmed to that value. Defaults to None.

  • weight_trimming_percentile (Union[float], optional) – If not None, winsorization is applied using scipy.stats.mstats.winsorize(). Values range between 0 and 1. If a single value is passed, it gives the percentile on both sides of the weight distribution beyond which the weights are winsorized. If two values are passed, the first is the lower percentile below which winsorizing is applied, and the second is 1 minus the upper percentile above which winsorizing is applied. For example, weight_trimming_percentile=(0.01, 0.05) trims the weights with values below the 1st percentile and above the 95th percentile of the weight distribution. See also: [https://en.wikipedia.org/wiki/Winsorizing]. Defaults to None.

  • verbose (bool, optional) – Whether to log details of the trimming process. Defaults to False.

  • keep_sum_of_weights (bool, optional) – Whether the sum of weights after trimming should equal the sum of weights before trimming. Defaults to True.

Raises:
  • TypeError – If weights is not an np.ndarray or a pd.Series.

  • ValueError – If both weight_trimming_mean_ratio and weight_trimming_percentile are set.

Returns:

Trimmed weights

Return type:

pd.Series (of type float64)

Examples

import pandas as pd
from balance.adjustment import trim_weights
print(trim_weights(pd.Series(range(1, 101)), weight_trimming_mean_ratio = None))
    # 0       1.0
    # 1       2.0
    # 2       3.0
    # 3       4.0
    # 4       5.0
    #     ...
    # 95     96.0
    # 96     97.0
    # 97     98.0
    # 98     99.0
    # 99    100.0
    # Length: 100, dtype: float64

print(trim_weights(pd.Series(range(1, 101)), weight_trimming_mean_ratio = 1.5))
    # 0      1.064559
    # 1      2.129117
    # 2      3.193676
    # 3      4.258235
    # 4      5.322793
    #         ...
    # 95    80.640316
    # 96    80.640316
    # 97    80.640316
    # 98    80.640316
    # 99    80.640316
    # Length: 100, dtype: float64

print(pd.DataFrame(trim_weights(pd.Series(range(1, 101)), weight_trimming_percentile=.01)))
    # 0    2.0
    # 1    2.0
    # 2    3.0
    # 3    4.0
    # 4    5.0
    # ..   ...
    # 95  96.0
    # 96  97.0
    # 97  98.0
    # 98  99.0
    # 99  99.0
    # [100 rows x 1 columns]

print(pd.DataFrame(trim_weights(pd.Series(range(1, 101)), weight_trimming_percentile=(0., .05))))
    # 0    1.002979
    # 1    2.005958
    # 2    3.008937
    # 3    4.011917
    # 4    5.014896
    # ..        ...
    # 95  95.283019
    # 96  95.283019
    # 97  95.283019
    # 98  95.283019
    # 99  95.283019