balance.stats_and_plots.weights_stats

balance.stats_and_plots.weights_stats.design_effect(w: Series) → float64

Kish’s design effect measure.

The design effect is a number that shows how well a sample of people may represent a larger group of people for a specific measure of interest (such as the mean). Kish’s design effect gives the increase in the variance of the weighted mean based on “haphazard” weights.

The inverse of the design effect is the effective sample size ratio.

For details, see: Tal Galili (5 May 2024). “Design effect”. WikiJournal of Science 7 (1): 4. doi:10.15347/wjs/2024.004. Wikidata Q116768211. ISSN 2470-6345. https://en.wikipedia.org/wiki/Design_effect

Parameters:

w (pd.Series) – A pandas series of non-negative weights (float or int).

Returns:

An estimate of how much the variance of the weighted mean is expected to increase, compared with the mean of a simple random sample, due to the application of the weights.

Return type:

np.float64

Examples

from balance.stats_and_plots.weights_stats import design_effect
import pandas as pd

design_effect(pd.Series((0, 1, 2, 3)))
    # output:
    # 1.5555555555555556
design_effect(pd.Series((1, 1, 1000)))
    # output:
    # 2.9880418803112336
    # As expected: with a single dominating weight, the Deff is almost equal to the sample size (3).
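The outputs above are consistent with Kish's formula, Deff = n · Σw² / (Σw)², which equals mean(w²) / mean(w)². The following is a minimal sketch of that formula and of the effective sample size ratio mentioned earlier. The helper name kish_design_effect_sketch is hypothetical and is not part of balance; it is not claimed to be the library's implementation.

```python
import pandas as pd


def kish_design_effect_sketch(w: pd.Series) -> float:
    """Hypothetical helper: Kish's design effect, mean(w^2) / mean(w)^2."""
    w = w.astype(float)
    return float((w ** 2).mean() / w.mean() ** 2)


w = pd.Series((1, 1, 1000))
deff = kish_design_effect_sketch(w)

# The inverse of the design effect is the effective sample size ratio.
ess_ratio = 1 / deff
# Multiplying the ratio by n gives the effective sample size itself.
ess = len(w) * ess_ratio

print(round(deff, 4), round(ess_ratio, 4), round(ess, 4))
```

With one dominating weight, the effective sample size is close to 1: the weighted mean is driven almost entirely by a single observation.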
balance.stats_and_plots.weights_stats.nonparametric_skew(w: Series) → float

The nonparametric skew is the difference between the mean and the median, divided by the standard deviation. See: https://en.wikipedia.org/wiki/Nonparametric_skew

Parameters:

w (pd.Series) – A pandas series of non-negative weights (float or int).

Returns:

The skew value, between -1 and 1. For weights it is often positive (i.e., a right-tailed distribution). The returned value is 0 if the standard deviation is 0 (i.e., all values are identical) or if the input has length 1.

Return type:

np.float64

Examples

from balance.stats_and_plots.weights_stats import nonparametric_skew
import pandas as pd

nonparametric_skew(pd.Series((1, 1, 1, 1)))  # 0
nonparametric_skew(pd.Series((1)))           # 0
nonparametric_skew(pd.Series((1, 2, 3, 4)))  # 0
nonparametric_skew(pd.Series((1, 1, 1, 2)))  # 0.5
nonparametric_skew(pd.Series((-1, 1, 1, 1))) # -0.5
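The definition (mean minus median, divided by the standard deviation) can be sketched as follows. This is an illustration only: the helper name nonparametric_skew_sketch is hypothetical, and the use of the sample standard deviation (the pandas default, ddof=1) is an assumption inferred from the example outputs above rather than something the documentation states.

```python
import pandas as pd


def nonparametric_skew_sketch(w: pd.Series) -> float:
    """Hypothetical helper: (mean - median) / std, with 0 for degenerate input."""
    if len(w) == 1:
        # A single value has no spread; the documented result is 0.
        return 0.0
    std = w.std()  # assumption: sample standard deviation (ddof=1)
    if std == 0:
        # All values identical; the documented result is 0.
        return 0.0
    return float((w.mean() - w.median()) / std)


print(nonparametric_skew_sketch(pd.Series((1, 1, 1, 2))))   # 0.5
print(nonparametric_skew_sketch(pd.Series((-1, 1, 1, 1))))  # -0.5
```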
balance.stats_and_plots.weights_stats.prop_above_and_below(w: Series, below: Tuple[float, ...] | List[float] | None = (0.1, 0.2, 0.3333333333333333, 0.5, 1), above: Tuple[float, ...] | List[float] | None = (1, 2, 3, 5, 10), return_as_series: bool = True) → Series | Dict[Any, Any] | None

The proportion of weights, normalized to the sample size, that are above or below given thresholds (e.g., 1, 2, 3, 5, 10 and their inverses: 1, 1/2, 1/3, etc.). This is similar to returning percentiles of the distribution of (normalized) weights, but instead of focusing on the 25th percentile, the median, etc., we focus on more easily interpretable weight values.

For example, the proportion of users with a weight above 1 indicates how many users do not "lose" value after the weights are applied. The proportion of users with a weight below 1/10 tells us how many users contribute almost nothing to the final analysis (after applying the weights).

Note that below and above can overlap, be unordered, etc. The user is responsible for the order.

Parameters:
  • w (pd.Series) – A pandas series of non-negative weights (float).

  • below (Union[Tuple[float, ...], List[float], None], optional) – values to check which proportion of normalized weights are below them. Using None returns None. Defaults to (1/10, 1/5, 1/3, 1/2, 1).

  • above (Union[Tuple[float, ...], List[float], None], optional) – values to check which proportion of normalized weights are above (or equal to) them. Using None returns None. Defaults to (1, 2, 3, 5, 10).

  • return_as_series (bool, optional) – If True, returns one pd.Series of values. If False, returns a dict with two pd.Series (one for below and one for above). Defaults to True.

Returns:

If return_as_series is True, a pd.Series with the proportions of normalized weights that are below/above the given thresholds. The index indicates which threshold was checked (the values in the index are rounded to 3 decimal places for printing purposes). If return_as_series is False, a dict with the keys 'below' and 'above', each holding the relevant pd.Series (or None).

Return type:

Union[pd.Series, Dict]

Examples

from balance.stats_and_plots.weights_stats import prop_above_and_below
import pandas as pd

# normalized weights:
print(pd.Series((1, 2, 3, 4)) / pd.Series((1, 2, 3, 4)).mean())
    # 0    0.4
    # 1    0.8
    # 2    1.2
    # 3    1.6

# checking the function:
prop_above_and_below(pd.Series((1, 2, 3, 4)))
    # prop(w < 0.1)      0.00
    # prop(w < 0.2)      0.00
    # prop(w < 0.333)    0.00
    # prop(w < 0.5)      0.25
    # prop(w < 1.0)      0.50
    # prop(w >= 1)       0.50
    # prop(w >= 2)       0.00
    # prop(w >= 3)       0.00
    # prop(w >= 5)       0.00
    # prop(w >= 10)      0.00
    # dtype: float64

prop_above_and_below(pd.Series((1, 2, 3, 4)), below=(0.1, 0.5), above=(2, 3))
    # prop(w < 0.1)    0.00
    # prop(w < 0.5)    0.25
    # prop(w >= 2)     0.00
    # prop(w >= 3)     0.00
    # dtype: float64

prop_above_and_below(pd.Series((1, 2, 3, 4)), return_as_series=False)
    # {'below': prop(w < 0.1)      0.00
    # prop(w < 0.2)      0.00
    # prop(w < 0.333)    0.00
    # prop(w < 0.5)      0.25
    # prop(w < 1)        0.50
    # dtype: float64, 'above': prop(w >= 1)     0.5
    # prop(w >= 2)     0.0
    # prop(w >= 3)     0.0
    # prop(w >= 5)     0.0
    # prop(w >= 10)    0.0
    # dtype: float64}
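The calculation described for prop_above_and_below (normalize the weights by their mean, then take the share of values below or at-or-above each threshold) can be sketched roughly as follows. The helper name prop_sketch and its shortened default thresholds are hypothetical; this is not the library's implementation and omits the None handling and the dict return option.

```python
import pandas as pd


def prop_sketch(w: pd.Series, below=(0.1, 0.5, 1), above=(1, 2)) -> pd.Series:
    """Hypothetical helper: share of mean-normalized weights below/above thresholds."""
    # Normalizing by the mean makes the weights sum to the sample size.
    w_norm = w / w.mean()
    out = {}
    for b in below:
        # Strictly below the threshold.
        out[f"prop(w < {b})"] = float((w_norm < b).mean())
    for a in above:
        # Above or equal to the threshold.
        out[f"prop(w >= {a})"] = float((w_norm >= a).mean())
    return pd.Series(out)


result = prop_sketch(pd.Series((1, 2, 3, 4)))
print(result)
```

For the weights (1, 2, 3, 4), the normalized values are 0.4, 0.8, 1.2 and 1.6, so one of four is below 0.5 and two of four are below 1, matching the documented output.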
balance.stats_and_plots.weights_stats.weighted_median_breakdown_point(w: Series) → float64

Calculates the minimal proportion of users that hold at least 50% of the total weight. This gives the breakdown point of the weighted median, and can be thought of as a metric similar in spirit to the design effect.

See also:
  • https://en.wikipedia.org/wiki/Weighted_median
  • https://en.wikipedia.org/wiki/Robust_statistics#Breakdown_point

Parameters:

w (pd.Series) – A pandas series of non-negative weights (float).

Returns:

The minimal proportion of users (a value between 0 and 1) that hold at least 50% of the total weight.

Return type:

np.float64

Examples

from balance.stats_and_plots.weights_stats import weighted_median_breakdown_point
import pandas as pd

w = pd.Series([1, 1, 1, 1])
print(weighted_median_breakdown_point(w)) # 0.5

w = pd.Series([2, 2, 2, 2])
print(weighted_median_breakdown_point(w)) # 0.5

w = pd.Series([1, 1, 1, 10])
print(weighted_median_breakdown_point(w)) # 0.25

w = pd.Series([1, 1, 1, 1, 10])
print(weighted_median_breakdown_point(w)) # 0.2
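One way to arrive at these values is to sort the weights from largest to smallest and count how many users are needed before their cumulative weight reaches half of the total. The sketch below follows that idea. The helper name breakdown_point_sketch is hypothetical, and this is an illustration that reproduces the examples above, not necessarily how the library computes the result.

```python
import numpy as np
import pandas as pd


def breakdown_point_sketch(w: pd.Series) -> float:
    """Hypothetical helper: smallest share of users holding >= 50% of total weight."""
    # Largest weights first, so the fewest users are needed.
    sorted_desc = np.sort(w.to_numpy(dtype=float))[::-1]
    cumulative = np.cumsum(sorted_desc)
    total = cumulative[-1]
    # argmax on a boolean array returns the first True position.
    k = int(np.argmax(cumulative * 2 >= total)) + 1
    return k / len(w)


print(breakdown_point_sketch(pd.Series([1, 1, 1, 1])))      # 0.5
print(breakdown_point_sketch(pd.Series([1, 1, 1, 10])))     # 0.25
print(breakdown_point_sketch(pd.Series([1, 1, 1, 1, 10])))  # 0.2
```

A small value means that a few heavily weighted users determine the weighted median, much as a large design effect signals that a few users dominate the weighted mean.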