balance.stats_and_plots.weights_stats

class balance.stats_and_plots.weights_stats.KishStats(deff: float, ess: float, essp: float)[source]

Kish design-effect diagnostic bundle.

All three members share a single design_effect computation. The identities are:

  • deff = E[w^2] / E[w]^2 (>= 1 for non-degenerate weights)

  • ess  = n / deff (effective sample size)

  • essp = 1 / deff (effective sample proportion in [0, 1])

deff

Kish’s design effect.

Type:

float

ess

Kish’s effective sample size.

Type:

float

essp

Kish’s effective sample proportion.

Type:

float

deff: float

Alias for field number 0

ess: float

Alias for field number 1

essp: float

Alias for field number 2
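
The three identities listed for KishStats can be sketched in plain Python. This is a minimal illustration rather than the library's implementation: KishSketch and kish_sketch are hypothetical names introduced here, and the input is assumed to be a non-empty sequence of non-negative weights with at least one positive entry.

```python
from typing import NamedTuple, Sequence


class KishSketch(NamedTuple):
    """Hypothetical stand-in for KishStats, for illustration only."""

    deff: float
    ess: float
    essp: float


def kish_sketch(w: Sequence[float]) -> KishSketch:
    """Apply the three identities to a plain sequence of weights."""
    n = len(w)
    mean_w = sum(w) / n
    mean_w2 = sum(x * x for x in w) / n
    deff = mean_w2 / mean_w**2  # deff = E[w^2] / E[w]^2
    return KishSketch(deff=deff, ess=n / deff, essp=1 / deff)


equal = kish_sketch([1, 1, 1, 1])   # KishSketch(deff=1.0, ess=4.0, essp=1.0)
uneven = kish_sketch([0, 1, 2, 3])  # deff is roughly 1.556

# A namedtuple can be read by field name, by position, or unpacked.
deff, ess, essp = uneven
```

Because the fields are positional aliases (field numbers 0, 1 and 2), uneven[0] and uneven.deff refer to the same value.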

class balance.stats_and_plots.weights_stats.PropAboveBelowResult[source]

balance.stats_and_plots.weights_stats.design_effect(w: list[Any] | Series | ndarray[tuple[Any, ...], dtype[_ScalarT]] | DataFrame) → float64[source]

Kish’s design effect measure.

The design effect is a number that shows how well a sample of people may represent a larger group of people for a specific measure of interest (such as the mean). Kish’s design effect gives the increase in the variance of the weighted mean based on “haphazard” weights.

The inverse of the design effect is the effective sample size ratio.

Design effect in general can be lower than 1 for stratified sampling. However, when calculating Kish’s design effect for weights the design effect is always 1 or larger.

For details, see: Tal Galili (5 May 2024). “Design effect”. WikiJournal of Science 7 (1): 4. doi:10.15347/wjs/2024.004. Wikidata Q116768211. ISSN 2470-6345. https://en.wikipedia.org/wiki/Design_effect

Parameters:

w (list[Any] | pd.Series | npt.NDArray | pd.DataFrame) – Weights container with non-negative numeric values. If w is a DataFrame, only the first column is used.

Returns:

An estimate of the factor by which the variance of the weighted mean is expected to increase, relative to the mean of a simple random sample, due to the application of the weights.

Return type:

np.float64

Examples

from balance.stats_and_plots.weights_stats import design_effect
import pandas as pd

design_effect(pd.Series((0, 1, 2, 3)))
    # output:
    # 1.5555555555555556
design_effect(pd.Series((1, 1, 1000)))
    # output:
    # 2.9880418803112336
    # As expected: with a single dominating weight, Deff is almost equal to the sample size (here, 3).
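
One way to see why Kish's design effect never falls below 1 is the identity Deff = 1 + CV^2, where CV is the coefficient of variation of the weights (computed with the population variance). The sketch below checks this numerically with the standard library only. The helper names design_effect_sketch and one_plus_cv2 are hypothetical and are not part of balance.

```python
import statistics


def design_effect_sketch(w):
    """Kish's Deff as n * sum(w^2) / sum(w)^2 (same as E[w^2] / E[w]^2)."""
    n = len(w)
    return n * sum(x * x for x in w) / sum(w) ** 2


def one_plus_cv2(w):
    """1 + (population variance of w) / (mean of w)^2."""
    mean_w = statistics.fmean(w)
    return 1 + statistics.pvariance(w) / mean_w**2


weights = [1, 1, 1000]
deff = design_effect_sketch(weights)  # roughly 2.988, as in the example above
alt = one_plus_cv2(weights)           # agrees with deff up to rounding

# Rescaling all weights by a constant leaves Deff unchanged.
rescaled = design_effect_sketch([10 * x for x in weights])
```

Since the variance term is non-negative, the result is at least 1, with equality exactly when all weights are identical.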

balance.stats_and_plots.weights_stats.kish_deff_stats(w: list[Any] | Series | ndarray[tuple[Any, ...], dtype[_ScalarT]] | DataFrame) → KishStats[source]

Bundle Kish’s design effect, effective sample size, and ESS proportion.

Computes design_effect once and derives ess and essp from it, avoiding three separate Deff computations when all three quantities are needed. Use kish_ess / kish_essp only when you need exactly one number.

Parameters:

w (list[Any] | pd.Series | npt.NDArray | pd.DataFrame) – Weights container with non-negative numeric values. If w is a DataFrame, only the first column is used.

Returns:

Namedtuple with deff, ess, and essp fields.

Return type:

KishStats

Raises:
  • TypeError – If w is not numeric.

  • ValueError – If w is empty, contains negative entries, or contains no positive entries (validated through design_effect with require_positive=True).

Examples

from balance.stats_and_plots.weights_stats import kish_deff_stats
import pandas as pd

stats = kish_deff_stats(pd.Series((1, 1, 1, 1)))
stats.deff   # 1.0
stats.ess    # 4.0
stats.essp   # 1.0

balance.stats_and_plots.weights_stats.kish_ess(w: list[Any] | Series | ndarray[tuple[Any, ...], dtype[_ScalarT]] | DataFrame) → float[source]

Kish’s effective sample size: n / Deff = sum(w)^2 / sum(w^2).

Convenience singleton over design_effect. Prefer kish_deff_stats when you need ess alongside deff or essp — that path computes design_effect once and derives all three.

Parameters:

w (list[Any] | pd.Series | npt.NDArray | pd.DataFrame) – Weights container with non-negative numeric values. If w is a DataFrame, only the first column is used.

Returns:

ESS in the same units as len(w). Equals len(w) when all weights are equal.

Return type:

float

Raises:
  • TypeError – If w is not numeric.

  • ValueError – If w is empty, contains negative entries, or contains no positive entries.

Examples

from balance.stats_and_plots.weights_stats import kish_ess
import pandas as pd

kish_ess(pd.Series((1, 1, 1, 1)))   # 4.0
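
The two forms given for the effective sample size, n / Deff and sum(w)^2 / sum(w^2), are algebraically the same. The sketch below compares them on plain lists. Both helper names are hypothetical, and the input is assumed to contain at least one positive weight.

```python
def kish_ess_direct(w):
    """ESS as sum(w)^2 / sum(w^2)."""
    return sum(w) ** 2 / sum(x * x for x in w)


def kish_ess_via_deff(w):
    """ESS as n / Deff, with Deff = n * sum(w^2) / sum(w)^2."""
    n = len(w)
    deff = n * sum(x * x for x in w) / sum(w) ** 2
    return n / deff


equal = kish_ess_direct([1, 1, 1, 1])   # 4.0: equal weights keep the full sample
skewed = kish_ess_direct([1, 1, 1000])  # just above 1: one unit dominates
```

For non-negative weights the value lies between 1 and len(w), so it can be read as the number of equally weighted units that would give the same precision.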

balance.stats_and_plots.weights_stats.kish_essp(w: list[Any] | Series | ndarray[tuple[Any, ...], dtype[_ScalarT]] | DataFrame) → float[source]

Kish’s effective sample proportion: 1 / Deff (always in [0, 1]).

Convenience singleton over design_effect. Prefer kish_deff_stats when you need essp alongside deff or ess.

Parameters:

w (list[Any] | pd.Series | npt.NDArray | pd.DataFrame) – Weights container with non-negative numeric values. If w is a DataFrame, only the first column is used.

Returns:

A value in [0, 1]. Equals 1.0 when all weights are equal.

Return type:

float

Raises:
  • TypeError – If w is not numeric.

  • ValueError – If w is empty, contains negative entries, or contains no positive entries.

Examples

from balance.stats_and_plots.weights_stats import kish_essp
import pandas as pd

kish_essp(pd.Series((1, 1, 1, 1)))   # 1.0

balance.stats_and_plots.weights_stats.nonparametric_skew(w: list[Any] | Series | ndarray[tuple[Any, ...], dtype[_ScalarT]] | DataFrame) → float[source]

The nonparametric skew is the difference between the mean and the median, divided by the standard deviation. See: https://en.wikipedia.org/wiki/Nonparametric_skew

Parameters:

w (list[Any] | pd.Series | npt.NDArray | pd.DataFrame) – Weights container with non-negative numeric values. If w is a DataFrame, only the first column is used.

Returns:

The skew value, between -1 and 1. For weights it is often positive (i.e., a right-tailed distribution). Returns 0 if the standard deviation is 0 (i.e., all values are identical) or if the input has length 1.

Return type:

np.float64

Examples

from balance.stats_and_plots.weights_stats import nonparametric_skew
import pandas as pd

nonparametric_skew(pd.Series((1, 1, 1, 1)))   # 0
nonparametric_skew(pd.Series((1,)))           # 0
nonparametric_skew(pd.Series((1, 2, 3, 4)))   # 0
nonparametric_skew(pd.Series((1, 1, 1, 2)))   # 0.5
nonparametric_skew(pd.Series((-1, 1, 1, 1)))  # -0.5
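
The definition (mean minus median, divided by the standard deviation) can be sketched with the standard library. The name nonparametric_skew_sketch is hypothetical, and the use of the sample standard deviation (ddof=1) is an assumption inferred from the documented outputs above, not a statement about the library's internals.

```python
import statistics


def nonparametric_skew_sketch(w):
    """(mean - median) / sample standard deviation, with 0 for degenerate input."""
    if len(w) < 2:
        return 0.0  # a single value has no spread
    sd = statistics.stdev(w)  # assumption: sample std (ddof=1)
    if sd == 0:
        return 0.0  # all values identical
    return (statistics.fmean(w) - statistics.median(w)) / sd


symmetric = nonparametric_skew_sketch([1, 2, 3, 4])     # mean equals median
right_tail = nonparametric_skew_sketch([1, 1, 1, 2])    # mean above median
left_tail = nonparametric_skew_sketch([-1, 1, 1, 1])    # mean below median
```

A positive value means the mean sits above the median, which is the typical pattern when a few units carry large weights.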

balance.stats_and_plots.weights_stats.prop_above_and_below(w: list[Any] | Series | ndarray[tuple[Any, ...], dtype[_ScalarT]] | DataFrame, below: tuple[float, ...] | list[float] | None = (1 / 10, 1 / 5, 1 / 3, 1 / 2, 1), above: tuple[float, ...] | list[float] | None = (1, 2, 3, 5, 10), return_as_series: Literal[True] = True) → Series | None[source]
balance.stats_and_plots.weights_stats.prop_above_and_below(w: list[Any] | Series | ndarray[tuple[Any, ...], dtype[_ScalarT]] | DataFrame, below: tuple[float, ...] | list[float] | None = (1 / 10, 1 / 5, 1 / 3, 1 / 2, 1), above: tuple[float, ...] | list[float] | None = (1, 2, 3, 5, 10), *, return_as_series: Literal[False]) → PropAboveBelowResult | None

The proportion of weights, normalized to sample size (i.e., divided by their mean), that are above and below some numbers (e.g., 1, 2, 3, 5, 10 and their inverses: 1, 1/2, 1/3, etc.). This is similar to returning percentiles of the (normalized) weight distribution, but instead of focusing on the 25th percentile, the median, etc., we focus on more easily interpretable weight values.

For example, the proportion of users with a weight above 1 indicates how many users do not “lose” value once the weights are applied. The proportion of users with a weight below 1/10 tells us how many users contribute almost nothing to the final analysis (after applying the weights).

Note that below and above can overlap, be unordered, etc. The user is responsible for the order.

Parameters:
  • w (list[Any] | pd.Series | npt.NDArray | pd.DataFrame) – Weights container with non-negative numeric values. If w is a DataFrame, only the first column is used.

  • below (tuple[float, ...] | list[float] | None, optional) – Thresholds for which to compute the proportion of normalized weights that fall below them. Using None omits below-threshold calculations. Defaults to (1/10, 1/5, 1/3, 1/2, 1).

  • above (tuple[float, ...] | list[float] | None, optional) – Thresholds for which to compute the proportion of normalized weights that are greater than or equal to them. Using None omits above-threshold calculations. Defaults to (1, 2, 3, 5, 10).

  • return_as_series (bool, optional) – If True, returns one pd.Series of values. If False, returns a PropAboveBelowResult with below/above entries containing a pd.Series or None for omitted groups. Defaults to True.

Returns:

If return_as_series is True, returns a pd.Series with the proportions of normalized weights that are below/above the given thresholds; the index indicates which threshold was checked (index values are rounded to 3 decimal places for printing purposes). If return_as_series is False, returns a PropAboveBelowResult with below and above keys whose values are the relevant pd.Series (or None when a side is omitted). If both below and above are None, the function returns None.

Return type:

pd.Series | PropAboveBelowResult | None

Examples

from balance.stats_and_plots.weights_stats import prop_above_and_below
import pandas as pd

# normalized weights:
print(pd.Series((1, 2, 3, 4)) / pd.Series((1, 2, 3, 4)).mean())
    # 0    0.4
    # 1    0.8
    # 2    1.2
    # 3    1.6

# checking the function:
prop_above_and_below(pd.Series((1, 2, 3, 4)))
    # prop(w < 0.1)      0.00
    # prop(w < 0.2)      0.00
    # prop(w < 0.333)    0.00
    # prop(w < 0.5)      0.25
    # prop(w < 1.0)      0.50
    # prop(w >= 1)       0.50
    # prop(w >= 2)       0.00
    # prop(w >= 3)       0.00
    # prop(w >= 5)       0.00
    # prop(w >= 10)      0.00
    # dtype: float64

prop_above_and_below(pd.Series((1, 2, 3, 4)), below = (0.1, 0.5), above = (2,3))
    # prop(w < 0.1)    0.00
    # prop(w < 0.5)    0.25
    # prop(w >= 2)     0.00
    # prop(w >= 3)     0.00
    # dtype: float64

prop_above_and_below(pd.Series((1, 2, 3, 4)), return_as_series = False)
    # {'below': prop(w < 0.1)      0.00
    # prop(w < 0.2)      0.00
    # prop(w < 0.333)    0.00
    # prop(w < 0.5)      0.25
    # prop(w < 1)        0.50
    # dtype: float64, 'above': prop(w >= 1)     0.5
    # prop(w >= 2)     0.0
    # prop(w >= 3)     0.0
    # prop(w >= 5)     0.0
    # prop(w >= 10)    0.0
    # dtype: float64}
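
The computation behind these tables can be sketched in plain Python: divide each weight by the mean, then count the share of normalized weights strictly below, or at or above, each threshold. The name prop_sketch and its dictionary labels are hypothetical. They imitate the documented index labels, but their exact formatting is not taken from the library.

```python
def prop_sketch(w, below=(0.5, 1), above=(1, 2)):
    """Share of mean-normalized weights below / at-or-above each threshold."""
    n = len(w)
    mean_w = sum(w) / n
    normalized = [x / mean_w for x in w]  # normalized weights sum to n
    result = {}
    for t in below:
        result[f"prop(w < {t})"] = sum(x < t for x in normalized) / n
    for t in above:
        result[f"prop(w >= {t})"] = sum(x >= t for x in normalized) / n
    return result


props = prop_sketch([1, 2, 3, 4])
# normalized weights are 0.4, 0.8, 1.2, 1.6, so:
# prop(w < 0.5) = 0.25, prop(w < 1) = 0.5, prop(w >= 1) = 0.5, prop(w >= 2) = 0.0
```

Because below uses a strict inequality and above includes equality, the proportions for the same threshold (here 1) always sum to 1.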

balance.stats_and_plots.weights_stats.weighted_median_breakdown_point(w: list[Any] | Series | ndarray[tuple[Any, ...], dtype[_ScalarT]] | DataFrame) → float64[source]

Calculates the minimal proportion of users that hold at least 50% of the total weight. This gives the breakdown point of the weighted median, and can be thought of as a metric similar in spirit to the design effect.

See also:

  • https://en.wikipedia.org/wiki/Weighted_median

  • https://en.wikipedia.org/wiki/Robust_statistics#Breakdown_point

Parameters:

w (list[Any] | pd.Series | npt.NDArray | pd.DataFrame) – Weights container with non-negative numeric values. If w is a DataFrame, only the first column is used.

Returns:

The minimal proportion of users that hold at least 50% of the total weight.

Return type:

np.float64

Examples

from balance.stats_and_plots.weights_stats import weighted_median_breakdown_point
import pandas as pd

w = pd.Series([1, 1, 1, 1])
print(weighted_median_breakdown_point(w))  # 0.5

w = pd.Series([2, 2, 2, 2])
print(weighted_median_breakdown_point(w))  # 0.5

w = pd.Series([1, 1, 1, 10])
print(weighted_median_breakdown_point(w))  # 0.25

w = pd.Series([1, 1, 1, 1, 10])
print(weighted_median_breakdown_point(w))  # 0.2
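
The outputs above are consistent with a simple procedure: sort the weights from largest to smallest, accumulate them, and report the share of units needed to reach half of the total weight. The sketch below follows that reading. The name breakdown_point_sketch is hypothetical, and the procedure is an assumption based on the description and examples, not a copy of the library code.

```python
from itertools import accumulate


def breakdown_point_sketch(w):
    """Smallest share of units whose combined weight is at least half the total."""
    total = sum(w)
    heaviest_first = sorted(w, reverse=True)
    for k, running in enumerate(accumulate(heaviest_first), start=1):
        if running >= total / 2:
            return k / len(w)
    raise ValueError("w must contain at least one weight")


even = breakdown_point_sketch([1, 1, 1, 1])          # 0.5: two of four units
one_heavy = breakdown_point_sketch([1, 1, 1, 10])    # 0.25: one of four units
five_units = breakdown_point_sketch([1, 1, 1, 1, 10])  # 0.2: one of five units
```

A small value signals that a handful of heavily weighted units can move the weighted median, much as a large design effect signals an unstable weighted mean.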