balance.stats_and_plots.weights_stats
- class balance.stats_and_plots.weights_stats.KishStats(deff: float, ess: float, essp: float)
Kish design-effect diagnostic bundle.
All three members share a single design_effect computation. The identities are:
- deff = E[w^2] / E[w]^2 (>= 1 for non-degenerate weights)
- ess = n / deff (effective sample size)
- essp = 1 / deff (effective sample proportion in [0, 1])
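The three identities above can be sketched in plain Python. This is only an illustration of the formulas, not the library's implementation; the helper name kish_identities is hypothetical.

```python
def kish_identities(w):
    """Hypothetical helper (not part of balance): returns (deff, ess, essp)."""
    w = [float(x) for x in w]
    n = len(w)
    mean_w = sum(w) / n                    # E[w]
    mean_w2 = sum(x * x for x in w) / n    # E[w^2]
    deff = mean_w2 / mean_w ** 2           # deff = E[w^2] / E[w]^2
    ess = n / deff                         # effective sample size
    essp = 1.0 / deff                      # effective sample proportion
    return deff, ess, essp


deff, ess, essp = kish_identities([0, 1, 2, 3])
print(deff, ess, essp)
```

Because ess and essp are both derived from deff, computing deff once is enough to obtain all three values.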
- deff
Kish’s design effect.
- Type:
float
- ess
Kish’s effective sample size.
- Type:
float
- essp
Kish’s effective sample proportion.
- Type:
float
- deff: float
Alias for field number 0
- ess: float
Alias for field number 1
- essp: float
Alias for field number 2
- balance.stats_and_plots.weights_stats.design_effect(w: list[Any] | Series | ndarray[tuple[Any, ...], dtype[_ScalarT]] | DataFrame) -> float64
Kish’s design effect measure.
The design effect is a number that shows how well a sample of people may represent a larger group of people for a specific measure of interest (such as the mean). Kish’s design effect gives the increase in the variance of the weighted mean based on “haphazard” weights.
The inverse of the design effect is the effective sample size ratio.
Design effect in general can be lower than 1 for stratified sampling. However, when calculating Kish’s design effect for weights the design effect is always 1 or larger.
For details, see: Tal Galili (5 May 2024). “Design effect”. WikiJournal of Science 7 (1): 4. doi:10.15347/wjs/2024.004. Wikidata Q116768211. ISSN 2470-6345. https://en.wikipedia.org/wiki/Design_effect
- Parameters:
w (list[Any] | pd.Series | npt.NDArray | pd.DataFrame) – Weights container with non-negative numeric values. If w is a DataFrame, only the first column is used.
- Returns:
An estimator saying by how much the variance of the mean is expected to increase, compared to a random sample mean, due to application of the weights.
- Return type:
np.float64
Examples
    from balance.stats_and_plots.weights_stats import design_effect
    import pandas as pd

    design_effect(pd.Series((0, 1, 2, 3)))
    # output:
    # 1.5555555555555556

    design_effect(pd.Series((1, 1, 1000)))
    # 2.9880418803112336
    # As expected. With a single dominating weight - the Deff is almost equal to the sample size.
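The remark about a single dominating weight can be checked with a small plain-Python sketch. It assumes the equivalent closed form n * sum(w^2) / sum(w)^2 for Kish's design effect; the helper name deff_closed_form is hypothetical and is not the library's code.

```python
def deff_closed_form(w):
    """Hypothetical helper: n * sum(w^2) / sum(w)^2, i.e. E[w^2] / E[w]^2."""
    w = [float(x) for x in w]
    n = len(w)
    return n * sum(x * x for x in w) / sum(w) ** 2


# As the third weight grows, Deff approaches (but stays below) n = 3.
for big in (10, 1000, 1e6):
    print(big, deff_closed_form([1, 1, big]))
```

With three observations the design effect can never exceed 3, so a value close to 3 signals that one observation carries nearly all of the weight.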
- balance.stats_and_plots.weights_stats.kish_deff_stats(w: list[Any] | Series | ndarray[tuple[Any, ...], dtype[_ScalarT]] | DataFrame) -> KishStats
Bundle Kish’s design effect, effective sample size, and ESS proportion.
Computes design_effect once and derives ess and essp from it, avoiding three separate Deff computations when all three quantities are needed. Use kish_ess / kish_essp only when you need exactly one number.
- Parameters:
w (list[Any] | pd.Series | npt.NDArray | pd.DataFrame) – Weights container with non-negative numeric values. If w is a DataFrame, only the first column is used.
- Returns:
Namedtuple with deff, ess, and essp fields.
- Return type:
KishStats
- Raises:
TypeError – If w is not numeric.
ValueError – If w is empty, contains negative entries, or contains no positive entries (validated through design_effect with require_positive=True).
Examples
    from balance.stats_and_plots.weights_stats import kish_deff_stats
    import pandas as pd

    stats = kish_deff_stats(pd.Series((1, 1, 1, 1)))
    stats.deff  # 1.0
    stats.ess   # 4.0
    stats.essp  # 1.0
- balance.stats_and_plots.weights_stats.kish_ess(w: list[Any] | Series | ndarray[tuple[Any, ...], dtype[_ScalarT]] | DataFrame) -> float
Kish’s effective sample size: n / Deff = sum(w)^2 / sum(w^2).
Single-value convenience wrapper over design_effect. Prefer kish_deff_stats when you need ess alongside deff or essp; that path computes design_effect once and derives all three.
- Parameters:
w (list[Any] | pd.Series | npt.NDArray | pd.DataFrame) – Weights container with non-negative numeric values. If w is a DataFrame, only the first column is used.
- Returns:
ESS in the same units as len(w). Equals len(w) when all weights are equal.
- Return type:
float
- Raises:
TypeError – If w is not numeric.
ValueError – If w is empty, contains negative entries, or contains no positive entries.
Examples
    from balance.stats_and_plots.weights_stats import kish_ess
    import pandas as pd

    kish_ess(pd.Series((1, 1, 1, 1)))  # 4.0
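The identity sum(w)^2 / sum(w^2) implies that the effective sample size does not depend on the scale of the weights. The following plain-Python sketch illustrates this; ess_direct is a hypothetical helper, not the library function.

```python
def ess_direct(w):
    """Hypothetical helper: Kish's ESS computed as sum(w)^2 / sum(w^2)."""
    w = [float(x) for x in w]
    return sum(w) ** 2 / sum(x * x for x in w)


w = [1, 2, 3, 4]
print(ess_direct(w))                     # 100 / 30
print(ess_direct([10 * x for x in w]))   # unchanged by rescaling the weights
print(ess_direct([1, 1, 1, 1]))          # equals len(w) for equal weights
```

Scale invariance means the weights do not need to be normalized before the ESS is computed.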
- balance.stats_and_plots.weights_stats.kish_essp(w: list[Any] | Series | ndarray[tuple[Any, ...], dtype[_ScalarT]] | DataFrame) -> float
Kish’s effective sample proportion: 1 / Deff (always in [0, 1]).
Single-value convenience wrapper over design_effect. Prefer kish_deff_stats when you need essp alongside deff or ess.
- Parameters:
w (list[Any] | pd.Series | npt.NDArray | pd.DataFrame) – Weights container with non-negative numeric values. If w is a DataFrame, only the first column is used.
- Returns:
A value in [0, 1]. Equals 1.0 when all weights are equal.
- Return type:
float
- Raises:
TypeError – If w is not numeric.
ValueError – If w is empty, contains negative entries, or contains no positive entries.
Examples
    from balance.stats_and_plots.weights_stats import kish_essp
    import pandas as pd

    kish_essp(pd.Series((1, 1, 1, 1)))  # 1.0
- balance.stats_and_plots.weights_stats.nonparametric_skew(w: list[Any] | Series | ndarray[tuple[Any, ...], dtype[_ScalarT]] | DataFrame) -> float
The nonparametric skew is the difference between the mean and the median, divided by the standard deviation. See: https://en.wikipedia.org/wiki/Nonparametric_skew
- Parameters:
w (list[Any] | pd.Series | npt.NDArray | pd.DataFrame) – Weights container with non-negative numeric values. If w is a DataFrame, only the first column is used.
- Returns:
A skew value between -1 and 1. For weights it is often positive (i.e., a right-tailed distribution). The returned value is 0 if the standard deviation is 0 (i.e., all values are identical) or if the input has length 1.
- Return type:
np.float64
Examples
    from balance.stats_and_plots.weights_stats import nonparametric_skew
    import pandas as pd

    nonparametric_skew(pd.Series((1, 1, 1, 1)))   # 0
    nonparametric_skew(pd.Series((1,)))           # 0
    nonparametric_skew(pd.Series((1, 2, 3, 4)))   # 0
    nonparametric_skew(pd.Series((1, 1, 1, 2)))   # 0.5
    nonparametric_skew(pd.Series((-1, 1, 1, 1)))  # -0.5
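A minimal standard-library sketch of the definition, (mean - median) / standard deviation, is shown here. It assumes the sample standard deviation (n - 1 denominator), which is what reproduces the 0.5 and -0.5 values in the examples above; nonparametric_skew_sketch is a hypothetical name, not the library function.

```python
from statistics import mean, median, stdev


def nonparametric_skew_sketch(w):
    """Hypothetical helper: (mean - median) / sample standard deviation."""
    w = [float(x) for x in w]
    if len(w) < 2:
        return 0.0          # a single value has no spread
    sd = stdev(w)           # assumption: n - 1 denominator
    if sd == 0:
        return 0.0          # all values identical
    return (mean(w) - median(w)) / sd


print(nonparametric_skew_sketch([1, 1, 1, 2]))
print(nonparametric_skew_sketch([-1, 1, 1, 1]))
```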
- balance.stats_and_plots.weights_stats.prop_above_and_below(w: list[Any] | Series | ndarray[tuple[Any, ...], dtype[_ScalarT]] | DataFrame, below: tuple[float, ...] | list[float] | None = (1 / 10, 1 / 5, 1 / 3, 1 / 2, 1), above: tuple[float, ...] | list[float] | None = (1, 2, 3, 5, 10), return_as_series: Literal[True] = True) -> Series | None
- balance.stats_and_plots.weights_stats.prop_above_and_below(w: list[Any] | Series | ndarray[tuple[Any, ...], dtype[_ScalarT]] | DataFrame, below: tuple[float, ...] | list[float] | None = (1 / 10, 1 / 5, 1 / 3, 1 / 2, 1), above: tuple[float, ...] | list[float] | None = (1, 2, 3, 5, 10), *, return_as_series: Literal[False]) -> PropAboveBelowResult | None
The proportion of weights, normalized to sample size, that are above and below some numbers (e.g. 1, 2, 3, 5, 10 and their inverses: 1, 1/2, 1/3, etc.). This is similar to returning percentiles of the distribution of (normalized) weights, but instead of focusing on the 25th percentile, the median, etc., we focus on more easily interpretable weight values.
For example, the proportion of users with a weight above 1 indicates how many users do not “lose” value once the weights are applied. The proportion of users with a weight below 1/10 tells us how many users contribute almost nothing to the final analysis (after applying the weights).
Note that below and above can overlap, be unordered, etc. The user is responsible for the order.
- Parameters:
w (list[Any] | pd.Series | npt.NDArray | pd.DataFrame) – Weights container with non-negative numeric values. If w is a DataFrame, only the first column is used.
below (tuple[float, ...] | list[float] | None, optional) – Thresholds for which to compute the proportion of normalized weights below them. Using None omits below-threshold calculations. Defaults to (1/10, 1/5, 1/3, 1/2, 1).
above (tuple[float, ...] | list[float] | None, optional) – Thresholds for which to compute the proportion of normalized weights above or equal to them. Using None omits above-threshold calculations. Defaults to (1, 2, 3, 5, 10).
return_as_series (bool, optional) – If True, returns one pd.Series of values. If False, returns PropAboveBelowResult with below / above entries containing a pd.Series or None for omitted groups. Defaults to True.
- Returns:
If return_as_series is True, a pd.Series with the proportions of normalized weights that are below/above the given thresholds; the index indicates which threshold was checked (index values are rounded to 3 decimal places for printing purposes). If return_as_series is False, a PropAboveBelowResult with below and above keys whose values are the relevant pd.Series (or None when a side is omitted). If both below and above are None, the function returns None.
- Return type:
pd.Series | PropAboveBelowResult | None
Examples
    from balance.stats_and_plots.weights_stats import prop_above_and_below
    import pandas as pd

    # normalized weights:
    print(pd.Series((1, 2, 3, 4)) / pd.Series((1, 2, 3, 4)).mean())
    # 0    0.4
    # 1    0.8
    # 2    1.2
    # 3    1.6

    # checking the function:
    prop_above_and_below(pd.Series((1, 2, 3, 4)))
    # prop(w < 0.1)      0.00
    # prop(w < 0.2)      0.00
    # prop(w < 0.333)    0.00
    # prop(w < 0.5)      0.25
    # prop(w < 1.0)      0.50
    # prop(w >= 1)       0.50
    # prop(w >= 2)       0.00
    # prop(w >= 3)       0.00
    # prop(w >= 5)       0.00
    # prop(w >= 10)      0.00
    # dtype: float64

    prop_above_and_below(pd.Series((1, 2, 3, 4)), below=(0.1, 0.5), above=(2, 3))
    # prop(w < 0.1)    0.00
    # prop(w < 0.5)    0.25
    # prop(w >= 2)     0.00
    # prop(w >= 3)     0.00
    # dtype: float64

    prop_above_and_below(pd.Series((1, 2, 3, 4)), return_as_series=False)
    # {'below': prop(w < 0.1)      0.00
    # prop(w < 0.2)      0.00
    # prop(w < 0.333)    0.00
    # prop(w < 0.5)      0.25
    # prop(w < 1)        0.50
    # dtype: float64, 'above': prop(w >= 1)     0.5
    # prop(w >= 2)     0.0
    # prop(w >= 3)     0.0
    # prop(w >= 5)     0.0
    # prop(w >= 10)    0.0
    # dtype: float64}
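The computation behind these proportions can be sketched in plain Python: divide each weight by the mean weight, then count the share of normalized weights strictly below each "below" threshold and at or above each "above" threshold. This is a simplified illustration that returns a dict rather than a pd.Series; prop_above_and_below_sketch is a hypothetical name and the label format only approximates the library's output.

```python
def prop_above_and_below_sketch(
    w,
    below=(1 / 10, 1 / 5, 1 / 3, 1 / 2, 1),
    above=(1, 2, 3, 5, 10),
):
    """Hypothetical helper: proportions of mean-normalized weights per threshold."""
    w = [float(x) for x in w]
    n = len(w)
    mean_w = sum(w) / n
    normalized = [x / mean_w for x in w]  # normalized weights have mean 1
    result = {}
    for t in below or ():
        # strict inequality for the "below" side
        result[f"prop(w < {round(t, 3)})"] = sum(x < t for x in normalized) / n
    for t in above or ():
        # "above" includes equality
        result[f"prop(w >= {round(t, 3)})"] = sum(x >= t for x in normalized) / n
    return result


for label, value in prop_above_and_below_sketch([1, 2, 3, 4]).items():
    print(label, value)
```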
- balance.stats_and_plots.weights_stats.weighted_median_breakdown_point(w: list[Any] | Series | ndarray[tuple[Any, ...], dtype[_ScalarT]] | DataFrame) -> float64
Calculates the minimal proportion of users that together hold at least 50% of the total weight. This gives the breakdown point of the weighted median, and can be thought of as a metric similar in spirit to the design effect. See also:
- https://en.wikipedia.org/wiki/Weighted_median
- https://en.wikipedia.org/wiki/Robust_statistics#Breakdown_point
- Parameters:
w (list[Any] | pd.Series | npt.NDArray | pd.DataFrame) – Weights container with non-negative numeric values. If w is a DataFrame, only the first column is used.
- Returns:
The minimal proportion of users that together hold at least 50% of the total weight.
- Return type:
np.float64
Examples
    from balance.stats_and_plots.weights_stats import weighted_median_breakdown_point
    import pandas as pd

    w = pd.Series([1, 1, 1, 1])
    print(weighted_median_breakdown_point(w))  # 0.5

    w = pd.Series([2, 2, 2, 2])
    print(weighted_median_breakdown_point(w))  # 0.5

    w = pd.Series([1, 1, 1, 10])
    print(weighted_median_breakdown_point(w))  # 0.25

    w = pd.Series([1, 1, 1, 1, 10])
    print(weighted_median_breakdown_point(w))  # 0.2
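One way to read the definition is: sort the weights from largest to smallest, accumulate them until the running total reaches half of the total weight, and report the number of users needed as a share of the sample. The plain-Python sketch below follows that reading and reproduces the documented values; it is not the library's implementation, and the helper name is hypothetical.

```python
def weighted_median_breakdown_point_sketch(w):
    """Hypothetical helper: smallest share of users holding >= 50% of the weight."""
    w = sorted((float(x) for x in w), reverse=True)  # largest weights first
    if not w:
        raise ValueError("w must not be empty")
    half = sum(w) / 2
    running = 0.0
    for count, x in enumerate(w, start=1):
        running += x
        if running >= half:          # "at least 50%" of the total weight
            return count / len(w)
    return 1.0  # reached only if floating-point rounding prevents the check


print(weighted_median_breakdown_point_sketch([1, 1, 1, 1]))
print(weighted_median_breakdown_point_sketch([1, 1, 1, 10]))
print(weighted_median_breakdown_point_sketch([1, 1, 1, 1, 10]))
```

Taking the largest weights first is what makes the share minimal: any other subset holding half of the weight needs at least as many users.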