balance.stats_and_plots.weighted_stats
- balance.stats_and_plots.weighted_stats.ci_of_weighted_mean(v: list[float] | pd.Series | pd.DataFrame | npt.NDArray, w: list[float] | pd.Series | npt.NDArray | None = None, conf_level: float = 0.95, round_ndigits: int | None = None, inf_rm: bool = False) → pd.Series
Computes the confidence interval of the weighted mean of a list of values and their corresponding weights.
If no weights are supplied, it assumes that all values have equal weights of 1.0.
- Parameters:
v (Union[List[float], pd.Series, pd.DataFrame, np.ndarray]) – A series of values. If v is a DataFrame, the weighted mean and its confidence interval are calculated for each column using the same set of weights from w.
w (Optional[Union[List[float], pd.Series, np.ndarray]]) – A series of weights to be used with v. If None, all values are weighted equally.
conf_level (float, optional) – Confidence level for the interval, between 0 and 1. Defaults to 0.95.
round_ndigits (Optional[int], optional) – Number of decimal places to round the confidence interval to. If None, the values are not rounded. Defaults to None.
inf_rm (bool, optional) – Whether to remove infinite values (from weights or values) from the computation. Defaults to False.
- Returns:
The confidence interval of the weighted mean. If v is a DataFrame with several columns, the pd.Series will have a value for the confidence interval of each column. The values are of data type Tuple[np.float64, np.float64]. If inf_rm is False:
If v has infinite values, the output will be Inf. If w has infinite values, the output will be np.nan.
- Return type:
pd.Series
Examples:
    import pandas as pd
    from balance.stats_and_plots.weighted_stats import ci_of_weighted_mean

    ci_of_weighted_mean(pd.Series((1, 2, 3, 4)))
    # 0    (1.404346824279273, 3.5956531757207273)
    # dtype: object

    ci_of_weighted_mean(pd.Series((1, 2, 3, 4)), round_ndigits=3).to_list()
    # [(1.404, 3.596)]

    ci_of_weighted_mean(pd.Series((1, 2, 3, 4)), pd.Series((1, 2, 3, 4)))
    # 0    (2.039817664728938, 3.960182335271062)
    # dtype: object

    ci_of_weighted_mean(pd.Series((1, 2, 3, 4)), pd.Series((1, 2, 3, 4)), round_ndigits=3).to_list()
    # [(2.04, 3.96)]

    df = pd.DataFrame(
        {"a": [1, 2, 3, 4], "b": [1, 1, 1, 1]}
    )
    w = pd.Series((1, 2, 3, 4))
    ci_of_weighted_mean(df, w, conf_level=0.99, round_ndigits=3)
    # a    (1.738, 4.262)
    # b        (1.0, 1.0)
    # dtype: object

    ci_of_weighted_mean(df, w, conf_level=0.99, round_ndigits=3).to_dict()
    # {'a': (1.738, 4.262), 'b': (1.0, 1.0)}
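The documented outputs above are consistent with a normal-approximation interval: the weighted mean plus or minus a z-quantile times the square root of the variance of the weighted mean. The following is a minimal sketch of that relationship in plain Python, not the library's implementation. The helper name `ci_sketch` and the exact variance form are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist


def ci_sketch(values, weights=None, conf_level=0.95):
    """Hypothetical helper: normal-approximation CI for a weighted mean."""
    if weights is None:
        weights = [1.0] * len(values)
    total_w = sum(weights)
    mean = sum(w * v for v, w in zip(values, weights)) / total_w
    # Assumed variance form: sum(w^2 * (v - mean)^2) / sum(w)^2
    var_of_mean = sum((w * (v - mean)) ** 2 for v, w in zip(values, weights)) / total_w**2
    # Two-sided z-quantile for the requested confidence level
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    half_width = z * sqrt(var_of_mean)
    return (mean - half_width, mean + half_width)


lower, upper = ci_sketch([1, 2, 3, 4], [1, 2, 3, 4])
print(round(lower, 3), round(upper, 3))
```

With equal values and weights of (1, 2, 3, 4), the sketch reproduces the rounded interval (2.04, 3.96) shown in the examples.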
- balance.stats_and_plots.weighted_stats.descriptive_stats(df: pd.DataFrame, weights: list[Any] | pd.Series | npt.NDArray | None = None, stat: str = 'mean', weighted: bool = True, numeric_only: bool = False, add_na: bool = True, formula: str | list[str] | None = None, **kwargs: Any) → pd.DataFrame
Computes weighted statistics (e.g., mean, std) on a DataFrame.
This function takes a DataFrame and weights and applies a weighted aggregation function (mean, std, or DescrStatsW). Its main benefit is that if the DataFrame includes non-numeric columns, descriptive_stats first runs model_matrix() to create numeric dummy variables, which are then processed.
- Parameters:
df (pd.DataFrame) – The DataFrame to compute stats (mean, std, etc.) for.
weights (Union[List, pd.Series, np.ndarray], optional) – Weights to apply for the computation. Defaults to None.
stat (str, optional) – Which statistic to calculate on the data. Defaults to “mean”.
If mean, uses weighted_mean() (with inf_rm=True).
If std, uses weighted_sd() (with inf_rm=True).
If var_of_mean, uses var_of_weighted_mean() (with inf_rm=True).
If ci_of_mean, uses ci_of_weighted_mean() (with inf_rm=True).
If something else, tries to use statsmodels.stats.weightstats.DescrStatsW() (while removing mutual nan using rm_mutual_nas()). This supports stat values such as std_mean, sum_weights, nobs, etc. See that function's documentation for more.
weighted (bool, optional) – When stat is not “mean” or “std”, whether to use the weights with the DescrStatsW function. Defaults to True.
numeric_only (bool, optional) – Should the statistics be computed only on numeric columns? If True, non-numeric columns are omitted. If False, model_matrix() (with no formula argument) is used to transform non-numeric columns into dummy variables. Defaults to False.
add_na (bool, optional) – Passed to model_matrix(). Relevant only if numeric_only == False and df has non-numeric columns. Defaults to True.
formula (Optional[Union[str, List[str]]], optional) – Formula passed to model_matrix(). When provided, the formula is always applied, allowing customization of which columns/dummies are used in the statistics (and taking precedence over numeric_only). Defaults to None.
**kwargs – Extra arguments to be passed to the underlying functions (e.g., ci_of_weighted_mean).
- Returns:
Returns pd.DataFrame of the output (based on stat argument), for each of the columns in df.
- Return type:
pd.DataFrame
Examples:
    import numpy as np
    import pandas as pd
    from balance.stats_and_plots.weighted_stats import descriptive_stats, weighted_mean, weighted_sd

    # Without weights
    x = [1, 2, 3, 4]
    print(descriptive_stats(pd.DataFrame(x), stat="mean"))
    print(np.mean(x))
    print(weighted_mean(x))
    #      0
    # 0  2.5
    # 2.5
    # 0    2.5
    # dtype: float64

    print(descriptive_stats(pd.DataFrame(x), stat="var_of_mean"))
    #         0
    # 0  0.3125

    print(descriptive_stats(pd.DataFrame(x), stat="std"))
    print(weighted_sd(x))
    x2 = pd.Series(x)
    print(np.sqrt(np.sum((x2 - x2.mean()) ** 2) / (len(x) - 1)))
    #           0
    # 0  1.290994
    # 0    1.290994
    # dtype: float64
    # 1.2909944487358056
    # Notice that this differs from print(np.std(x)), which gives 1.118033988749895.
    # That is the MLE (i.e., biased, dividing by n and not n-1) estimator for std:
    # np.sqrt(np.sum((x2 - x2.mean()) ** 2) / len(x))

    x2 = pd.Series(x)
    tmp_sd = np.sqrt(np.sum((x2 - x2.mean()) ** 2) / (len(x) - 1))
    tmp_se = tmp_sd / np.sqrt(len(x))
    print(descriptive_stats(pd.DataFrame(x), stat="std_mean").iloc[0, 0])
    print(tmp_se)
    # 0.6454972243679029
    # 0.6454972243679028
    # Weighted results
    x, w = [1, 2, 3, 4], [1, 2, 3, 4]
    print(descriptive_stats(pd.DataFrame(x), w, stat="mean"))
    #      0
    # 0  3.0

    print(descriptive_stats(pd.DataFrame(x), w, stat="std"))
    #           0
    # 0  1.195229

    print(descriptive_stats(pd.DataFrame(x), w, stat="std_mean"))
    #           0
    # 0  0.333333

    print(descriptive_stats(pd.DataFrame(x), w, stat="var_of_mean"))
    #       0
    # 0  0.24

    print(descriptive_stats(pd.DataFrame(x), w, stat="ci_of_mean", conf_level=0.99, round_ndigits=3))
    #                 0
    # 0  (1.738, 4.262)
    # Formula examples
    df = pd.DataFrame({"num": [1, 2, 3], "group": ["a", "b", "a"]})
    # Default formula includes all variables and encodes categoricals.
    print(descriptive_stats(df, stat="mean"))
    #    group[a]  group[b]  num
    # 0      0.67      0.33  2.0

    print(descriptive_stats(df, stat="mean", formula="num"))
    #    num
    # 0  2.0

    print(descriptive_stats(df, stat="mean", formula="group"))
    #    group[a]  group[b]
    # 0      0.67      0.33

    print(descriptive_stats(df, stat="mean", formula="num + group"))
    #    group[a]  group[b]  num
    # 0      0.67      0.33  2.0

    print(descriptive_stats(df, stat="mean", formula="num + group + num:group"))
    #    group[a]  group[b]  num  num:group[T.b]
    # 0      0.67      0.33  2.0            0.67
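The comments in the unweighted example distinguish the sample standard deviation (dividing by n-1) from the biased estimator (dividing by n). As a small cross-check that does not use balance at all, Python's standard `statistics` module exposes both estimators. The variable names below are illustrative only.

```python
from math import sqrt
from statistics import pstdev, stdev

x = [1, 2, 3, 4]

sample_sd = stdev(x)       # divides by n - 1, matching the unweighted stat="std" output
population_sd = pstdev(x)  # divides by n, matching the np.std(x) value quoted above
standard_error = sample_sd / sqrt(len(x))  # matches the unweighted std_mean output

print(sample_sd, population_sd, standard_error)
```

The three values agree with 1.2909944487358056, 1.118033988749895 and 0.6454972243679028 from the examples, up to floating-point rounding.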
- balance.stats_and_plots.weighted_stats.relative_frequency_table(df: pd.DataFrame | pd.Series, column: str | None = None, w: pd.Series | None = None) → pd.DataFrame
Creates a relative frequency table by aggregating over a categorical column (column) - optionally weighted by w. I.e.: produce the proportion (or weighted proportion) of rows that appear in each category, relative to the total number of rows (or sum of weights). See: https://en.wikipedia.org/wiki/Frequency_(statistics)#Types.
Used in plotting functions.
- Parameters:
df (pd.DataFrame) – A DataFrame with categorical columns, or a pd.Series of the grouping column.
column (Optional[str]) – The name of the column to be aggregated. If None (default), the first column of df is used (if df is a pd.DataFrame), or df is used as is (if it is a pd.Series).
w (Optional[pd.Series], optional) – Optional weights to use when aggregating the relative proportions. If None, a weight of 1 is assumed for all rows. Defaults to None.
- Returns:
- A pd.DataFrame with columns:
column, the aggregation variable, and
'prop', the aggregated (weighted) proportion of rows from each group in column.
- Return type:
pd.DataFrame
Examples:
    from balance.stats_and_plots.weighted_stats import relative_frequency_table
    import pandas as pd

    df = pd.DataFrame({
        'group': ('a', 'b', 'c', 'c'),
        'v1': (1, 2, 3, 4),
    })
    print(relative_frequency_table(df, 'group'))
    #   group  prop
    # 0     a  0.25
    # 1     b  0.25
    # 2     c  0.50

    print(relative_frequency_table(df, 'group', pd.Series((2, 1, 1, 1),)))
    #   group  prop
    # 0     a   0.4
    # 1     b   0.2
    # 2     c   0.4

    # Using a pd.Series:
    a_series = df['group']
    print(relative_frequency_table(a_series))
    #   group  prop
    # 0     a  0.25
    # 1     b  0.25
    # 2     c  0.50
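To make the aggregation concrete, here is a minimal plain-Python sketch of a (weighted) relative frequency: sum the weights within each category and divide by the total weight. The helper `relative_frequency_sketch` is hypothetical and returns a dict instead of a DataFrame. It illustrates the idea only, not how balance implements it.

```python
from collections import defaultdict


def relative_frequency_sketch(categories, weights=None):
    """Hypothetical helper: (weighted) share of each category."""
    if weights is None:
        weights = [1.0] * len(categories)
    totals = defaultdict(float)
    for category, weight in zip(categories, weights):
        totals[category] += weight  # accumulate weight per category
    grand_total = sum(totals.values())
    return {category: total / grand_total for category, total in sorted(totals.items())}


groups = ["a", "b", "c", "c"]
unweighted = relative_frequency_sketch(groups)
weighted = relative_frequency_sketch(groups, [2, 1, 1, 1])
print(unweighted)
print(weighted)
```

The two results correspond to the prop columns in the first two examples: (0.25, 0.25, 0.50) without weights and (0.4, 0.2, 0.4) with weights (2, 1, 1, 1).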
- balance.stats_and_plots.weighted_stats.var_of_weighted_mean(v: list[Any] | pd.Series | pd.DataFrame | np.matrix, w: list[Any] | pd.Series | npt.NDArray | None = None, inf_rm: bool = False) → pd.Series
Computes the variance of the weighted average (pi estimator for ratio-mean) of a list of values and their corresponding weights.
If no weights are supplied, it assumes that all values have equal weights of 1.0.
Uses _prepare_weighted_stat_args().
- Parameters:
v (Union[List, pd.Series, pd.DataFrame, np.matrix]) – A series of values. If v is a DataFrame, the weighted variance is calculated for each column using the same set of weights from w.
w (Optional[Union[List, pd.Series, np.ndarray]]) – A series of weights to be used with v. If None, all values are weighted equally.
inf_rm (bool, optional) – Whether to remove infinite values (from weights or values) from the computation. Defaults to False.
- Returns:
The variance of the weighted mean. If v is a DataFrame with several columns, the pd.Series will have a value for the weighted variance of each column. The values are of data type np.float64. If inf_rm is False:
If v has infinite values, the output will be Inf. If w has infinite values, the output will be np.nan.
- Return type:
pd.Series
Examples:
    import pandas as pd
    from balance.stats_and_plots.weighted_stats import var_of_weighted_mean

    # In R: sum((1:4 - mean(1:4))^2 / 4) / (4)
    # [1] 0.3125
    var_of_weighted_mean(pd.Series((1, 2, 3, 4)))
    # pd.Series(0.3125)

    # For a reproducible R example, see: https://gist.github.com/talgalili/b92cd8cdcbfc287e331a8f27db265c00
    var_of_weighted_mean(pd.Series((1, 2, 3, 4)), pd.Series((1, 2, 3, 4)))
    # pd.Series(0.24)

    df = pd.DataFrame(
        {"a": [1, 2, 3, 4], "b": [1, 1, 1, 1]}
    )
    w = pd.Series((1, 2, 3, 4))
    var_of_weighted_mean(df, w)
    # a    0.24
    # b    0.00
    # dtype: float64
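The values 0.3125 and 0.24 in the examples are consistent with the following linearization-style estimator for the variance of a ratio mean. This is a worked check inferred from the documented outputs, so the formula should be read as an assumption and not as a statement of the exact implementation.

```latex
\widehat{\mathrm{Var}}(\bar{y}_w)
  = \frac{\sum_{i} w_i^2 \,(y_i - \bar{y}_w)^2}{\left(\sum_{i} w_i\right)^2},
\qquad
\bar{y}_w = \frac{\sum_{i} w_i y_i}{\sum_{i} w_i}

% Worked check with y = w = (1, 2, 3, 4):
%   \bar{y}_w = 30 / 10 = 3
%   numerator = 1 \cdot 4 + 4 \cdot 1 + 9 \cdot 0 + 16 \cdot 1 = 24
%   result    = 24 / 10^2 = 0.24
%
% Worked check with equal weights (w_i = 1):
%   \bar{y}_w = 2.5
%   numerator = 2.25 + 0.25 + 0.25 + 2.25 = 5
%   result    = 5 / 4^2 = 0.3125
```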
- balance.stats_and_plots.weighted_stats.weighted_mean(v: list[Any] | pd.Series | pd.DataFrame | np.matrix, w: list[Any] | pd.Series | npt.NDArray | None = None, inf_rm: bool = False) → pd.Series
Computes the weighted average of a pandas Series or DataFrame.
If no weights are supplied, it just computes the simple arithmetic mean.
See: https://en.wikipedia.org/wiki/Weighted_arithmetic_mean
Uses _prepare_weighted_stat_args().
- Parameters:
v (Union[List, pd.Series, pd.DataFrame, np.matrix]) – Values to average. None (or np.nan) values are treated like 0. If v is a DataFrame, then the average of the values from each column is returned, all using the same set of weights from w.
w (Union[List, pd.Series], optional) – Weights. Defaults to None. If there is a None value in the weights, that value is ignored in the calculation.
inf_rm (bool, optional) – Whether to remove infinite values (from weights or values) from the computation. Defaults to False.
- Returns:
The weighted mean. If v is a DataFrame with several columns, then the pd.Series will have a value for the weighted mean of each of the columns. If inf_rm=False then:
If v has Inf then the output will be Inf. If w has Inf then the output will be np.nan.
- Return type:
pd.Series(dtype=np.float64)
Examples:
    import pandas as pd
    from balance.stats_and_plots.weighted_stats import weighted_mean

    weighted_mean(pd.Series((1, 2, 3, 4)))
    # 0    2.5
    # dtype: float64

    weighted_mean(pd.Series((1, 2, 3, 4)), pd.Series((1, 2, 3, 4)))
    # 0    3.0
    # dtype: float64

    df = pd.DataFrame(
        {"a": [1, 2, 3, 4], "b": [1, 1, 1, 1]}
    )
    w = pd.Series((1, 2, 3, 4))
    weighted_mean(df, w)
    # a    3.0
    # b    1.0
    # dtype: float64
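For reference, the weighted arithmetic mean is the sum of value times weight divided by the sum of weights. A minimal plain-Python sketch follows. The helper `weighted_mean_sketch` is hypothetical and deliberately ignores the None/NaN and infinity handling described above.

```python
def weighted_mean_sketch(values, weights=None):
    """Hypothetical helper: sum(v * w) / sum(w), with equal weights by default."""
    if weights is None:
        weights = [1.0] * len(values)
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Apply the same weights to every column, as the DataFrame example does.
columns = {"a": [1, 2, 3, 4], "b": [1, 1, 1, 1]}
w = [1, 2, 3, 4]
means = {name: weighted_mean_sketch(col, w) for name, col in columns.items()}
print(means)  # {'a': 3.0, 'b': 1.0}
```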
- balance.stats_and_plots.weighted_stats.weighted_quantile(v: list[Any] | pd.Series | pd.DataFrame | npt.NDArray | np.matrix, quantiles: list[Any] | pd.Series | npt.NDArray, w: list[Any] | pd.Series | npt.NDArray | None = None, inf_rm: bool = False) → pd.DataFrame
Calculates the weighted quantiles (q) of values (v) based on weights (w).
See _prepare_weighted_stat_args() for the pre-processing done to v and w.
Based on statsmodels.stats.weightstats.DescrStatsW().
- Parameters:
v (Union[List, pd.Series, pd.DataFrame, np.array, np.matrix]) – Values to get the weighted quantiles for.
quantiles (Union[List, pd.Series]) – The quantiles to calculate.
w (Union[List, pd.Series, np.array], optional) – Weights. Defaults to None.
inf_rm (bool, optional) – Whether to remove infinite values (from weights or values) from the computation. Defaults to False.
- Returns:
- The index (named p) has the values from quantiles. The columns are based on v:
If it’s a pd.Series, the output has one column; if it’s a pd.DataFrame with several columns, then each column in the output corresponds to a column in v.
- Return type:
pd.DataFrame
Examples:
    from balance.stats_and_plots.weighted_stats import weighted_quantile

    weighted_quantile([1, 2, 3], [0.5], w=[1, 1, 1]).iloc[0, 0]
    # 2.0
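As an illustration of what a weighted quantile means, the sketch below uses one common definition: sort the values, accumulate their weights, and return the first value whose cumulative weight reaches q times the total weight. This is an assumption chosen for clarity. `weighted_quantile_sketch` is a hypothetical helper, and DescrStatsW may treat ties or interpolation differently.

```python
def weighted_quantile_sketch(values, q, weights=None):
    """Hypothetical helper: smallest value whose cumulative weight reaches q * total."""
    if weights is None:
        weights = [1.0] * len(values)
    pairs = sorted(zip(values, weights))  # sort by value, keeping weights attached
    target = q * sum(w for _, w in pairs)
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= target:
            return value
    return pairs[-1][0]  # only reached through floating-point round-off


median = weighted_quantile_sketch([1, 2, 3], 0.5, [1, 1, 1])
heavy_tail = weighted_quantile_sketch([1, 2, 3], 0.5, [1, 1, 4])
print(median, heavy_tail)
```

With equal weights the median of [1, 2, 3] is 2, in line with the example above. Putting most of the weight on the largest value moves the weighted median to 3.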
- balance.stats_and_plots.weighted_stats.weighted_sd(v: list[Any] | pd.Series | pd.DataFrame | np.matrix, w: list[Any] | pd.Series | npt.NDArray | None = None, inf_rm: bool = False) → pd.Series
Calculates the sample weighted standard deviation.
See weighted_var() for details.
- Parameters:
v (Union[List, pd.Series, pd.DataFrame, np.matrix]) – Values.
w (Union[List, pd.Series, np.ndarray, None], optional) – Weights. Defaults to None.
inf_rm (bool, optional) – Whether to remove infinite values. Defaults to False.
- Returns:
np.sqrt of weighted_var() (np.float64).
- Return type:
pd.Series
Examples:
    import pandas as pd
    from balance.stats_and_plots.weighted_stats import weighted_sd

    weighted_sd(pd.DataFrame({"x": [1, 2, 3]}), [1, 1, 1]).tolist()
    # [1.0]
- balance.stats_and_plots.weighted_stats.weighted_var(v: list[Any] | pd.Series | pd.DataFrame | npt.NDArray | np.matrix, w: list[Any] | pd.Series | npt.NDArray | None = None, inf_rm: bool = False) → pd.Series
Calculates the sample weighted variance (a.k.a. ‘reliability weights’).
This is described here: https://en.wikipedia.org/wiki/Weighted_arithmetic_mean#Reliability_weights_2
It is also used in SDMTools; see: https://www.gnu.org/software/gsl/doc/html/statistics.html#weighted-samples
Uses weighted_mean() and _prepare_weighted_stat_args().
- Parameters:
v (Union[List, pd.Series, pd.DataFrame, np.matrix]) – Values to get the weighted variance for. None values are treated like 0. If v is a DataFrame, then the weighted variance of the values from each column is returned, all using the same set of weights from w.
w (Union[List, pd.Series], optional) – Weights. Defaults to None. If there is a None value in the weights, that value is ignored in the calculation.
inf_rm (bool, optional) – Whether to remove infinite values (from weights or values) from the computation. Defaults to False.
- Returns:
The weighted variance. If v is a DataFrame with several columns, then the pd.Series will have a value for the weighted variance of each of the columns. If inf_rm=False then:
If v has Inf then the output will be Inf. If w has Inf then the output will be np.nan.
- Return type:
pd.Series[np.float64]
Examples:
    import pandas as pd
    from balance.stats_and_plots.weighted_stats import weighted_var

    weighted_var(pd.DataFrame({"x": [1, 2, 3]}), [1, 1, 1]).tolist()
    # [1.0]
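The reliability-weights estimator referenced above divides the weighted sum of squared deviations by V1 - V2/V1, where V1 is the sum of the weights and V2 is the sum of the squared weights. The plain-Python sketch below follows that formula. `weighted_var_sketch` is a hypothetical helper that skips the None and infinity handling, so treat it as an illustration and not as the library's code.

```python
from math import sqrt


def weighted_var_sketch(values, weights=None):
    """Hypothetical helper: sample weighted variance with reliability weights."""
    if weights is None:
        weights = [1.0] * len(values)
    v1 = sum(weights)                 # V1: sum of weights
    v2 = sum(w * w for w in weights)  # V2: sum of squared weights
    mean = sum(v * w for v, w in zip(values, weights)) / v1
    weighted_ss = sum(w * (v - mean) ** 2 for v, w in zip(values, weights))
    return weighted_ss / (v1 - v2 / v1)


var_equal = weighted_var_sketch([1, 2, 3], [1, 1, 1])
sd_weighted = sqrt(weighted_var_sketch([1, 2, 3, 4], [1, 2, 3, 4]))
print(var_equal, round(sd_weighted, 6))
```

With equal weights the denominator reduces to n - 1, which gives the 1.0 shown in the example. For values and weights of (1, 2, 3, 4), the square root is about 1.195229, the weighted std reported in the descriptive_stats examples.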