balance.stats_and_plots.weighted_stats
- balance.stats_and_plots.weighted_stats.ci_of_weighted_mean(v: List[float] | Series | DataFrame | ndarray[Any, dtype[_ScalarType_co]], w: List[float] | Series | ndarray[Any, dtype[_ScalarType_co]] | None = None, conf_level: float = 0.95, round_ndigits: int | None = None, inf_rm: bool = False) -> Series
Computes the confidence interval of the weighted mean of a list of values and their corresponding weights.
If no weights are supplied, it assumes that all values have equal weights of 1.0.
- Parameters:
v (Union[List[float], pd.Series, pd.DataFrame, np.ndarray]) – A series of values. If v is a DataFrame, the weighted mean and its confidence interval will be calculated for each column using the same set of weights from w.
w (Optional[Union[List[float], pd.Series, np.ndarray]]) – A series of weights to be used with v. If None, all values will be weighted equally.
conf_level (float, optional) – Confidence level for the interval, between 0 and 1. Defaults to 0.95.
round_ndigits (Optional[int], optional) – Number of decimal places to round the confidence interval to. If None, the values will not be rounded. Defaults to None.
inf_rm (bool, optional) – Whether to remove infinite values (from weights or values) from the computation. Defaults to False.
- Returns:
The confidence interval of the weighted mean. If v is a DataFrame with several columns, the pd.Series will have a value for the confidence interval of each column. The values are of data type Tuple[np.float64, np.float64]. If inf_rm is False:
If v has infinite values, the output will be Inf. If w has infinite values, the output will be np.nan.
- Return type:
pd.Series
Examples
    import pandas as pd
    from balance.stats_and_plots.weighted_stats import ci_of_weighted_mean

    ci_of_weighted_mean(pd.Series((1, 2, 3, 4)))
    # 0    (1.404346824279273, 3.5956531757207273)
    # dtype: object

    ci_of_weighted_mean(pd.Series((1, 2, 3, 4)), round_ndigits = 3).to_list()
    # [(1.404, 3.596)]

    ci_of_weighted_mean(pd.Series((1, 2, 3, 4)), pd.Series((1, 2, 3, 4)))
    # 0    (2.039817664728938, 3.960182335271062)
    # dtype: object

    ci_of_weighted_mean(pd.Series((1, 2, 3, 4)), pd.Series((1, 2, 3, 4)), round_ndigits = 3).to_list()
    # [(2.04, 3.96)]

    df = pd.DataFrame(
        {"a": [1,2,3,4], "b": [1,1,1,1]}
    )
    w = pd.Series((1, 2, 3, 4))
    ci_of_weighted_mean(df, w, conf_level = 0.99, round_ndigits = 3)
    # a    (1.738, 4.262)
    # b    (1.0, 1.0)
    # dtype: object

    ci_of_weighted_mean(df, w, conf_level = 0.99, round_ndigits = 3).to_dict()
    # {'a': (1.738, 4.262), 'b': (1.0, 1.0)}
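The intervals shown in these examples are consistent with a normal-approximation interval of the form mean ± z · sqrt(variance of the mean). The following pure-Python sketch illustrates that reading; it is an assumption about the calculation rather than the library's code, and `normal_ci` is a hypothetical helper name. Its inputs (2.5 with 0.3125, and 3.0 with 0.24) are the weighted means and variances of the mean reported elsewhere on this page.

```python
from math import sqrt
from statistics import NormalDist


def normal_ci(mean, var_of_mean, conf_level=0.95):
    """Hypothetical helper: symmetric normal-approximation interval."""
    # Two-sided critical value, roughly 1.96 for conf_level=0.95.
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    half_width = z * sqrt(var_of_mean)
    return (mean - half_width, mean + half_width)


# Unweighted case: mean 2.5, variance of the mean 0.3125.
lo, hi = normal_ci(2.5, 0.3125)
print(round(lo, 3), round(hi, 3))  # 1.404 3.596

# Weighted case at the 99% level: mean 3.0, variance of the mean 0.24.
lo99, hi99 = normal_ci(3.0, 0.24, conf_level=0.99)
print(round(lo99, 3), round(hi99, 3))  # 1.738 4.262
```

Both results agree with the rounded outputs documented for `ci_of_weighted_mean`, which suggests (but does not prove) that the function combines the weighted mean and its variance in this way.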
- balance.stats_and_plots.weighted_stats.descriptive_stats(df: DataFrame, weights: List | Series | ndarray[Any, dtype[_ScalarType_co]] | None = None, stat: Literal['mean', 'std', 'var_of_mean', 'ci_of_mean', '...'] = 'mean', weighted: bool = True, numeric_only: bool = False, add_na: bool = True, **kwargs) -> DataFrame
Computes weighted statistics (e.g., mean, std) on a DataFrame.
This function takes a DataFrame and weights and applies a weighted aggregation function (mean, std, or DescrStatsW). Its main benefit is that if the DataFrame includes non-numeric columns, descriptive_stats first runs model_matrix() to create numeric dummy variables, which are then processed.
- Parameters:
df (pd.DataFrame) – Some DataFrame to get stats (mean, std, etc.) for.
weights (Union[List, pd.Series, np.ndarray], optional) – Weights to apply for the computation. Defaults to None.
stat (Literal["mean", "std", "var_of_mean", ...], optional) – Which statistic to calculate on the data. Defaults to "mean".
If mean: uses weighted_mean() (with inf_rm=True).
If std: uses weighted_sd() (with inf_rm=True).
If var_of_mean: uses var_of_weighted_mean() (with inf_rm=True).
If ci_of_mean: uses ci_of_weighted_mean() (with inf_rm=True).
If something else: tries to use statsmodels.stats.weightstats.DescrStatsW() (while removing mutual NaN values using rm_mutual_nas()). This supports stat values such as std_mean, sum_weights, nobs, etc. See that function's documentation for more.
weighted (bool, optional) – Whether to use the weights with the DescrStatsW function. Relevant only if stat is not "mean" or "std". Defaults to True.
numeric_only (bool, optional) – Whether to compute the statistics only on numeric columns. If True, non-numeric columns are omitted. If False, model_matrix() (with no formula argument) is used to convert non-numeric columns to dummy variables. Defaults to False.
add_na (bool, optional) – Passed to model_matrix(). Relevant only if numeric_only == False and df has non-numeric columns. Defaults to True.
**kwargs – Extra arguments to be passed to the underlying functions (e.g., ci_of_weighted_mean).
- Returns:
Returns pd.DataFrame of the output (based on stat argument), for each of the columns in df.
- Return type:
pd.DataFrame
Examples
    import numpy as np
    import pandas as pd
    from balance.stats_and_plots.weighted_stats import descriptive_stats, weighted_mean, weighted_sd

    # Without weights
    x = [1, 2, 3, 4]
    print(descriptive_stats(pd.DataFrame(x), stat="mean"))
    print(np.mean(x))
    print(weighted_mean(x))
    #      0
    # 0  2.5
    # 2.5
    # 0    2.5
    # dtype: float64

    print(descriptive_stats(pd.DataFrame(x), stat="var_of_mean"))
    #         0
    # 0  0.3125

    print(descriptive_stats(pd.DataFrame(x), stat="std"))
    print(weighted_sd(x))
    x2 = pd.Series(x)
    print(np.sqrt( np.sum((x2 - x2.mean())**2) / (len(x)-1) ))
    #           0
    # 0  1.290994
    # 0    1.290994
    # dtype: float64
    # 1.2909944487358056
    # Notice that it is different from
    # print(np.std(x))
    # which gives: 1.118033988749895
    # Which is the MLE (i.e.: biased, dividing by n and not n-1) estimator for std:
    # (np.sqrt( np.sum((x2 - x2.mean())**2) / (len(x)) ))

    x2 = pd.Series(x)
    tmp_sd = np.sqrt(np.sum((x2 - x2.mean()) ** 2) / (len(x) - 1))
    tmp_se = tmp_sd / np.sqrt(len(x))
    print(descriptive_stats(pd.DataFrame(x), stat="std_mean").iloc[0, 0])
    print(tmp_se)
    # 0.6454972243679029
    # 0.6454972243679028

    # Weighted results
    x, w = [1, 2, 3, 4], [1, 2, 3, 4]
    print(descriptive_stats(pd.DataFrame(x), w, stat="mean"))
    #      0
    # 0  3.0
    print(descriptive_stats(pd.DataFrame(x), w, stat="std"))
    #           0
    # 0  1.195229
    print(descriptive_stats(pd.DataFrame(x), w, stat="std_mean"))
    #           0
    # 0  0.333333
    print(descriptive_stats(pd.DataFrame(x), w, stat="var_of_mean"))
    #       0
    # 0  0.24
    print(descriptive_stats(pd.DataFrame(x), w, stat="ci_of_mean", conf_level = 0.99, round_ndigits=3))
    #                 0
    # 0  (1.738, 4.262)
- balance.stats_and_plots.weighted_stats.relative_frequency_table(df: DataFrame | Series, column: str | None = None, w: Series | None = None) -> DataFrame
Creates a relative frequency table by aggregating over a categorical column (column) - optionally weighted by w. I.e.: produce the proportion (or weighted proportion) of rows that appear in each category, relative to the total number of rows (or sum of weights). See: https://en.wikipedia.org/wiki/Frequency_(statistics)#Types.
Used in plotting functions.
- Parameters:
df (pd.DataFrame) – A DataFrame with categorical columns, or a pd.Series of the grouping column.
column (Optional[str]) – The name of the column to be aggregated. If None (default), the first column of df is used (if df is a pd.DataFrame), or df is used as is (if it is a pd.Series).
w (Optional[pd.Series], optional) – Optional weights to use when aggregating the relative proportions. If None, a weight of 1 is assumed for all rows. Defaults to None.
- Returns:
- a pd.DataFrame with columns:
column, the aggregation variable, and
'prop', the aggregated (weighted) proportion of rows from each group in 'column'.
- Return type:
pd.DataFrame
Examples
    from balance.stats_and_plots.weighted_stats import relative_frequency_table
    import pandas as pd

    df = pd.DataFrame({
        'group': ('a', 'b', 'c', 'c'),
        'v1': (1, 2, 3, 4),
    })
    print(relative_frequency_table(df, 'group'))
    #   group  prop
    # 0     a  0.25
    # 1     b  0.25
    # 2     c  0.50
    print(relative_frequency_table(df, 'group', pd.Series((2, 1, 1, 1),)))
    #   group  prop
    # 0     a   0.4
    # 1     b   0.2
    # 2     c   0.4

    # Using a pd.Series:
    a_series = df['group']
    print(relative_frequency_table(a_series))
    #   group  prop
    # 0     a  0.25
    # 1     b  0.25
    # 2     c  0.50
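The weighted proportions in this example can be checked by hand: each category's share is the sum of its weights divided by the sum of all weights. The sketch below shows that arithmetic in plain Python; `relative_frequency` is a hypothetical helper written for illustration, not part of the library, and it returns a dict rather than a DataFrame.

```python
from collections import defaultdict


def relative_frequency(groups, weights=None):
    """Hypothetical helper: weighted share of each category."""
    if weights is None:
        weights = [1.0] * len(groups)  # equal weights when none are given
    totals = defaultdict(float)
    for group, weight in zip(groups, weights):
        totals[group] += weight        # accumulate weight per category
    grand_total = sum(totals.values())
    return {group: total / grand_total for group, total in sorted(totals.items())}


groups = ["a", "b", "c", "c"]
unweighted = relative_frequency(groups)
weighted = relative_frequency(groups, [2, 1, 1, 1])
print(unweighted)  # {'a': 0.25, 'b': 0.25, 'c': 0.5}
print(weighted)    # {'a': 0.4, 'b': 0.2, 'c': 0.4}
```

With weights (2, 1, 1, 1), category 'a' holds 2 of the 5 total weight units and 'c' holds 1 + 1 = 2, which is why both come out at 0.4.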
- balance.stats_and_plots.weighted_stats.var_of_weighted_mean(v: List | Series | DataFrame | matrix, w: List | Series | ndarray[Any, dtype[_ScalarType_co]] | None = None, inf_rm: bool = False) -> Series
Computes the variance of the weighted average (pi estimator for ratio-mean) of a list of values and their corresponding weights.
If no weights are supplied, it assumes that all values have equal weights of 1.0.
Uses _prepare_weighted_stat_args().
- Parameters:
v (Union[List, pd.Series, pd.DataFrame, np.matrix]) – A series of values. If v is a DataFrame, the weighted variance will be calculated for each column using the same set of weights from w.
w (Optional[Union[List, pd.Series, np.ndarray]]) – A series of weights to be used with v. If None, all values will be weighted equally.
inf_rm (bool, optional) – Whether to remove infinite values (from weights or values) from the computation. Defaults to False.
- Returns:
The variance of the weighted mean. If v is a DataFrame with several columns, the pd.Series will have a value for the weighted variance of each column. The values are of data type np.float64. If inf_rm is False:
If v has infinite values, the output will be Inf. If w has infinite values, the output will be np.nan.
- Return type:
pd.Series
Examples
    import pandas as pd
    from balance.stats_and_plots.weighted_stats import var_of_weighted_mean

    # In R: sum((1:4 - mean(1:4))^2 / 4) / (4)
    # [1] 0.3125
    var_of_weighted_mean(pd.Series((1, 2, 3, 4)))
    # pd.Series(0.3125)

    # For a reproducible R example, see: https://gist.github.com/talgalili/b92cd8cdcbfc287e331a8f27db265c00
    var_of_weighted_mean(pd.Series((1, 2, 3, 4)), pd.Series((1, 2, 3, 4)))
    # pd.Series(0.24)

    df = pd.DataFrame(
        {"a": [1,2,3,4], "b": [1,1,1,1]}
    )
    w = pd.Series((1, 2, 3, 4))
    var_of_weighted_mean(df, w)
    # a    0.24
    # b    0.00
    # dtype: float64
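One formula that reproduces both documented values is the linearization estimate sum((w_i · (y_i − ȳ_w))²) / (sum w_i)², where ȳ_w is the weighted mean. The sketch below implements it in plain Python. That the library uses exactly this expression is an assumption based on the outputs above, and `var_of_weighted_mean_sketch` is a hypothetical name.

```python
def var_of_weighted_mean_sketch(v, w=None):
    """Hypothetical helper: linearization variance of a weighted (ratio) mean."""
    if w is None:
        w = [1.0] * len(v)  # equal weights when none are given
    total_w = sum(w)
    mean_w = sum(wi * vi for wi, vi in zip(w, v)) / total_w
    # Squared weighted deviations, scaled by the squared sum of weights.
    return sum((wi * (vi - mean_w)) ** 2 for wi, vi in zip(w, v)) / total_w ** 2


var_unweighted = var_of_weighted_mean_sketch([1, 2, 3, 4])
var_weighted = var_of_weighted_mean_sketch([1, 2, 3, 4], [1, 2, 3, 4])
print(var_unweighted)  # 0.3125
print(var_weighted)    # 0.24
```

In the weighted case the weighted mean is 3, the weighted deviations are (−2, −2, 0, 4), their squares sum to 24, and 24 / 10² = 0.24.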
- balance.stats_and_plots.weighted_stats.weighted_mean(v: List | Series | DataFrame | matrix, w: List | Series | ndarray[Any, dtype[_ScalarType_co]] | None = None, inf_rm: bool = False) -> Series
Computes the weighted average of a pandas Series or DataFrame.
If no weights are supplied, it just computes the simple arithmetic mean.
See: https://en.wikipedia.org/wiki/Weighted_arithmetic_mean
Uses _prepare_weighted_stat_args().
- Parameters:
v (Union[List, pd.Series, pd.DataFrame, np.matrix]) – values to average. None (or np.nan) values are treated like 0. If v is a DataFrame, then the average of the values from each column will be returned, all using the same set of weights from w.
w (Union[List, pd.Series], optional) – weights. Defaults to None. If the weights contain a None value, that value is ignored in the calculation.
inf_rm (bool, optional) – whether to remove infinite (from weights or values) from the computation. Defaults to False.
- Returns:
The weighted mean. If v is a DataFrame with several columns, then the pd.Series will have a value for the weighted mean of each of the columns. If inf_rm=False then:
If v has Inf then the output will be Inf. If w has Inf then the output will be np.nan.
- Return type:
pd.Series(dtype=np.float64)
Examples
    import pandas as pd
    from balance.stats_and_plots.weighted_stats import weighted_mean

    weighted_mean(pd.Series((1, 2, 3, 4)))
    # 0    2.5
    # dtype: float64

    weighted_mean(pd.Series((1, 2, 3, 4)), pd.Series((1, 2, 3, 4)))
    # 0    3.0
    # dtype: float64

    df = pd.DataFrame(
        {"a": [1,2,3,4], "b": [1,1,1,1]}
    )
    w = pd.Series((1, 2, 3, 4))
    weighted_mean(df, w)
    # a    3.0
    # b    1.0
    # dtype: float64
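The Inf and NaN outcomes described under Returns follow naturally from floating-point arithmetic on the textbook formula sum(w · v) / sum(w). The plain-Python sketch below demonstrates this; `weighted_mean_sketch` is a hypothetical helper, and whether the library arrives at these results by the same route is an assumption.

```python
import math


def weighted_mean_sketch(v, w=None):
    """Hypothetical helper: sum(w * v) / sum(w) on plain floats."""
    if w is None:
        w = [1.0] * len(v)  # equal weights when none are given
    return sum(wi * vi for wi, vi in zip(w, v)) / sum(w)


print(weighted_mean_sketch([1, 2, 3, 4]))                # 2.5
print(weighted_mean_sketch([1, 2, 3, 4], [1, 2, 3, 4]))  # 3.0

# An infinite value makes the numerator infinite, so the result is inf.
inf_value_result = weighted_mean_sketch([1, 2, 3, math.inf])
print(inf_value_result)  # inf

# An infinite weight makes numerator and denominator both infinite; inf / inf is nan.
inf_weight_result = weighted_mean_sketch([1, 2, 3, 4], [1, 2, 3, math.inf])
print(inf_weight_result)  # nan
```

Setting inf_rm=True in the library avoids both cases by dropping the infinite entries before the computation.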
- balance.stats_and_plots.weighted_stats.weighted_quantile(v: List | Series | DataFrame | ndarray[Any, dtype[_ScalarType_co]] | matrix, quantiles: List | Series | ndarray[Any, dtype[_ScalarType_co]], w: List | Series | ndarray[Any, dtype[_ScalarType_co]] | None = None, inf_rm: bool = False) -> DataFrame
Calculates the weighted quantiles (q) of values (v) based on weights (w).
See _prepare_weighted_stat_args() for the pre-processing done to v and w.
Based on statsmodels.stats.weightstats.DescrStatsW().
- Parameters:
v (Union[List, pd.Series, pd.DataFrame, np.array, np.matrix]) – values to get the weighted quantiles for.
quantiles (Union[List, pd.Series]) – the quantiles to calculate.
w (Union[List, pd.Series, np.array], optional) – weights. Defaults to None.
inf_rm (bool, optional) – whether to remove infinite (from weights or values) from the computation. Defaults to False.
- Returns:
- The index (named p) has the values from quantiles. The columns are based on v:
If it's a pd.Series, the output has one column; if it's a pd.DataFrame with several columns, then each column in the output corresponds to a column in v.
- Return type:
pd.DataFrame
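To make the idea of a weighted quantile concrete, the sketch below uses one common definition: the smallest value at which the cumulative share of weight reaches q (the inverse of the weighted empirical CDF). This is an illustrative assumption. DescrStatsW may handle ties or boundaries differently, so its results are not guaranteed to match, and `weighted_quantile_sketch` is a hypothetical helper.

```python
def weighted_quantile_sketch(v, quantiles, w=None):
    """Hypothetical helper: inverse of the weighted empirical CDF."""
    if w is None:
        w = [1.0] * len(v)  # equal weights when none are given
    pairs = sorted(zip(v, w))  # sort values, keeping weights aligned
    total_w = sum(weight for _, weight in pairs)
    result = {}
    for q in quantiles:
        cumulative = 0.0
        for value, weight in pairs:
            cumulative += weight
            # Take the first value whose cumulative weight share reaches q.
            if cumulative / total_w >= q:
                result[q] = value
                break
    return result


median_unweighted = weighted_quantile_sketch([1, 2, 3, 4], [0.5])
median_weighted = weighted_quantile_sketch([1, 2, 3, 4], [0.5], [1, 2, 3, 4])
print(median_unweighted)  # {0.5: 2}
print(median_weighted)    # {0.5: 3}
```

With weights (1, 2, 3, 4) the cumulative shares are 0.1, 0.3, 0.6 and 1.0, so the weighted median under this definition moves from 2 to 3.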
- balance.stats_and_plots.weighted_stats.weighted_sd(v: List | Series | DataFrame | matrix, w: List | Series | ndarray[Any, dtype[_ScalarType_co]] | None = None, inf_rm: bool = False) -> Series
Calculate the sample weighted standard deviation.
See weighted_var() for details.
- Parameters:
v (Union[List, pd.Series, pd.DataFrame, np.matrix]) – Values.
w (Union[List, pd.Series, np.ndarray, None], optional) – Weights. Defaults to None.
inf_rm (bool, optional) – Remove inf. Defaults to False.
- Returns:
np.sqrt of weighted_var() (np.float64)
- Return type:
pd.Series
- balance.stats_and_plots.weighted_stats.weighted_var(v: List | Series | DataFrame | ndarray[Any, dtype[_ScalarType_co]] | matrix, w: List | Series | ndarray[Any, dtype[_ScalarType_co]] | None = None, inf_rm: bool = False) -> Series
Calculate the sample weighted variance (a.k.a. 'reliability weights').
This is described here: https://en.wikipedia.org/wiki/Weighted_arithmetic_mean#Reliability_weights_2
It is also used in SDMTools, see: https://www.gnu.org/software/gsl/doc/html/statistics.html#weighted-samples
Uses weighted_mean() and _prepare_weighted_stat_args().
- Parameters:
v (Union[List, pd.Series, pd.DataFrame, np.matrix]) – values to get the weighted variance for. None values are treated like 0. If v is a DataFrame, then the weighted variance of the values from each column will be returned, all using the same set of weights from w.
w (Union[List, pd.Series], optional) – weights. Defaults to None. If the weights contain a None value, that value is ignored in the calculation.
inf_rm (bool, optional) – whether to remove infinite values (from weights or values) from the computation. Defaults to False.
- Returns:
The weighted variance. If v is a DataFrame with several columns, then the pd.Series will have a value for the weighted variance of each of the columns. If inf_rm=False then:
If v has Inf then the output will be Inf. If w has Inf then the output will be np.nan.
- Return type:
pd.Series[np.float64]
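The reliability-weights estimator from the Wikipedia article linked above divides the weighted sum of squared deviations by V1 − V2 / V1, where V1 is the sum of the weights and V2 the sum of the squared weights. The plain-Python sketch below applies that formula; `weighted_var_sketch` is a hypothetical helper, and its match to the library's internals is inferred only from the fact that its square root reproduces the std values (1.290994 and 1.195229) shown in the descriptive_stats examples.

```python
from math import sqrt


def weighted_var_sketch(v, w=None):
    """Hypothetical helper: unbiased variance under reliability weights."""
    if w is None:
        w = [1.0] * len(v)  # equal weights when none are given
    v1 = sum(w)                    # sum of weights
    v2 = sum(wi ** 2 for wi in w)  # sum of squared weights
    mean_w = sum(wi * vi for wi, vi in zip(w, v)) / v1
    squared_dev = sum(wi * (vi - mean_w) ** 2 for wi, vi in zip(w, v))
    return squared_dev / (v1 - v2 / v1)


sd_unweighted = sqrt(weighted_var_sketch([1, 2, 3, 4]))
sd_weighted = sqrt(weighted_var_sketch([1, 2, 3, 4], [1, 2, 3, 4]))
print(round(sd_unweighted, 6))  # 1.290994
print(round(sd_weighted, 6))    # 1.195229
```

With equal weights of 1 the denominator becomes n − 1, so the formula reduces to the usual sample variance.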