balance.stats_and_plots.weighted_stats

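balance.stats_and_plots.weighted_stats.ci_of_weighted_mean(v, w=None, conf_level=0.95, round_ndigits=None, inf_rm=False) → pd.Series

Computes the confidence interval of the weighted mean of a list of values and their corresponding weights.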

If no weights are supplied, it assumes that all values have equal weights of 1.0.

Parameters:

v (Union[List[float], pd.Series, pd.DataFrame, np.ndarray]) – A series of values. If v is a DataFrame, the weighted mean and its confidence interval will be calculated for each column using the same set of weights from w.
w (Optional[Union[List[float], pd.Series, np.ndarray]]) – A series of weights to be used with v. If None, all values will be weighted equally.
conf_level (float, optional) – Confidence level for the interval, between 0 and 1. Defaults to 0.95.
round_ndigits (Optional[int], optional) – Number of decimal places to round the confidence interval to. If None, the values will not be rounded. Defaults to None.
inf_rm (bool, optional) – Whether to remove infinite values (from weights or values) from the computation. Defaults to False.

Returns:

The confidence interval of the weighted mean. If v is a DataFrame with several columns, the pd.Series will have a value for the confidence interval of each column. The values are of data type Tuple[np.float64, np.float64]. If inf_rm is False:

If v has infinite values, the output will be Inf. If w has infinite values, the output will be np.nan.

Return type:

pd.Series

Examples:

import pandas as pd
from balance.stats_and_plots.weighted_stats import ci_of_weighted_mean

ci_of_weighted_mean(pd.Series((1, 2, 3, 4)))
# 0    (1.404346824279273, 3.5956531757207273)
# dtype: object

ci_of_weighted_mean(pd.Series((1, 2, 3, 4)), round_ndigits=3).to_list()
# [(1.404, 3.596)]

ci_of_weighted_mean(pd.Series((1, 2, 3, 4)), pd.Series((1, 2, 3, 4)))
# 0    (2.039817664728938, 3.960182335271062)
# dtype: object

ci_of_weighted_mean(pd.Series((1, 2, 3, 4)), pd.Series((1, 2, 3, 4)), round_ndigits=3).to_list()
# [(2.04, 3.96)]

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [1, 1, 1, 1]})
w = pd.Series((1, 2, 3, 4))
ci_of_weighted_mean(df, w, conf_level=0.99, round_ndigits=3)
# a    (1.738, 4.262)
# b        (1.0, 1.0)
# dtype: object

ci_of_weighted_mean(df, w, conf_level=0.99, round_ndigits=3).to_dict()
# {'a': (1.738, 4.262), 'b': (1.0, 1.0)}
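
The example values above are consistent with a normal-approximation interval, i.e. the weighted mean plus or minus a standard-normal quantile times the square root of the variance of the weighted mean. The following is a minimal standard-library sketch under that assumption; the function name ci_of_weighted_mean_sketch is hypothetical and this is not the library's implementation.

```python
import math
from statistics import NormalDist


def ci_of_weighted_mean_sketch(values, weights=None, conf_level=0.95):
    """Assumed form: weighted mean +/- z * sqrt(variance of the weighted mean)."""
    if weights is None:
        weights = [1.0] * len(values)
    total_w = sum(weights)
    mean = sum(w * v for v, w in zip(values, weights)) / total_w
    # Variance of the weighted mean (pi-estimator for the ratio-mean).
    var_of_mean = sum((w * (v - mean)) ** 2 for v, w in zip(values, weights)) / total_w**2
    # Two-sided standard-normal quantile for the requested confidence level.
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    half_width = z * math.sqrt(var_of_mean)
    return (mean - half_width, mean + half_width)


low, high = ci_of_weighted_mean_sketch([1, 2, 3, 4], [1, 2, 3, 4])
print(round(low, 3), round(high, 3))  # 2.04 3.96
```

Raising conf_level to 0.99 widens the interval because the normal quantile grows from roughly 1.96 to roughly 2.58.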

balance.stats_and_plots.weighted_stats.descriptive_stats(df: pd.DataFrame, weights: list[Any] | pd.Series | npt.NDArray | None = None, stat: str = 'mean', weighted: bool = True, numeric_only: bool = False, add_na: bool = True, formula: str | list[str] | None = None, **kwargs: Any) → pd.DataFrame

Computes weighted statistics (e.g., mean, std) on a DataFrame.

This function takes a DataFrame and weights and applies a weighted aggregation function (mean, std, or a statistic from DescrStatsW). Its main benefit is that if the DataFrame includes non-numeric columns, descriptive_stats will first run model_matrix() to create numeric dummy variables, which are then processed.

Parameters:

df (pd.DataFrame) – Some DataFrame to get stats (mean, std, etc.) for.
weights (Union[List, pd.Series, np.ndarray], optional) – Weights to apply for the computation. Defaults to None.
stat (str, optional) –
Which statistic to calculate on the data:
If mean - uses weighted_mean() (with inf_rm=True).
If std - uses weighted_sd() (with inf_rm=True).
If var_of_mean - uses var_of_weighted_mean() (with inf_rm=True).
If ci_of_mean - uses ci_of_weighted_mean() (with inf_rm=True).
If something else - tries to use statsmodels.stats.weightstats.DescrStatsW() (while removing mutual nan using rm_mutual_nas()). This supports stat values such as std_mean, sum_weights, nobs, etc. See that function's documentation for more.

Defaults to "mean".
weighted (bool, optional) – If stat is not "mean" or "std", whether to use the weights with the DescrStatsW function. Defaults to True.
numeric_only (bool, optional) – Whether the statistics should be computed only on numeric columns. If True, non-numeric columns will be omitted. If False, model_matrix() (with no formula argument) will be used to convert non-numeric columns to dummy variables. Defaults to False.
add_na (bool, optional) – Passed to model_matrix(). Relevant only if numeric_only == False and df has non-numeric columns. Defaults to True.
formula (Optional[Union[str, List[str]]], optional) – Formula passed to model_matrix(). When provided, the formula is always applied, allowing customization of which columns/dummies are used in the statistics (and taking precedence over numeric_only). Defaults to None.
**kwargs – Extra arguments to be passed to the underlying functions (e.g., ci_of_weighted_mean).

Returns:

Returns pd.DataFrame of the output (based on stat argument), for each of the columns in df.

Return type:

pd.DataFrame

Examples:

import numpy as np
import pandas as pd
from balance.stats_and_plots.weighted_stats import descriptive_stats, weighted_mean, weighted_sd

# Without weights
x = [1, 2, 3, 4]
print(descriptive_stats(pd.DataFrame(x), stat="mean"))
print(np.mean(x))
print(weighted_mean(x))
#      0
# 0  2.5
# 2.5
# 0    2.5
# dtype: float64

print(descriptive_stats(pd.DataFrame(x), stat="var_of_mean"))
#         0
# 0  0.3125

print(descriptive_stats(pd.DataFrame(x), stat="std"))
print(weighted_sd(x))
x2 = pd.Series(x)
print(np.sqrt(np.sum((x2 - x2.mean()) ** 2) / (len(x) - 1)))
#           0
# 0  1.290994
# 0    1.290994
# dtype: float64
# 1.2909944487358056
# Notice that this differs from
# print(np.std(x))
# which gives 1.118033988749895.
# That is the MLE (i.e., biased, dividing by n and not n-1) estimator for std:
# (np.sqrt(np.sum((x2 - x2.mean()) ** 2) / len(x)))

x2 = pd.Series(x)
tmp_sd = np.sqrt(np.sum((x2 - x2.mean()) ** 2) / (len(x) - 1))
tmp_se = tmp_sd / np.sqrt(len(x))
print(descriptive_stats(pd.DataFrame(x), stat="std_mean").iloc[0, 0])
print(tmp_se)
# 0.6454972243679029
# 0.6454972243679028

# Weighted results
x, w = [1, 2, 3, 4], [1, 2, 3, 4]
print(descriptive_stats(pd.DataFrame(x), w, stat="mean"))
#      0
# 0  3.0

print(descriptive_stats(pd.DataFrame(x), w, stat="std"))
#           0
# 0  1.195229

print(descriptive_stats(pd.DataFrame(x), w, stat="std_mean"))
#           0
# 0  0.333333

print(descriptive_stats(pd.DataFrame(x), w, stat="var_of_mean"))
#       0
# 0  0.24

print(descriptive_stats(pd.DataFrame(x), w, stat="ci_of_mean", conf_level=0.99, round_ndigits=3))
#                 0
# 0  (1.738, 4.262)

# Formula examples
df = pd.DataFrame({"num": [1, 2, 3], "group": ["a", "b", "a"]})
# Default formula includes all variables and encodes categoricals.
print(descriptive_stats(df, stat="mean"))
#    group[a]  group[b]  num
# 0      0.67      0.33  2.0

print(descriptive_stats(df, stat="mean", formula="num"))
#    num
# 0  2.0

print(descriptive_stats(df, stat="mean", formula="group"))
#    group[a]  group[b]
# 0      0.67      0.33

print(descriptive_stats(df, stat="mean", formula="num + group"))
#    group[a]  group[b]  num
# 0      0.67      0.33  2.0

print(descriptive_stats(df, stat="mean", formula="num + group + num:group"))
#    group[a]  group[b]  num  num:group[T.b]
# 0      0.67      0.33  2.0            0.67
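
The group[a] and group[b] values in the formula examples follow from dummy coding: once a categorical column is expanded into 0/1 indicator columns, the (weighted) mean of each indicator is the (weighted) share of rows in that category. The sketch below illustrates that idea with plain Python; dummy_means_sketch is a hypothetical helper and it does not call model_matrix().

```python
def dummy_means_sketch(column, categories, weights=None):
    """Weighted mean of each 0/1 indicator derived from a categorical column."""
    if weights is None:
        weights = [1.0] * len(categories)
    total_w = sum(weights)
    means = {}
    for level in sorted(set(categories)):
        # Indicator is 1 for rows in this category and 0 otherwise.
        indicator = [1.0 if c == level else 0.0 for c in categories]
        means[f"{column}[{level}]"] = sum(w * x for x, w in zip(indicator, weights)) / total_w
    return means


means = dummy_means_sketch("group", ["a", "b", "a"])
print({name: round(value, 2) for name, value in means.items()})
# {'group[a]': 0.67, 'group[b]': 0.33}
```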

balance.stats_and_plots.weighted_stats.relative_frequency_table(df: pd.DataFrame | pd.Series, column: str | None = None, w: pd.Series | None = None) → pd.DataFrame

Creates a relative frequency table by aggregating over a categorical column (column) - optionally weighted by w. I.e.: produce the proportion (or weighted proportion) of rows that appear in each category, relative to the total number of rows (or sum of weights). See: https://en.wikipedia.org/wiki/Frequency_(statistics)#Types.

Used in plotting functions.

Parameters:

df (pd.DataFrame) – A DataFrame with categorical columns, or a pd.Series of the grouping column.
column (Optional[str]) – The name of the column to be aggregated. If None (default), it takes the first column of df (if df is a pd.DataFrame), or uses df as is (if it is a pd.Series).
w (Optional[pd.Series], optional) – Optional weights to use when aggregating the relative proportions. If None, the weight is assumed to be 1 for all rows. Defaults to None.

Returns:

A pd.DataFrame with columns:

column, the aggregation variable, and
'prop', the aggregated (weighted) proportion of rows from each group in 'column'.

Return type:

pd.DataFrame

Examples:

from balance.stats_and_plots.weighted_stats import relative_frequency_table
import pandas as pd

df = pd.DataFrame({
    'group': ('a', 'b', 'c', 'c'),
    'v1': (1, 2, 3, 4),
})
print(relative_frequency_table(df, 'group'))
#   group  prop
# 0     a  0.25
# 1     b  0.25
# 2     c  0.50

print(relative_frequency_table(df, 'group', pd.Series((2, 1, 1, 1),)))
#   group  prop
# 0     a   0.4
# 1     b   0.2
# 2     c   0.4

# Using a pd.Series:
a_series = df['group']
print(relative_frequency_table(a_series))
#   group  prop
# 0     a  0.25
# 1     b  0.25
# 2     c  0.50
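
The aggregation described for this function amounts to summing the weights within each category and dividing by the total weight. A minimal sketch of that arithmetic, using only the standard library (relative_frequency_sketch is a hypothetical name and it returns a dict rather than a DataFrame):

```python
from collections import defaultdict


def relative_frequency_sketch(groups, weights=None):
    """Share of total weight that falls in each category."""
    if weights is None:
        weights = [1.0] * len(groups)  # unweighted: every row counts once
    totals = defaultdict(float)
    for group, weight in zip(groups, weights):
        totals[group] += weight
    grand_total = sum(totals.values())
    return {group: total / grand_total for group, total in sorted(totals.items())}


print(relative_frequency_sketch(["a", "b", "c", "c"]))
# {'a': 0.25, 'b': 0.25, 'c': 0.5}
print(relative_frequency_sketch(["a", "b", "c", "c"], [2, 1, 1, 1]))
# {'a': 0.4, 'b': 0.2, 'c': 0.4}
```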

balance.stats_and_plots.weighted_stats.var_of_weighted_mean(v, w=None, inf_rm=False) → pd.Series

Computes the variance of the weighted average (π-estimator for ratio-mean) of a list of values and their corresponding weights.

If no weights are supplied, it assumes that all values have equal weights of 1.0.

See: https://en.wikipedia.org/wiki/Weighted_arithmetic_mean#Variance_of_the_weighted_mean_(%CF%80-estimator_for_ratio-mean)

Uses _prepare_weighted_stat_args().

Parameters:

v (Union[List, pd.Series, pd.DataFrame, np.matrix]) – A series of values. If v is a DataFrame, the weighted variance will be calculated for each column using the same set of weights from w.
w (Optional[Union[List, pd.Series, np.ndarray]]) – A series of weights to be used with v. If None, all values will be weighted equally.
inf_rm (bool, optional) – Whether to remove infinite values (from weights or values) from the computation. Defaults to False.

Returns:

The variance of the weighted mean. If v is a DataFrame with several columns, the pd.Series will have a value for the weighted variance of each column. The values are of data type np.float64. If inf_rm is False:

If v has infinite values, the output will be Inf. If w has infinite values, the output will be np.nan.

Return type:

pd.Series

Examples:

import pandas as pd
from balance.stats_and_plots.weighted_stats import var_of_weighted_mean

# In R: sum((1:4 - mean(1:4))^2 / 4) / (4)
# [1] 0.3125
var_of_weighted_mean(pd.Series((1, 2, 3, 4)))
# pd.Series(0.3125)

# For a reproducible R example, see: https://gist.github.com/talgalili/b92cd8cdcbfc287e331a8f27db265c00
var_of_weighted_mean(pd.Series((1, 2, 3, 4)), pd.Series((1, 2, 3, 4)))
# pd.Series(0.24)

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [1, 1, 1, 1]})
w = pd.Series((1, 2, 3, 4))
var_of_weighted_mean(df, w)
# a    0.24
# b    0.00
# dtype: float64
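
The π-estimator from the linked Wikipedia section can be written as the sum of squared weighted deviations, sum((w_i * (y_i - mean_w))^2), divided by (sum(w_i))^2. The sketch below applies that formula column by column with a shared weight vector; it reproduces the example values above, but var_of_weighted_mean_sketch is a hypothetical stand-in and not the library function.

```python
def var_of_weighted_mean_sketch(values, weights=None):
    """sum((w_i * (y_i - weighted_mean))^2) / (sum(w_i))^2"""
    if weights is None:
        weights = [1.0] * len(values)
    total_w = sum(weights)
    mean = sum(w * v for v, w in zip(values, weights)) / total_w
    return sum((w * (v - mean)) ** 2 for v, w in zip(values, weights)) / total_w**2


# The same weights are reused for every column, as described for DataFrame input.
columns = {"a": [1, 2, 3, 4], "b": [1, 1, 1, 1]}
w = [1, 2, 3, 4]
per_column = {name: var_of_weighted_mean_sketch(col, w) for name, col in columns.items()}
print(per_column)  # {'a': 0.24, 'b': 0.0}
```

For column a the weighted mean is 3, the squared weighted deviations are 4, 4, 0 and 16, and 24 / 10^2 gives 0.24. A constant column such as b has no deviations, so its variance is 0.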

balance.stats_and_plots.weighted_stats.weighted_mean(v, w=None, inf_rm=False) → pd.Series

Computes the weighted average of a pandas Series or DataFrame.

If no weights are supplied, it just computes the simple arithmetic mean.

See: https://en.wikipedia.org/wiki/Weighted_arithmetic_mean

Uses _prepare_weighted_stat_args().

Parameters:

v (Union[List, pd.Series, pd.DataFrame, np.matrix]) – Values to average. None (or np.nan) values are treated like 0. If v is a DataFrame, then the average of the values from each column will be returned, all using the same set of weights from w.
w (Union[List, pd.Series], optional) – Weights. Defaults to None. If the weights contain a None value, that value will be ignored in the calculation.
inf_rm (bool, optional) – Whether to remove infinite values (from weights or values) from the computation. Defaults to False.

Returns:

The weighted mean. If v is a DataFrame with several columns, then the pd.Series will have a value for the weighted mean of each column. If inf_rm=False then:

If v has Inf then the output will be Inf. If w has Inf then the output will be np.nan.

Return type:

pd.Series(dtype=np.float64)

Examples:

import pandas as pd
from balance.stats_and_plots.weighted_stats import weighted_mean

weighted_mean(pd.Series((1, 2, 3, 4)))
# 0    2.5
# dtype: float64

weighted_mean(pd.Series((1, 2, 3, 4)), pd.Series((1, 2, 3, 4)))
# 0    3.0
# dtype: float64

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [1, 1, 1, 1]})
w = pd.Series((1, 2, 3, 4))
weighted_mean(df, w)
# a    3.0
# b    1.0
# dtype: float64
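
The Inf and np.nan outcomes described under Returns fall out of ordinary floating-point arithmetic on sum(w * v) / sum(w): an infinite value makes the numerator infinite, while an infinite weight makes both numerator and denominator infinite, and Inf / Inf is nan. The sketch below shows this with plain Python floats. The inf_rm branch is an assumption about what "remove" means (dropping the whole value/weight pair), and weighted_mean_sketch is a hypothetical name.

```python
import math


def weighted_mean_sketch(values, weights=None, inf_rm=False):
    """sum(w_i * v_i) / sum(w_i), optionally dropping pairs with infinite entries."""
    if weights is None:
        weights = [1.0] * len(values)
    pairs = list(zip(values, weights))
    if inf_rm:
        # Assumption: an infinite value or weight removes the whole pair.
        pairs = [(v, w) for v, w in pairs if math.isfinite(v) and math.isfinite(w)]
    return sum(w * v for v, w in pairs) / sum(w for _, w in pairs)


print(weighted_mean_sketch([1, 2, 3, 4], [1, 2, 3, 4]))   # 3.0
print(weighted_mean_sketch([1, 2, math.inf], [1, 1, 1]))  # inf
print(weighted_mean_sketch([1, 2, 3], [1, 1, math.inf]))  # nan
```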

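balance.stats_and_plots.weighted_stats.weighted_quantile(v, quantiles, w=None, inf_rm=False) → pd.DataFrame

Calculates the weighted quantiles (q) of values (v) based on weights (w).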

See _prepare_weighted_stat_args() for the pre-processing done to v and w.

Based on statsmodels.stats.weightstats.DescrStatsW().

Parameters:

v (Union[List, pd.Series, pd.DataFrame, np.array, np.matrix]) – Values to get the weighted quantiles for.
quantiles (Union[List, pd.Series]) – The quantiles to calculate.
w (Union[List, pd.Series, np.array], optional) – Weights. Defaults to None.
inf_rm (bool, optional) – Whether to remove infinite values (from weights or values) from the computation. Defaults to False.

Returns:

A pd.DataFrame whose index (named p) holds the values from quantiles. The columns are based on v: if it is a pd.Series, there is one column; if it is a pd.DataFrame with several columns, then each column in the output corresponds to a column in v.

Return type:

pd.DataFrame

Examples:

from balance.stats_and_plots.weighted_stats import weighted_quantile

weighted_quantile([1, 2, 3], [0.5], w=[1, 1, 1]).iloc[0, 0]
# 2.0
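
As a rough intuition for what a weighted quantile is, the sketch below sorts the values, accumulates their weights, and returns the first value whose cumulative weight reaches the requested share of the total. This is a simple inverse-CDF definition chosen for illustration; DescrStatsW may use a different convention for ties or interpolation, so results can differ from the library in general. The name weighted_quantile_sketch is hypothetical.

```python
def weighted_quantile_sketch(values, q, weights=None):
    """Smallest value whose cumulative weight is at least q * total weight."""
    if weights is None:
        weights = [1.0] * len(values)
    pairs = sorted(zip(values, weights))  # sort by value, keeping weights attached
    target = q * sum(weights)
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= target:
            return value
    return pairs[-1][0]  # fallback for rounding at q == 1


print(weighted_quantile_sketch([1, 2, 3], 0.5, [1, 1, 1]))  # 2
print(weighted_quantile_sketch([1, 2, 3], 0.5, [1, 1, 4]))  # 3
```

In the second call the value 3 carries four of the six units of weight, so the weighted median moves from 2 to 3.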

balance.stats_and_plots.weighted_stats.weighted_sd(v, w=None, inf_rm=False) → pd.Series

Calculates the sample weighted standard deviation.

See weighted_var() for details.

Parameters:

v (Union[List, pd.Series, pd.DataFrame, np.matrix]) – Values.
w (Union[List, pd.Series, np.ndarray, None], optional) – Weights. Defaults to None.
inf_rm (bool, optional) – Whether to remove infinite values (from weights or values) from the computation. Defaults to False.

Returns:

np.sqrt of weighted_var() (np.float64)

Return type:

pd.Series

Examples:

import pandas as pd
from balance.stats_and_plots.weighted_stats import weighted_sd

weighted_sd(pd.DataFrame({"x": [1, 2, 3]}), [1, 1, 1]).tolist()
# [1.0]

balance.stats_and_plots.weighted_stats.weighted_var(v, w=None, inf_rm=False) → pd.Series

Calculates the sample weighted variance (a.k.a. 'reliability weights'). This is described here: https://en.wikipedia.org/wiki/Weighted_arithmetic_mean#Reliability_weights_2. It is also used in SDMTools, see: https://www.gnu.org/software/gsl/doc/html/statistics.html#weighted-samples

Uses weighted_mean() and _prepare_weighted_stat_args().

Parameters:

v (Union[List, pd.Series, pd.DataFrame, np.matrix]) – Values to get the weighted variance for. None values are treated like 0. If v is a DataFrame, then the weighted variance of the values from each column will be returned, all using the same set of weights from w.
w (Union[List, pd.Series], optional) – Weights. Defaults to None. If the weights contain a None value, that value will be ignored in the calculation.
inf_rm (bool, optional) – Whether to remove infinite values (from weights or values) from the computation. Defaults to False.

Returns:

The weighted variance. If v is a DataFrame with several columns, then the pd.Series will have a value for the weighted variance of each column. If inf_rm=False then:

If v has Inf then the output will be Inf. If w has Inf then the output will be np.nan.

Return type:

pd.Series[np.float64]

Examples:

import pandas as pd
from balance.stats_and_plots.weighted_stats import weighted_var

weighted_var(pd.DataFrame({"x": [1, 2, 3]}), [1, 1, 1]).tolist()
# [1.0]
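
The reliability-weights estimator from the linked Wikipedia section divides the weighted sum of squared deviations by V1 - V2 / V1, where V1 is the sum of the weights and V2 is the sum of the squared weights. With equal weights this denominator reduces to n - 1, the usual sample-variance correction. The sketch below is a plain-Python illustration of that formula; weighted_var_sketch is a hypothetical name and not the library function.

```python
import math


def weighted_var_sketch(values, weights=None):
    """sum(w_i * (v_i - weighted_mean)^2) / (V1 - V2 / V1)"""
    if weights is None:
        weights = [1.0] * len(values)
    v1 = sum(weights)                  # V1: sum of weights
    v2 = sum(w * w for w in weights)   # V2: sum of squared weights
    mean = sum(w * v for v, w in zip(values, weights)) / v1
    return sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / (v1 - v2 / v1)


print(weighted_var_sketch([1, 2, 3], [1, 1, 1]))  # 1.0
# The weighted standard deviation is the square root of the weighted variance.
print(round(math.sqrt(weighted_var_sketch([1, 2, 3, 4], [1, 2, 3, 4])), 6))  # 1.195229
```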
