balance.stats_and_plots.weighted_stats

balance.stats_and_plots.weighted_stats.ci_of_weighted_mean(v: List[float] | Series | DataFrame | ndarray, w: List[float] | Series | ndarray | None = None, conf_level: float = 0.95, round_ndigits: int | None = None, inf_rm: bool = False) Series[source]

Computes the confidence interval of the weighted mean of a list of values and their corresponding weights.

If no weights are supplied, it assumes that all values have equal weights of 1.0.

Parameters:
  • v (Union[List[float], pd.Series, pd.DataFrame, np.ndarray]) – A series of values. If v is a DataFrame, the weighted mean and its confidence interval will be calculated for each column using the same set of weights from w.

  • w (Optional[Union[List[float], pd.Series, np.ndarray]]) – A series of weights to be used with v. If None, all values will be weighted equally.

  • conf_level (float, optional) – Confidence level for the interval, between 0 and 1. Defaults to 0.95.

  • round_ndigits (Optional[int], optional) – Number of decimal places to round the confidence interval to. If None, the values will not be rounded. Defaults to None.

  • inf_rm (bool, optional) – Whether to remove infinite values (from weights or values) from the computation. Defaults to False.

Returns:

The confidence interval of the weighted mean. If v is a DataFrame with several columns, the pd.Series will have a value for the confidence interval of each column. The values are of data type Tuple[np.float64, np.float64]. If inf_rm is False:

If v has infinite values, the output will be Inf. If w has infinite values, the output will be np.nan.

Return type:

pd.Series

Examples

import pandas as pd
from balance.stats_and_plots.weighted_stats import ci_of_weighted_mean

ci_of_weighted_mean(pd.Series((1, 2, 3, 4)))
    # 0    (1.404346824279273, 3.5956531757207273)
    # dtype: object

ci_of_weighted_mean(pd.Series((1, 2, 3, 4)), round_ndigits = 3).to_list()
    # [(1.404, 3.596)]

ci_of_weighted_mean(pd.Series((1, 2, 3, 4)), pd.Series((1, 2, 3, 4)))
    # 0    (2.039817664728938, 3.960182335271062)
    # dtype: object

ci_of_weighted_mean(pd.Series((1, 2, 3, 4)), pd.Series((1, 2, 3, 4)), round_ndigits = 3).to_list()
    # [(2.04, 3.96)]

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [1, 1, 1, 1]})
w = pd.Series((1, 2, 3, 4))
ci_of_weighted_mean(df, w, conf_level = 0.99, round_ndigits = 3)
    # a    (1.738, 4.262)
    # b    (1.0, 1.0)
    # dtype: object

ci_of_weighted_mean(df, w, conf_level = 0.99, round_ndigits = 3).to_dict()
    # {'a': (1.738, 4.262), 'b': (1.0, 1.0)}
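The documented outputs are consistent with a normal-approximation interval of the form mean ± z · sqrt(variance of the weighted mean). The following is a minimal sketch under that assumption, not the library's implementation. The helper name `normal_ci` is hypothetical; the inputs 3.0 and 0.24 are the weighted mean and the variance of the weighted mean documented on this page for v = (1, 2, 3, 4) and w = (1, 2, 3, 4).

```python
from math import sqrt
from statistics import NormalDist


def normal_ci(mean, var_of_mean, conf_level=0.95):
    """Hypothetical helper: mean +/- z * sqrt(variance of the mean)."""
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    half_width = z * sqrt(var_of_mean)
    return (mean - half_width, mean + half_width)


# Weighted case: mean 3.0, variance of the weighted mean 0.24.
ci_99 = tuple(round(x, 3) for x in normal_ci(3.0, 0.24, conf_level=0.99))
print(ci_99)  # (1.738, 4.262)

# Unweighted case: mean 2.5, variance of the mean 0.3125.
ci_95 = tuple(round(x, 3) for x in normal_ci(2.5, 0.3125))
print(ci_95)  # (1.404, 3.596)
```

Both results agree with the rounded intervals shown in the examples for this function.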

balance.stats_and_plots.weighted_stats.descriptive_stats(df: DataFrame, weights: List | Series | ndarray | None = None, stat: Literal['mean', 'std', 'var_of_mean', 'ci_of_mean', '...'] = 'mean', weighted: bool = True, numeric_only: bool = False, add_na: bool = True, **kwargs) DataFrame[source]

Computes weighted statistics (e.g.: mean, std) on a DataFrame.

This function takes a DataFrame and weights and applies a weighted aggregation function (mean, std, or DescrStatsW). The main benefit of the function is that if the DataFrame includes non-numeric columns, descriptive_stats will first run model_matrix() to create numeric dummy variables, which are then processed.

Parameters:
  • df (pd.DataFrame) – Some DataFrame to get stats (mean, std, etc.) for.

  • weights (Union[ List, pd.Series, np.ndarray, ], optional) – Weights to apply for the computation. Defaults to None.

  • stat (Literal["mean", "std", "var_of_mean", ...], optional) –

    Which statistic to calculate on the data:
      • mean: uses weighted_mean() (with inf_rm=True).
      • std: uses weighted_sd() (with inf_rm=True).
      • var_of_mean: uses var_of_weighted_mean() (with inf_rm=True).
      • ci_of_mean: uses ci_of_weighted_mean() (with inf_rm=True).
      • anything else: tries to use statsmodels.stats.weightstats.DescrStatsW() (while removing mutual nan using rm_mutual_nas()).

    The DescrStatsW fallback supports statistics such as std_mean, sum_weights, nobs, etc. See that function's documentation for more.

    Defaults to "mean".

  • weighted (bool, optional) – Whether to use the weights with the DescrStatsW function. Relevant only if stat is not "mean" or "std". Defaults to True.

  • numeric_only (bool, optional) – Whether the statistics should be computed only on numeric columns. If True, non-numeric columns are omitted. If False, model_matrix() (with no formula argument) is used to transform non-numeric columns into dummy variables. Defaults to False.

  • add_na (bool, optional) – Passed to model_matrix(). Relevant only if numeric_only == False and df has non-numeric columns. Defaults to True.

  • **kwargs – extra args to be passed to functions (e.g.: ci_of_weighted_mean)

Returns:

Returns pd.DataFrame of the output (based on stat argument), for each of the columns in df.

Return type:

pd.DataFrame

Examples

import numpy as np
import pandas as pd
from balance.stats_and_plots.weighted_stats import descriptive_stats, weighted_mean, weighted_sd

# Without weights
x = [1, 2, 3, 4]
print(descriptive_stats(pd.DataFrame(x), stat="mean"))
print(np.mean(x))
print(weighted_mean(x))
    #     0
    # 0  2.5
    # 2.5
    # 0    2.5
    # dtype: float64

print(descriptive_stats(pd.DataFrame(x), stat="var_of_mean"))
    #         0
    # 0  0.3125

print(descriptive_stats(pd.DataFrame(x), stat="std"))
print(weighted_sd(x))
x2 = pd.Series(x)
print(np.sqrt( np.sum((x2 - x2.mean())**2) / (len(x)-1) ))
    #         0
    # 0  1.290994
    # 0    1.290994
    # dtype: float64
    # 1.2909944487358056
    # Notice that it is different from
    # print(np.std(x))
    # which gives: 1.118033988749895
    # Which is the MLE (i.e.: biased, dividing by n and not n-1) estimator for std:
    # (np.sqrt( np.sum((x2 - x2.mean())**2) / (len(x)) ))

x2 = pd.Series(x)
tmp_sd = np.sqrt(np.sum((x2 - x2.mean()) ** 2) / (len(x) - 1))
tmp_se = tmp_sd / np.sqrt(len(x))
print(descriptive_stats(pd.DataFrame(x), stat="std_mean").iloc[0, 0])
print(tmp_se)
    # 0.6454972243679029
    # 0.6454972243679028

# Weighted results
x, w = [1, 2, 3, 4], [1, 2, 3, 4]
print(descriptive_stats(pd.DataFrame(x), w, stat="mean"))
    #      0
    # 0  3.0
print(descriptive_stats(pd.DataFrame(x), w, stat="std"))
    #           0
    # 0  1.195229
print(descriptive_stats(pd.DataFrame(x), w, stat="std_mean"))
    #           0
    # 0  0.333333
print(descriptive_stats(pd.DataFrame(x), w, stat="var_of_mean"))
    #       0
    # 0  0.24
print(descriptive_stats(pd.DataFrame(x), w, stat="ci_of_mean", conf_level = 0.99, round_ndigits=3))
    #                 0
    # 0  (1.738, 4.262)
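To illustrate the idea of turning non-numeric columns into dummy variables before aggregating, here is a rough sketch. It uses `pd.get_dummies` as a stand-in for `model_matrix()`, which is an assumption made only for illustration: the column names and details produced by `model_matrix()` may differ. The data and weights are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["a", "b", "c", "c"],
    "v1": [1, 2, 3, 4],
})
w = pd.Series([2, 1, 1, 1])

# Stand-in for model_matrix(): expand the categorical column into 0/1 dummies.
numeric = pd.get_dummies(df, dtype=float)

# Weighted mean of every column: sum(w * x) / sum(w).
weighted_means = numeric.mul(w, axis=0).sum() / w.sum()
print(weighted_means)
```

For a dummy column, the weighted mean is the weighted share of rows in that category, so `group_a` comes out as 2 / 5 = 0.4 here.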
balance.stats_and_plots.weighted_stats.relative_frequency_table(df: DataFrame | Series, column: str | None = None, w: Series | None = None) DataFrame[source]

Creates a relative frequency table by aggregating over a categorical column (column) - optionally weighted by w. I.e.: produce the proportion (or weighted proportion) of rows that appear in each category, relative to the total number of rows (or sum of weights). See: https://en.wikipedia.org/wiki/Frequency_(statistics)#Types.

Used in plotting functions.

Parameters:
  • df (pd.DataFrame) – A DataFrame with categorical columns, or a pd.Series of the grouping column.

  • column (Optional[str]) – The name of the column to be aggregated. If None (default), the first column of df is used (if df is a pd.DataFrame), or df is used as is (if it is a pd.Series).

  • w (Optional[pd.Series], optional) – Optional weights to use when aggregating the relative proportions. If None, a weight of 1 is assumed for all rows. Defaults to None.

Returns:

a pd.DataFrame with columns:
  • column, the aggregation variable, and,

  • 'prop', the aggregated (weighted) proportion of rows from each group in column.

Return type:

pd.DataFrame

Examples

from balance.stats_and_plots.weighted_stats import relative_frequency_table
import pandas as pd

df = pd.DataFrame({
    'group': ('a', 'b', 'c', 'c'),
    'v1': (1, 2, 3, 4),
})
print(relative_frequency_table(df, 'group'))
    #     group  prop
    #   0     a  0.25
    #   1     b  0.25
    #   2     c  0.50
print(relative_frequency_table(df, 'group', pd.Series((2, 1, 1, 1),)))
    #     group  prop
    #   0     a   0.4
    #   1     b   0.2
    #   2     c   0.4

# Using a pd.Series:
a_series = df['group']
print(relative_frequency_table(a_series))
    #   group  prop
    # 0     a  0.25
    # 1     b  0.25
    # 2     c  0.50
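The weighted proportions can be reproduced by summing the weights within each category and dividing by the total weight. The following is a minimal pandas sketch of that calculation, not the function's actual implementation; it reuses the groups and weights from the weighted example for this function.

```python
import pandas as pd

df = pd.DataFrame({"group": ["a", "b", "c", "c"]})
w = pd.Series([2, 1, 1, 1])

# Sum the weights within each category, then divide by the total weight.
prop = w.groupby(df["group"]).sum() / w.sum()
table = prop.rename("prop").reset_index()
print(table)
```

With all weights equal to 1 this reduces to the plain share of rows per category.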
balance.stats_and_plots.weighted_stats.var_of_weighted_mean(v: List | Series | DataFrame | matrix, w: List | Series | ndarray | None = None, inf_rm: bool = False) Series[source]

Computes the variance of the weighted average (pi estimator for ratio-mean) of a list of values and their corresponding weights.

If no weights are supplied, it assumes that all values have equal weights of 1.0.

See: https://en.wikipedia.org/wiki/Weighted_arithmetic_mean#Variance_of_the_weighted_mean_(%CF%80-estimator_for_ratio-mean)

Uses _prepare_weighted_stat_args().

Parameters:
  • v (Union[List, pd.Series, pd.DataFrame, np.matrix]) – A series of values. If v is a DataFrame, the variance of the weighted mean will be calculated for each column using the same set of weights from w.

  • w (Optional[Union[List, pd.Series, np.ndarray]]) – A series of weights to be used with v. If None, all values will be weighted equally.

  • inf_rm (bool, optional) – Whether to remove infinite values (from weights or values) from the computation. Defaults to False.

Returns:

The variance of the weighted mean. If v is a DataFrame with several columns, the pd.Series will have a value for the variance of the weighted mean of each column. The values are of data type np.float64. If inf_rm is False:

If v has infinite values, the output will be Inf. If w has infinite values, the output will be np.nan.

Return type:

pd.Series

Examples

import pandas as pd
from balance.stats_and_plots.weighted_stats import var_of_weighted_mean

# In R: sum((1:4 - mean(1:4))^2 / 4) / (4)
# [1] 0.3125
var_of_weighted_mean(pd.Series((1, 2, 3, 4)))
    # pd.Series(0.3125)

# For a reproducible R example, see: https://gist.github.com/talgalili/b92cd8cdcbfc287e331a8f27db265c00
var_of_weighted_mean(pd.Series((1, 2, 3, 4)), pd.Series((1, 2, 3, 4)))
    # pd.Series(0.24)

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [1, 1, 1, 1]})
w = pd.Series((1, 2, 3, 4))
var_of_weighted_mean(df, w)
    # a    0.24
    # b    0.00
    # dtype: float64
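The documented values match the formula sum(w_i^2 · (v_i − weighted mean)^2) / (sum(w_i))^2. Below is a short NumPy sketch of that formula, written as an illustration rather than as the library's code; the function name `var_of_weighted_mean_sketch` is hypothetical and the sketch skips the pre-processing done by _prepare_weighted_stat_args().

```python
import numpy as np


def var_of_weighted_mean_sketch(v, w=None):
    """Hypothetical helper: sum(w^2 * (v - mean_w)^2) / sum(w)^2."""
    v = np.asarray(v, dtype=float)
    # Equal weights of 1.0 when no weights are supplied.
    w = np.ones_like(v) if w is None else np.asarray(w, dtype=float)
    weighted_mean = np.sum(w * v) / np.sum(w)
    return np.sum(w**2 * (v - weighted_mean) ** 2) / np.sum(w) ** 2


unweighted = var_of_weighted_mean_sketch([1, 2, 3, 4])
weighted = var_of_weighted_mean_sketch([1, 2, 3, 4], [1, 2, 3, 4])
print(unweighted, weighted)  # approximately 0.3125 and 0.24
```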

balance.stats_and_plots.weighted_stats.weighted_mean(v: List | Series | DataFrame | matrix, w: List | Series | ndarray | None = None, inf_rm: bool = False) Series[source]

Computes the weighted average of a pandas Series or DataFrame.

If no weights are supplied, it just computes the simple arithmetic mean.

See: https://en.wikipedia.org/wiki/Weighted_arithmetic_mean

Uses _prepare_weighted_stat_args().

Parameters:
  • v (Union[ List, pd.Series, pd.DataFrame, np.matrix, ]) – values to average. None (or np.nan) values are treated like 0. If v is a DataFrame, then the average of the values from each column will be returned, all using the same set of weights from w.

  • w (Union[ List, pd.Series], optional) – weights. Defaults to None. If the weights contain a None value, that value is ignored in the calculation.

  • inf_rm (bool, optional) – whether to remove infinite values (from weights or values) from the computation. Defaults to False.

Returns:

The weighted mean. If v is a DataFrame with several columns, then the pd.Series will have a value for the weighted mean of each of the columns. If inf_rm=False then:

If v has Inf then the output will be Inf. If w has Inf then the output will be np.nan.

Return type:

pd.Series(dtype=np.float64)

Examples

import pandas as pd
from balance.stats_and_plots.weighted_stats import weighted_mean

weighted_mean(pd.Series((1, 2, 3, 4)))
    # 0    2.5
    # dtype: float64

weighted_mean(pd.Series((1, 2, 3, 4)), pd.Series((1, 2, 3, 4)))
    # 0    3.0
    # dtype: float64

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [1, 1, 1, 1]})
w = pd.Series((1, 2, 3, 4))
weighted_mean(df, w)
    # a    3.0
    # b    1.0
    # dtype: float64
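The Inf and np.nan outcomes described for inf_rm=False follow from ordinary floating-point arithmetic on sum(w · v) / sum(w). The sketch below demonstrates this with plain NumPy; `weighted_mean_sketch` is a hypothetical helper for illustration and does not reproduce the library's pre-processing.

```python
import numpy as np


def weighted_mean_sketch(v, w):
    """Hypothetical helper: sum(w * v) / sum(w), with no special Inf handling."""
    v = np.asarray(v, dtype=float)
    w = np.asarray(w, dtype=float)
    # Silence the "invalid value" warning raised by inf / inf.
    with np.errstate(invalid="ignore"):
        return np.sum(w * v) / np.sum(w)


finite = weighted_mean_sketch([1, 2, 3, 4], [1, 2, 3, 4])
inf_in_v = weighted_mean_sketch([1, 2, 3, np.inf], [1, 2, 3, 4])
inf_in_w = weighted_mean_sketch([1, 2, 3, 4], [1, 2, 3, np.inf])
print(finite, inf_in_v, inf_in_w)  # 3.0 inf nan
```

An infinite value makes the numerator infinite while the denominator stays finite, giving Inf; an infinite weight makes both infinite, and inf / inf is nan.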

balance.stats_and_plots.weighted_stats.weighted_quantile(v: List | Series | DataFrame | ndarray | matrix, quantiles: List | Series | ndarray, w: List | Series | ndarray | None = None, inf_rm: bool = False) DataFrame[source]

Calculates the weighted quantiles (q) of values (v) based on weights (w).

See _prepare_weighted_stat_args() for the pre-processing done to v and w.

Based on statsmodels.stats.weightstats.DescrStatsW().

Parameters:
  • v (Union[ List, pd.Series, pd.DataFrame, np.array, np.matrix, ]) – values to get the weighted quantiles for.

  • quantiles (Union[ List, pd.Series, ]) – the quantiles to calculate.

  • w (Union[ List, pd.Series, np.array, ], optional) – weights. Defaults to None.

  • inf_rm (bool, optional) – whether to remove infinite values (from weights or values) from the computation. Defaults to False.

Returns:

The index (named p) has the values from quantiles. The columns are based on v:

If v is a pd.Series, the output has one column; if v is a pd.DataFrame with several columns, then each column in the output corresponds to a column in v.

Return type:

pd.DataFrame
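As a rough illustration of what a weighted quantile means, the sketch below inverts the weighted empirical distribution: it returns the smallest value whose cumulative weight reaches the requested share of the total weight. This is one common definition, offered as an assumption for illustration; DescrStatsW may treat ties and interpolation differently, so results can differ from this function's output. The name `weighted_quantile_sketch` is hypothetical.

```python
import numpy as np


def weighted_quantile_sketch(v, quantiles, w):
    """Hypothetical helper: inverse of the weighted empirical CDF."""
    v = np.asarray(v, dtype=float)
    w = np.asarray(w, dtype=float)
    # Sort the values and carry the weights along.
    order = np.argsort(v)
    v_sorted, w_sorted = v[order], w[order]
    cum_w = np.cumsum(w_sorted)
    # Cumulative weight each quantile must reach.
    targets = np.asarray(quantiles, dtype=float) * cum_w[-1]
    idx = np.searchsorted(cum_w, targets, side="left")
    # Guard against floating-point overshoot at the top end.
    idx = np.minimum(idx, len(v_sorted) - 1)
    return v_sorted[idx]


q = weighted_quantile_sketch([1, 2, 3, 4], [0.25, 0.5, 0.75], [1, 2, 3, 4])
print(q)  # [2. 3. 4.]
```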

balance.stats_and_plots.weighted_stats.weighted_sd(v: List | Series | DataFrame | matrix, w: List | Series | ndarray | None = None, inf_rm: bool = False) Series[source]

Calculates the sample weighted standard deviation.

See weighted_var() for details.

Parameters:
  • v (Union[ List, pd.Series, pd.DataFrame, np.matrix, ]) – Values.

  • w (Union[ List, pd.Series, np.ndarray, None, ], optional) – Weights. Defaults to None.

  • inf_rm (bool, optional) – Whether to remove infinite values (from weights or values) from the computation. Defaults to False.

Returns:

np.sqrt of weighted_var() (np.float64)

Return type:

pd.Series

balance.stats_and_plots.weighted_stats.weighted_var(v: List | Series | DataFrame | ndarray | matrix, w: List | Series | ndarray | None = None, inf_rm: bool = False) Series[source]

Calculates the sample weighted variance (a.k.a. 'reliability weights'). This is described here: https://en.wikipedia.org/wiki/Weighted_arithmetic_mean#Reliability_weights_2. It is also used in SDMTools; see also: https://www.gnu.org/software/gsl/doc/html/statistics.html#weighted-samples

Uses weighted_mean() and _prepare_weighted_stat_args().

Parameters:
  • v (Union[ List, pd.Series, pd.DataFrame, np.matrix, ]) – values to get the weighted variance for. None values are treated like 0. If v is a DataFrame, then the weighted variance of the values from each column will be returned, all using the same set of weights from w.

  • w (Union[ List, pd.Series], optional) – weights. Defaults to None. If the weights contain a None value, that value is ignored in the calculation.

  • inf_rm (bool, optional) – whether to remove infinite values (from weights or values) from the computation. Defaults to False.

Returns:

The weighted variance. If v is a DataFrame with several columns, then the pd.Series will have a value for the weighted variance of each of the columns. If inf_rm=False then:

If v has Inf then the output will be Inf. If w has Inf then the output will be np.nan.

Return type:

pd.Series[np.float64]
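The reliability-weights estimator referenced above divides the weighted sum of squared deviations by V1 − V2 / V1, where V1 = sum(w) and V2 = sum(w^2). The following NumPy sketch applies that formula; it is an illustration under that assumption, not the library's code, and `weighted_var_sketch` is a hypothetical name. Its square root reproduces the weighted_sd values shown in the descriptive_stats examples (about 1.290994 unweighted and 1.195229 with w = (1, 2, 3, 4)).

```python
import numpy as np


def weighted_var_sketch(v, w=None):
    """Hypothetical helper: reliability-weights sample variance."""
    v = np.asarray(v, dtype=float)
    # Equal weights of 1.0 when no weights are supplied.
    w = np.ones_like(v) if w is None else np.asarray(w, dtype=float)
    v1 = np.sum(w)
    v2 = np.sum(w**2)
    weighted_mean = np.sum(w * v) / v1
    # With equal weights the denominator reduces to n - 1.
    return np.sum(w * (v - weighted_mean) ** 2) / (v1 - v2 / v1)


sd_unweighted = np.sqrt(weighted_var_sketch([1, 2, 3, 4]))
sd_weighted = np.sqrt(weighted_var_sketch([1, 2, 3, 4], [1, 2, 3, 4]))
print(round(sd_unweighted, 6), round(sd_weighted, 6))  # 1.290994 1.195229
```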