balance.stats_and_plots.weighted_comparisons_stats

balance.stats_and_plots.weighted_comparisons_stats.asmd(sample_df: DataFrame, target_df: DataFrame, sample_weights: List | Series | ndarray | None = None, target_weights: List | Series | ndarray | None = None, std_type: Literal['target', 'sample', 'pooled'] = 'target', aggregate_by_main_covar: bool = False) Series[source]

Calculate the Absolute Standardized Mean Deviation (ASMD) between the columns of two DataFrames (or BalanceDFs). It uses weighted average and std for the calculations. This is the same as taking the absolute value of Cohen’s d statistic, with a specific choice of the standard deviation (std).

https://en.wikipedia.org/wiki/Effect_size#Cohen’s_d

As opposed to Cohen’s d, the current asmd implementation offers several options for the std calculation; see the arguments section for details.

Unlike the R package {cobalt}, in the current implementation of asmd:
  • the absolute value is taken
  • un-represented levels of categorical variables are treated as missing, not as 0
  • differences for categorical variables are also weighted by default

The function omits columns whose names start with “_is_na_”.

If the column names of sample_df and target_df differ, asmd is calculated only for the overlapping columns; the rest will be np.nan. The mean(asmd) is calculated while treating the nan values as 0s.
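For illustration, a minimal sketch of what this implies for partially overlapping columns (the data and column names here are made up, and the exact output is not shown):

import pandas as pd
from balance.stats_and_plots import weighted_comparisons_stats

# "a" appears in both frames; "b" only in the sample and "c" only in the target.
sample_df = pd.DataFrame({"a": (1, 2), "b": (3, 4)})
target_df = pd.DataFrame({"a": (3, 4), "c": (5, 6)})

# With no weights passed, arrays of 1s are used for both sides.
r = weighted_comparisons_stats.asmd(sample_df, target_df)
print(r)
# Per the description above: "a" gets a numeric ASMD, the non-overlapping
# columns are np.nan, and 'mean(asmd)' treats those nans as 0s.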

Parameters:
  • sample_df (pd.DataFrame) – source group of the asmd comparison

  • target_df (pd.DataFrame) – target group of the asmd comparison. The column names should be the same as the ones from sample_df.

  • sample_weights (Union[ List, pd.Series, np.ndarray, ], optional) – weights for sample_df. Defaults to None; if no weights are passed (None), an array of 1s is used.

  • target_weights (Union[ List, pd.Series, np.ndarray, ], optional) – weights for target_df. Defaults to None; if no weights are passed (None), an array of 1s is used.

  • std_type (Literal["target", "sample", "pooled"], optional) – How the standard deviation should be calculated. The options are “target”, “sample” and “pooled”; defaults to “target”. “target” uses the std of the target population, “sample” uses the std of the sample population, and “pooled” takes the simple arithmetic average of the variances of the sample and target populations and then the square root of that value to get the pooled std. Notice that this gives the same weight to both sources, i.e. there is NO over/under-weighting of the sample or target based on their respective sample sizes (in contrast to how the pooled sd is calculated in Cohen’s d statistic). See the sketch after this parameter list.

  • aggregate_by_main_covar (bool) – Whether to use _aggregate_asmd_by_main_covar() to aggregate (average) the asmd values based on the main covariate name. Default is False.
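To make the std_type options concrete, here is a minimal sketch of the formula described above (not the library’s internal code). It uses a plain weighted variance with no small-sample (ddof) correction, which the library may apply, so results can differ slightly from asmd() itself:

import numpy as np

def weighted_mean(x, w):
    return np.average(x, weights=w)

def weighted_var(x, w):
    # Weighted variance around the weighted mean (no ddof correction).
    return np.average((np.asarray(x) - weighted_mean(x, w)) ** 2, weights=w)

def asmd_sketch(x_sample, w_sample, x_target, w_target, std_type="target"):
    if std_type == "target":
        std = np.sqrt(weighted_var(x_target, w_target))
    elif std_type == "sample":
        std = np.sqrt(weighted_var(x_sample, w_sample))
    else:  # "pooled": simple average of the two variances (equal weight to both)
        std = np.sqrt(
            (weighted_var(x_sample, w_sample) + weighted_var(x_target, w_target)) / 2
        )
    diff = weighted_mean(x_sample, w_sample) - weighted_mean(x_target, w_target)
    return np.abs(diff) / std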

Returns:

a Series indexed by the names of the columns in the input DataFrames. The values (np.float64) are the calculated ASMD values. The last element is ‘mean(asmd)’, which is the average of the calculated ASMD values.

Return type:

pd.Series

Examples

import numpy as np
import pandas as pd
from balance.stats_and_plots import weighted_comparisons_stats

a1 = pd.Series((1, 2))
b1 = pd.Series((-1, 1))
a2 = pd.Series((3, 4))
b2 = pd.Series((-2, 2))
w1 = pd.Series((1, 1))
w2 = w1

r = weighted_comparisons_stats.asmd(
    pd.DataFrame({"a": a1, "b": b1}),
    pd.DataFrame({"a": a2, "b": b2}),
    w1,
    w2,
).to_dict()

exp_a = np.abs(a1.mean() - a2.mean()) / a2.std()
exp_b = np.abs(b1.mean() - b2.mean()) / b2.std()
print(r)
print(exp_a)
print(exp_b)

# output:
# {'a': 2.82842712474619, 'b': 0.0, 'mean(asmd)': 1.414213562373095}
# 2.82842712474619
# 0.0



a1 = pd.Series((1, 2))
b1_A = pd.Series((1, 3))
b1_B = pd.Series((-1, -3))
a2 = pd.Series((3, 4))
b2_A = pd.Series((2, 3))
b2_B = pd.Series((-2, -3))
w1 = pd.Series((1, 1))
w2 = w1

r = weighted_comparisons_stats.asmd(
    pd.DataFrame({"a": a1, "b[A]": b1_A, "b[B]": b1_B}),
    pd.DataFrame({"a": a2, "b[A]": b2_A, "b[B]": b2_B}),
    w1,
    w2,
).to_dict()

print(r)
# {'a': 2.82842712474619, 'b[A]': 0.7071067811865475, 'b[B]': 0.7071067811865475, 'mean(asmd)': 1.7677669529663689}

# Check that using aggregate_by_main_covar works
r = weighted_comparisons_stats.asmd(
    pd.DataFrame({"a": a1, "b[A]": b1_A, "b[B]": b1_B}),
    pd.DataFrame({"a": a2, "b[A]": b2_A, "b[B]": b2_B}),
    w1,
    w2,
    "target",
    True,
).to_dict()

print(r)
# {'a': 2.82842712474619, 'b': 0.7071067811865475, 'mean(asmd)': 1.7677669529663689}
balance.stats_and_plots.weighted_comparisons_stats.asmd_improvement(sample_before: DataFrame, sample_after: DataFrame, target: DataFrame, sample_before_weights: List | Series | ndarray | None = None, sample_after_weights: List | Series | ndarray | None = None, target_weights: List | Series | ndarray | None = None) float64[source]

Calculates the improvement in mean(asmd) from before to after applying some weight adjustment.

Parameters:
  • sample_before (pd.DataFrame) – DataFrame of the sample before adjustments.

  • sample_after (pd.DataFrame) – DataFrame of the sample after adjustment (typically identical to sample_before, but it could also be used to compare two different populations).

  • target (pd.DataFrame) – DataFrame of the target population.

  • sample_before_weights (Union[ List, pd.Series, np.ndarray, ], optional) – Weights before adjustments (i.e.: design weights). Defaults to None.

  • sample_after_weights (Union[ List, pd.Series, np.ndarray, ], optional) – Weights after some adjustment. Defaults to None.

  • target_weights (Union[ List, pd.Series, np.ndarray, ], optional) – Design weights of the target population. Defaults to None.

Returns:

The improvement, calculated as (before_mean_asmd - after_mean_asmd) / before_mean_asmd, where each mean asmd is calculated using asmd().

Return type:

np.float64
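A minimal usage sketch (the data and weights are made up; the by-hand check simply follows the documented formula and assumes asmd()’s default std_type):

import pandas as pd
from balance.stats_and_plots import weighted_comparisons_stats

sample = pd.DataFrame({"a": (1, 2), "b": (-1, 1)})
target = pd.DataFrame({"a": (3, 4), "b": (-2, 2)})
w_before = pd.Series((1, 1))  # e.g. design weights
w_after = pd.Series((1, 3))   # e.g. weights after some adjustment

r = weighted_comparisons_stats.asmd_improvement(
    sample, sample, target, w_before, w_after
)

# The same quantity computed by hand from asmd():
before = weighted_comparisons_stats.asmd(sample, target, w_before).loc["mean(asmd)"]
after = weighted_comparisons_stats.asmd(sample, target, w_after).loc["mean(asmd)"]
print(r, (before - after) / before)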

balance.stats_and_plots.weighted_comparisons_stats.outcome_variance_ratio(df_numerator: DataFrame, df_denominator: DataFrame, w_numerator: List | Series | ndarray | None = None, w_denominator: List | Series | ndarray | None = None) Series[source]

Calculate the ratio of weighted variances between two DataFrames.

This directly calculates the empirical ratio of the variances of the outcomes before and after weighting. Notice that this is different from the design effect (Deff): the Deff estimates the ratio of the variances of the weighted means, while this function calculates the ratio of the empirical weighted variances of the data.

Parameters:
  • df_numerator (pd.DataFrame) – DataFrame whose weighted variances form the numerator of the ratio.

  • df_denominator (pd.DataFrame) – DataFrame whose weighted variances form the denominator of the ratio.

  • w_numerator (Union[ List, pd.Series, np.ndarray, None, ], optional) – weights for df_numerator. Defaults to None.

  • w_denominator (Union[ List, pd.Series, np.ndarray, None, ], optional) – weights for df_denominator. Defaults to None.

Returns:

A Series (with np.float64 values) of the calculated ratio of variances, one for each outcome column.

Return type:

pd.Series
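
A minimal usage sketch (the outcome values, weights, and column name are made up for illustration):

import pandas as pd
from balance.stats_and_plots import weighted_comparisons_stats

# One outcome column "y", e.g. the same outcome before and after weighting.
df_num = pd.DataFrame({"y": (1, 2, 3, 4)})
df_den = pd.DataFrame({"y": (1, 1, 4, 4)})
w_num = pd.Series((1, 1, 1, 1))
w_den = pd.Series((1, 2, 2, 1))

r = weighted_comparisons_stats.outcome_variance_ratio(df_num, df_den, w_num, w_den)
print(r)  # a pd.Series with one entry per outcome column (here only "y")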