balance.balancedf_class

class balance.balancedf_class.BalanceCovarsDF(sample: Sample)[source]
from_frame(df: DataFrame, weights=typing.Optional[pandas.core.series.Series]) BalanceCovarsDF[source]

A factory function that creates a BalanceCovarsDF from a DataFrame.

Note that the object is usually created through the __init__ method instead.

Parameters:
  • self (BalanceCovarsDF) – Object

  • df (pd.DataFrame) – A df.

  • weights (Optional[pd.Series], optional) – Weights for the rows of df. Defaults to None.

Returns:

Object.

Return type:

BalanceCovarsDF

class balance.balancedf_class.BalanceDF(df: DataFrame, sample: Sample, name: Literal['outcomes', 'weights', 'covars'])[source]

Wrapper class around a Sample that provides additional balance-specific functionality.

asmd(on_linked_samples: bool = True, target: BalanceDF | None = None, aggregate_by_main_covar: bool = False, **kwargs) DataFrame[source]

ASMD is the absolute difference between the means of two groups (say, P and T), divided by a standard deviation (std). The std can be that of P, of T, or of both P and T. These are all variations on the absolute value of Cohen’s d (see: https://en.wikipedia.org/wiki/Effect_size#Cohen’s_d).

We can use asmd to compare multiple Samples (with and without adjustment) to a target population.

Parameters:
  • self (BalanceDF) – Object from sample (with/without adjustment, but it needs some target)

  • on_linked_samples (bool, optional) – Whether to also compute the asmd for the linked sample objects (specifically: unadjusted). Defaults to True.

  • target (Optional["BalanceDF"], optional) – A BalanceDF (of the same type as the one used in self) to compare against. If None then it looks for a target in the self linked objects. Defaults to None.

  • aggregate_by_main_covar (bool, optional) – If True, the columns created by one-hot encoding a categorical variable are averaged, so that the returned asmd DataFrame has a single column per original covariate. See _aggregate_asmd_by_main_covar() for more details. Defaults to False.

Raises:

ValueError – If self has no target and no target is supplied.

Returns:

If on_linked_samples is False, then only one row (index name depends on BalanceDF type, e.g.: covars), with the asmd of self vs. the target. If on_linked_samples is True, then one row per source (self, unadjusted), each with the asmd compared to the target, and a third row for the difference (unadjusted - self).

Return type:

pd.DataFrame

Examples

import pandas as pd
from balance.sample_class import Sample

from copy import deepcopy


s1 = Sample.from_frame(
    pd.DataFrame(
        {
            "a": (1, 2, 3, 1),
            "b": (-42, 8, 2, -42),
            "o": (7, 8, 9, 10),
            "c": ("x", "y", "z", "v"),
            "id": (1, 2, 3, 4),
            "w": (0.5, 2, 1, 1),
        }
    ),
    id_column="id",
    weight_column="w",
    outcome_columns="o",
)

s2 = Sample.from_frame(
    pd.DataFrame(
        {
            "a": (1, 2, 3),
            "b": (4, 6, 8),
            "id": (1, 2, 3),
            "w": (0.5, 1, 2),
            "c": ("x", "y", "z"),
        }
    ),
    id_column="id",
    weight_column="w",
)

s3 = s1.set_target(s2)
s3_null = s3.adjust(method="null")

s3_null_madeup_weights = deepcopy(s3_null)
s3_null_madeup_weights.set_weights((1, 2, 3, 1))

print(s3_null.covars().asmd().round(3))
    #                     a      b  c[v]   c[x]   c[y]   c[z]  mean(asmd)
    # source
    # self               0.56  8.747   NaN  0.069  0.266  0.533       3.175
    # unadjusted         0.56  8.747   NaN  0.069  0.266  0.533       3.175
    # unadjusted - self  0.00  0.000   NaN  0.000  0.000  0.000       0.000

# show that on_linked_samples = False works:
print(s3_null.covars().asmd(on_linked_samples = False).round(3))
    #            a      b  c[v]   c[x]   c[y]   c[z]  mean(asmd)
    # index
    # covars  0.56  8.747   NaN  0.069  0.266  0.533       3.175

# verify this also works when we have some weights
print(s3_null_madeup_weights.covars().asmd())
    #                           a         b  c[v]  ...      c[y]      c[z]  mean(asmd)
    # source                                       ...
    # self               0.296500  8.153742   NaN  ...  0.000000  0.218218    2.834932
    # unadjusted         0.560055  8.746742   NaN  ...  0.265606  0.533422    3.174566
    # unadjusted - self  0.263555  0.592999   NaN  ...  0.265606  0.315204    0.339634
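As a rough check on the definition, the `a` entry of the `self` row can be reproduced by hand. This is only a sketch, not the library's implementation: it assumes the std in the denominator is the target's weighted std, and it plugs in the weighted means and std printed in the mean() and std() examples for the same data. The helper name `asmd_from_summary` is made up for illustration.

```python
def asmd_from_summary(sample_mean: float, target_mean: float, std: float) -> float:
    """Absolute standardized mean difference from precomputed summaries."""
    return abs(sample_mean - target_mean) / std


# Weighted means of "a" for self and target, and the target's weighted std,
# copied from the mean() and std() example outputs (assumed denominator: target std).
asmd_a = asmd_from_summary(1.888889, 2.428571, 0.963624)
print(round(asmd_a, 3))  # 0.56
```

Under that assumption the result agrees with the 0.56 shown for `a` in the `self` row.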
asmd_improvement(unadjusted: BalanceDF | None = None, target: BalanceDF | None = None) float64[source]

Calculates the improvement in mean(asmd) from before to after applying some weight adjustment.

See weighted_comparisons_stats.asmd_improvement() for details.

Parameters:
  • self (BalanceDF) – BalanceDF (e.g.: of self after adjustment)

  • unadjusted (Optional["BalanceDF"], optional) – BalanceDF (e.g.: of self before adjustment). Defaults to None.

  • target (Optional["BalanceDF"], optional) – To compare against. Defaults to None.

Raises:
  • ValueError – If target is not linked in self and also not provided to the function.

  • ValueError – If unadjusted is not linked in self and also not provided to the function.

Returns:

The improvement, calculated as (before_mean_asmd - after_mean_asmd) / before_mean_asmd.

The asmd is calculated using asmd().

Return type:

np.float64

Examples

import pandas as pd
from balance.sample_class import Sample

from copy import deepcopy


s1 = Sample.from_frame(
    pd.DataFrame(
        {
            "a": (1, 2, 3, 1),
            "b": (-42, 8, 2, -42),
            "o": (7, 8, 9, 10),
            "c": ("x", "y", "z", "v"),
            "id": (1, 2, 3, 4),
            "w": (0.5, 2, 1, 1),
        }
    ),
    id_column="id",
    weight_column="w",
    outcome_columns="o",
)

s2 = Sample.from_frame(
    pd.DataFrame(
        {
            "a": (1, 2, 3),
            "b": (4, 6, 8),
            "id": (1, 2, 3),
            "w": (0.5, 1, 2),
            "c": ("x", "y", "z"),
        }
    ),
    id_column="id",
    weight_column="w",
)

s3 = s1.set_target(s2)
s3_null = s3.adjust(method="null")

s3_null_madeup_weights = deepcopy(s3_null)
s3_null_madeup_weights.set_weights((1, 2, 3, 1))

s3_null.covars().asmd_improvement() # 0. since unadjusted is just a copy of self
s3_null_madeup_weights.covars().asmd_improvement() # 0.10698596233975825

asmd_df = s3_null_madeup_weights.covars().asmd()
print(asmd_df["mean(asmd)"])
    # source
    # self                 2.834932
    # unadjusted           3.174566
    # unadjusted - self    0.339634
    # Name: mean(asmd), dtype: float64
(asmd_df["mean(asmd)"].iloc[1] - asmd_df["mean(asmd)"].iloc[0]) / asmd_df["mean(asmd)"].iloc[1]  # 0.10698596233975825
# just like asmd_improvement
ci_of_mean(on_linked_samples: bool = True, **kwargs) DataFrame[source]

Calculates confidence intervals of the weighted mean on the df of the BalanceDF object.

Parameters:
  • self (BalanceDF) – Object.

  • on_linked_samples (bool, optional) – Should the calculation be on self AND the linked samples objects? Defaults to True. If True, then uses _call_on_linked() with method “ci_of_mean”. If False, then uses _descriptive_stats() with method “ci_of_mean”.

  • kwargs – Additional ci_of_mean arguments, e.g.: conf_level and round_ndigits.

Returns:

A DataFrame with one row per object: only self if on_linked_samples=False, or self and the linked objects (e.g.: target and unadjusted) if True. There is one column for each of the columns in the relevant df (after applying model_matrix()).

Return type:

pd.DataFrame

Examples

import pandas as pd
from balance.sample_class import Sample
from balance.stats_and_plots.weighted_stats import ci_of_weighted_mean

ci_of_weighted_mean(pd.Series((1, 2, 3, 1)), pd.Series((0.5, 2, 1, 1)), round_ndigits = 3)
    # 0    (1.232, 2.545)
    # dtype: object
    # This shows we got the first cell of 'a' as expected.

s1 = Sample.from_frame(
    pd.DataFrame(
        {
            "a": (1, 2, 3, 1),
            "b": (-42, 8, 2, -42),
            "o": (7, 8, 9, 10),
            "c": ("x", "y", "z", "v"),
            "id": (1, 2, 3, 4),
            "w": (0.5, 2, 1, 1),
        }
    ),
    id_column="id",
    weight_column="w",
    outcome_columns="o",
)

s2 = Sample.from_frame(
    pd.DataFrame(
        {
            "a": (1, 2, 3),
            "b": (4, 6, 8),
            "id": (1, 2, 3),
            "w": (0.5, 1, 2),
            "c": ("x", "y", "z"),
        }
    ),
    id_column="id",
    weight_column="w",
)

s3 = s1.set_target(s2)
s3_null = s3.adjust(method="null")

print(s3_null.covars().ci_of_mean(round_ndigits = 3).T)
    # source               self           target         unadjusted
    # a          (1.232, 2.545)   (1.637, 3.221)     (1.232, 2.545)
    # b       (-32.715, 12.715)   (5.273, 8.441)  (-32.715, 12.715)
    # c[v]      (-0.183, 0.627)              NaN    (-0.183, 0.627)
    # c[x]      (-0.116, 0.338)  (-0.156, 0.442)    (-0.116, 0.338)
    # c[y]       (-0.12, 1.009)  (-0.233, 0.804)     (-0.12, 1.009)
    # c[z]      (-0.183, 0.627)   (-0.027, 1.17)    (-0.183, 0.627)

s3_2 = s1.set_target(s2)
s3_null_2 = s3_2.adjust(method="null")
print(s3_null_2.outcomes().ci_of_mean(round_ndigits = 3))
    #                         o
    # source
    # self        (7.671, 9.44)
    # unadjusted  (7.671, 9.44)
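The interval for `a` can be rebuilt from its pieces. The sketch below assumes a normal-approximation interval, mean ± z * sqrt(variance of the mean), at a 95% confidence level; the mean and variance are the values printed in the mean() and var_of_mean() examples for the same sample. `normal_ci` is a hypothetical helper, not part of balance.

```python
from math import sqrt
from statistics import NormalDist


def normal_ci(mean: float, var_of_mean: float, conf_level: float = 0.95):
    """Normal-approximation confidence interval around a mean."""
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)  # about 1.96 for 95%
    half_width = z * sqrt(var_of_mean)
    return (mean - half_width, mean + half_width)


# Weighted mean and variance of the weighted mean for column "a" of self.
low, high = normal_ci(1.888889, 0.112178)
print(round(low, 3), round(high, 3))  # 1.232 2.545
```

Under these assumptions the result agrees with the (1.232, 2.545) reported for `a`.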

property df: DataFrame

Get the df of the BalanceDF object.

The df is stored in the BalanceDF.__df object, that is set during the __init__ of the object.

Parameters:

self (BalanceDF) – The object.

Returns:

The df (this is __df, with no weights) from the BalanceDF object.

Return type:

pd.DataFrame

mean(on_linked_samples: bool = True, **kwargs) DataFrame[source]

Calculates a weighted mean on the df of the BalanceDF object.

Parameters:
  • self (BalanceDF) – Object.

  • on_linked_samples (bool, optional) – Should the calculation be on self AND the linked samples objects? Defaults to True. If True, then uses _call_on_linked() with method “mean”. If False, then uses _descriptive_stats() with method “mean”.

Returns:

A DataFrame with one row per object: only self if on_linked_samples=False, or self and the linked objects (e.g.: target and unadjusted) if True. There is one column for each of the columns in the relevant df (after applying model_matrix()).

Return type:

pd.DataFrame

Examples

import pandas as pd
from balance.sample_class import Sample

s1 = Sample.from_frame(
    pd.DataFrame(
        {
            "a": (1, 2, 3, 1),
            "b": (-42, 8, 2, -42),
            "o": (7, 8, 9, 10),
            "c": ("x", "y", "z", "v"),
            "id": (1, 2, 3, 4),
            "w": (0.5, 2, 1, 1),
        }
    ),
    id_column="id",
    weight_column="w",
    outcome_columns="o",
)

s2 = Sample.from_frame(
    pd.DataFrame(
        {
            "a": (1, 2, 3),
            "b": (4, 6, 8),
            "id": (1, 2, 3),
            "w": (0.5, 1, 2),
            "c": ("x", "y", "z"),
        }
    ),
    id_column="id",
    weight_column="w",
)

s3 = s1.set_target(s2)
s3_null = s3.adjust(method="null")

print(s3_null.covars().mean())

#                 a          b      c[v]      c[x]      c[y]      c[z]
# source
# self        1.888889 -10.000000  0.222222  0.111111  0.444444  0.222222
# target      2.428571   6.857143       NaN  0.142857  0.285714  0.571429
# unadjusted  1.888889 -10.000000  0.222222  0.111111  0.444444  0.222222
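The `self` row can be verified with a few lines of plain Python. This is a minimal sketch of a weighted mean, sum(w * x) / sum(w), and does not use balance; `weighted_mean` is an illustrative name.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum(w * x) / sum(w)."""
    total_weight = sum(weights)
    return sum(w * x for x, w in zip(values, weights)) / total_weight


w = (0.5, 2, 1, 1)
print(round(weighted_mean((1, 2, 3, 1), w), 6))      # 1.888889
print(round(weighted_mean((-42, 8, 2, -42), w), 6))  # -10.0
```

The same calculation on a one-hot column gives the share of weight in that category, e.g. 2 / 4.5 for c[y].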
mean_with_ci(round_ndigits: int = 3, on_linked_samples: bool = True) DataFrame[source]

Returns a table with means and confidence intervals (CIs) for all elements in the BalanceDF object.

This method calculates the mean and CI for each column of the BalanceDF object using the BalanceDF.mean() and BalanceDF.ci_of_mean() methods, respectively. The resulting table contains (for each element such as self, target and unadjusted) two columns for each input column: one for the mean and one for the CI.

Parameters:
  • self (BalanceDF) – The BalanceDF object.

  • round_ndigits (int, optional) – The number of decimal places to round the mean and CI to. Defaults to 3.

  • on_linked_samples (bool, optional) – A boolean indicating whether to include linked samples when calculating the mean. Defaults to True.

Returns:

A table with one row for each input column and, for each source (e.g.: self and target), one column for the mean and one for the CI.

Return type:

pd.DataFrame

Examples

import numpy as np
import pandas as pd

from balance.sample_class import Sample

s_o = Sample.from_frame(
    pd.DataFrame({"o1": (7, 8, 9, 10), "o2": (7, 8, 9, np.nan), "id": (1, 2, 3, 4)}),
    id_column="id",
    outcome_columns=("o1", "o2"),
)

t_o = Sample.from_frame(
    pd.DataFrame(
        {
            "o1": (7, 8, 9, 10, 11, 12, 13, 14),
            "o2": (7, 8, 9, np.nan, np.nan, 12, 13, 14),
            "id": (1, 2, 3, 4, 5, 6, 7, 8),
        }
    ),
    id_column="id",
    outcome_columns=("o1", "o2"),
)
s_o2 = s_o.set_target(t_o)

print(s_o2.outcomes().mean_with_ci())
    # source            self  target             self           target
    # _is_na_o2[False]  0.75   0.750   (0.326, 1.174)     (0.45, 1.05)
    # _is_na_o2[True]   0.25   0.250  (-0.174, 0.674)    (-0.05, 0.55)
    # o1                8.50  10.500   (7.404, 9.596)  (8.912, 12.088)
    # o2                6.00   7.875   (2.535, 9.465)  (4.351, 11.399)

model_matrix() DataFrame[source]

Return a model_matrix version of the df inside the BalanceDF object using balance_util.model_matrix

This can be used to turn all character columns into one-hot encoded columns.

Parameters:

self (BalanceDF) – Object

Returns:

The output from balance_util.model_matrix()

Return type:

pd.DataFrame

Examples

import pandas as pd
from balance.sample_class import Sample

s1 = Sample.from_frame(
    pd.DataFrame(
        {
            "a": (1, 2, 3, 1),
            "b": (-42, 8, 2, -42),
            "o": (7, 8, 9, 10),
            "c": ("x", "y", "z", "v"),
            "id": (1, 2, 3, 4),
            "w": (0.5, 2, 1, 1),
        }
    ),
    id_column="id",
    weight_column="w",
    outcome_columns="o",
)

print(s1.covars().df)
    #    a   b  c
    # 0  1 -42  x
    # 1  2   8  y
    # 2  3   2  z
    # 3  1 -42  v

print(s1.covars().model_matrix())
    #      a     b  c[v]  c[x]  c[y]  c[z]
    # 0  1.0 -42.0   0.0   1.0   0.0   0.0
    # 1  2.0   8.0   0.0   0.0   1.0   0.0
    # 2  3.0   2.0   0.0   0.0   0.0   1.0
    # 3  1.0 -42.0   1.0   0.0   0.0   0.0
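To make the one-hot step concrete, here is a small plain-Python sketch of how a character column such as `c` expands into one indicator column per level. It only illustrates the idea; the actual work is done by balance_util.model_matrix, and `one_hot` is a made-up helper.

```python
def one_hot(name, values):
    """Return {"name[level]": [0.0 or 1.0, ...]} with one indicator column per level."""
    levels = sorted(set(values))
    return {
        f"{name}[{level}]": [1.0 if v == level else 0.0 for v in values]
        for level in levels
    }


encoded = one_hot("c", ("x", "y", "z", "v"))
for column_name, column in encoded.items():
    print(column_name, column)
# c[v] [0.0, 0.0, 0.0, 1.0]
# c[x] [1.0, 0.0, 0.0, 0.0]
# c[y] [0.0, 1.0, 0.0, 0.0]
# c[z] [0.0, 0.0, 1.0, 0.0]
```

Each printed list corresponds to one of the c[...] columns in the model_matrix() output.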
names() List[source]

Returns the column names of the DataFrame (df) inside a BalanceDF object.

Parameters:

self (BalanceDF) – The object.

Returns:

A list of column names.

Return type:

List

Examples

import pandas as pd
from balance.sample_class import Sample

s1 = Sample.from_frame(
    pd.DataFrame(
        {
            "a": (1, 2, 3, 1),
            "b": (-42, 8, 2, -42),
            "o": (7, 8, 9, 10),
            "c": ("x", "y", "z", "v"),
            "id": (1, 2, 3, 4),
            "w": (0.5, 2, 1, 1),
        }
    ),
    id_column="id",
    weight_column="w",
    outcome_columns="o",
)

s1.covars().names()
# ['a', 'b', 'c']
s1.weights().names()
# ['w']
s1.outcomes().names()
# ['o']
plot(on_linked_samples: bool = True, **kwargs) List | ndarray[Any, dtype[ScalarType]] | Dict[str, Figure] | None[source]

Plots the variables in the df of the BalanceDF object.

See weighted_comparisons_plots.plot_dist() for details of various arguments that can be passed. The default plotting engine is plotly, but seaborn can be used for static plots.

This function is inherited as is when invoking BalanceCovarsDF.plot, but some modifications are made when preparing the data for BalanceOutcomesDF.plot and BalanceWeightsDF.plot.

Parameters:
  • self (BalanceDF) – Object (used in the plots as “sample” or “self”)

  • on_linked_samples (bool, optional) – Determines if the linked samples should be included in the plot. Defaults to True.

  • **kwargs – passed to weighted_comparisons_plots.plot_dist().

Returns:

If library=”plotly”, returns a dictionary containing plots if return_dict_of_figures is True, and None otherwise. If library=”seaborn”, returns None unless return_axes is True, in which case it returns either a list or an np.array of matplotlib axes.

Return type:

Union[Union[List, np.ndarray], Dict[str, go.Figure], None]

Examples

import numpy as np
import pandas as pd
from numpy import random
from balance.sample_class import Sample

random.seed(96483)

df = pd.DataFrame({
    "id": range(100),
    'v1': random.randint(11111, 11115, size=100).astype(str),  # upper bound is exclusive
    'v2': random.normal(size = 100),
    'v3': random.uniform(size = 100),
    "w": pd.Series(np.ones(99).tolist() + [1000]),
}).sort_values(by=['v2'])

s1 = Sample.from_frame(df,
    id_column="id",
    weight_column="w",
)

s2 = Sample.from_frame(
    df.assign(w = pd.Series(np.ones(100))),
    id_column="id",
    weight_column="w",
)

s3 = s1.set_target(s2)
s3_null = s3.adjust(method="null")
s3_null.set_weights(random.random(size = 100) + 0.5)

s3_null.covars().plot()
s3_null.covars().plot(library = "seaborn")

# Controlling the limits of the y axis using ylim:
s3_null.covars().plot(ylim = (0,1))
s3_null.covars().plot(library = "seaborn",ylim = (0,1), dist_type = "hist")

# Returning plotly qq plots:
s3_null.covars().plot(dist_type = "qq")
std(on_linked_samples: bool = True, **kwargs) DataFrame[source]

Calculates a weighted std on the df of the BalanceDF object.

Parameters:
  • self (BalanceDF) – Object.

  • on_linked_samples (bool, optional) – Should the calculation be on self AND the linked samples objects? Defaults to True. If True, then uses _call_on_linked() with method “std”. If False, then uses _descriptive_stats() with method “std”.

Returns:

A DataFrame with one row per object: only self if on_linked_samples=False, or self and the linked objects (e.g.: target and unadjusted) if True. There is one column for each of the columns in the relevant df (after applying model_matrix()).

Return type:

pd.DataFrame

Examples

import pandas as pd
from balance.sample_class import Sample

s1 = Sample.from_frame(
    pd.DataFrame(
        {
            "a": (1, 2, 3, 1),
            "b": (-42, 8, 2, -42),
            "o": (7, 8, 9, 10),
            "c": ("x", "y", "z", "v"),
            "id": (1, 2, 3, 4),
            "w": (0.5, 2, 1, 1),
        }
    ),
    id_column="id",
    weight_column="w",
    outcome_columns="o",
)

s2 = Sample.from_frame(
    pd.DataFrame(
        {
            "a": (1, 2, 3),
            "b": (4, 6, 8),
            "id": (1, 2, 3),
            "w": (0.5, 1, 2),
            "c": ("x", "y", "z"),
        }
    ),
    id_column="id",
    weight_column="w",
)

s3 = s1.set_target(s2)
s3_null = s3.adjust(method="null")

print(s3_null.covars().std())

    #                 a          b  c[v]      c[x]      c[y]      c[z]
    # source
    # self        0.886405  27.354812   0.5  0.377964  0.597614  0.500000
    # target      0.963624   1.927248   NaN  0.462910  0.597614  0.654654
    # unadjusted  0.886405  27.354812   0.5  0.377964  0.597614  0.500000
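The numbers above are consistent with a weighted variance that uses the reliability-weights correction V1 / (V1^2 - V2), where V1 = sum(w) and V2 = sum(w^2). That correction is an assumption inferred from the example output rather than taken from the library source; the sketch below (with the illustrative name `weighted_std`) applies it in plain Python.

```python
from math import sqrt


def weighted_std(values, weights):
    """Weighted std using the correction factor V1 / (V1**2 - V2) (assumed)."""
    v1 = sum(weights)
    v2 = sum(w * w for w in weights)
    mean = sum(w * x for x, w in zip(values, weights)) / v1
    weighted_ss = sum(w * (x - mean) ** 2 for x, w in zip(values, weights))
    return sqrt(v1 / (v1 ** 2 - v2) * weighted_ss)


print(round(weighted_std((1, 2, 3, 1), (0.5, 2, 1, 1)), 6))  # 0.886405 (self, column a)
print(round(weighted_std((1, 2, 3), (0.5, 1, 2)), 6))        # 0.963624 (target, column a)
```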
summary(on_linked_samples: bool = True) DataFrame | str[source]

Returns a summary of the BalanceDF object.

This method currently calculates the mean and confidence interval (CI) for each column of the object using the BalanceDF.mean_with_ci() method. In the future, this method may be extended to include additional summary statistics.

Parameters:
  • self (BalanceDF) – The BalanceDF object.

  • on_linked_samples (bool, optional) – A boolean indicating whether to include linked samples when calculating the mean and CI. Defaults to True.

Returns:

A table with one row for each input column and, for each source (e.g.: self and target), one column for the mean and one for the CI.

Return type:

Union[pd.DataFrame, str]

to_csv(path_or_buf: str | Path | IO | None = None, *args, **kwargs) str | None[source]

Write df with ids from BalanceDF to a comma-separated values (csv) file.

Uses pd.DataFrame.to_csv().

If an ‘index’ argument is not provided then it defaults to False.

Parameters:
  • self (BalanceDF) – Object.

  • path_or_buf (Optional[FilePathOrBuffer], optional) – location where to save the csv.

Returns:

If path_or_buf is None, returns the resulting csv format as a string. Otherwise returns None.

Return type:

Optional[str]

to_download(tempdir: str | None = None) FileLink[source]

Creates a downloadable link of the DataFrame, with ids, of the BalanceDF object.

The file name starts with tmp_balance_out_, followed by a random string generated with uuid.uuid4().

Parameters:
  • self (BalanceDF) – Object.

  • tempdir (Optional[str], optional) – Defaults to None (which then uses a temporary folder using tempfile.gettempdir()).

Returns:

A FileLink object, which embeds a link to the local file (based on its path) in an IPython session.

Return type:

FileLink

var_of_mean(on_linked_samples: bool = True, **kwargs) DataFrame[source]

Calculates the variance of the weighted mean on the df of the BalanceDF object.

Parameters:
  • self (BalanceDF) – Object.

  • on_linked_samples (bool, optional) – Should the calculation be on self AND the linked samples objects? Defaults to True. If True, then uses _call_on_linked() with method “var_of_mean”. If False, then uses _descriptive_stats() with method “var_of_mean”.

Returns:

A DataFrame with one row per object: only self if on_linked_samples=False, or self and the linked objects (e.g.: target and unadjusted) if True. There is one column for each of the columns in the relevant df (after applying model_matrix()).

Return type:

pd.DataFrame

Examples

import pandas as pd
from balance.sample_class import Sample
from balance.stats_and_plots.weighted_stats import var_of_weighted_mean

var_of_weighted_mean(pd.Series((1, 2, 3, 1)), pd.Series((0.5, 2, 1, 1)))
    # 0    0.112178
    # dtype: float64

# This shows we got the first cell of 'a' as expected.

s1 = Sample.from_frame(
    pd.DataFrame(
        {
            "a": (1, 2, 3, 1),
            "b": (-42, 8, 2, -42),
            "o": (7, 8, 9, 10),
            "c": ("x", "y", "z", "v"),
            "id": (1, 2, 3, 4),
            "w": (0.5, 2, 1, 1),
        }
    ),
    id_column="id",
    weight_column="w",
    outcome_columns="o",
)

s2 = Sample.from_frame(
    pd.DataFrame(
        {
            "a": (1, 2, 3),
            "b": (4, 6, 8),
            "id": (1, 2, 3),
            "w": (0.5, 1, 2),
            "c": ("x", "y", "z"),
        }
    ),
    id_column="id",
    weight_column="w",
)

s3 = s1.set_target(s2)
s3_null = s3.adjust(method="null")

print(s3_null.covars().var_of_mean())
    #                    a           b      c[v]      c[x]      c[y]      c[z]
    # source
    # self        0.112178  134.320988  0.042676  0.013413  0.082914  0.042676
    # target      0.163265    0.653061       NaN  0.023324  0.069971  0.093294
    # unadjusted  0.112178  134.320988  0.042676  0.013413  0.082914  0.042676
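The `a` values are consistent with the linearization estimate sum(w_i^2 * (x_i - mean)^2) / (sum(w))^2. That formula is an assumption inferred from the printed numbers, not a statement about the library's internals; the sketch below uses the made-up name `var_of_mean_sketch` to keep it distinct from var_of_weighted_mean.

```python
def var_of_mean_sketch(values, weights):
    """Assumed estimator: sum((w * (x - mean))**2) / sum(w)**2."""
    total_weight = sum(weights)
    mean = sum(w * x for x, w in zip(values, weights)) / total_weight
    squared_terms = sum((w * (x - mean)) ** 2 for x, w in zip(values, weights))
    return squared_terms / total_weight ** 2


print(round(var_of_mean_sketch((1, 2, 3, 1), (0.5, 2, 1, 1)), 6))  # 0.112178 (self)
print(round(var_of_mean_sketch((1, 2, 3), (0.5, 1, 2)), 6))        # 0.163265 (target)
```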

class balance.balancedf_class.BalanceOutcomesDF(sample: Sample)[source]
relative_response_rates(target: bool | DataFrame = False, per_column: bool = False) DataFrame | None[source]

Produces a summary table of number of responses and proportion of completed responses.

See general_stats.relative_response_rates().

Parameters:
  • self (BalanceOutcomesDF) – Object

  • target (Union[bool, pd.DataFrame], optional) – Determines what is passed to df_target in general_stats.relative_response_rates(). Defaults to False.

    If False: passes None.

    If True: passes the df from the target of the sample (notice, it’s the df of the target, NOT target.outcomes().df). This means only rows that are notnull in all columns are counted (so if the target has covars and outcomes, both need to be notnull for a row to be counted). If you want to control this in a more specific way, pass a pd.DataFrame instead.

    If pd.DataFrame: passes it as is.

  • per_column (bool, optional) – Default is False. See general_stats.relative_response_rates().

Returns:

A column per outcome, and two rows.

One row has the number of non-null observations, and a second row has the proportion of non-null observations.

If ‘target’ is set to True but there is no target, the function returns None.

Return type:

Optional[pd.DataFrame]

Examples

import numpy as np
import pandas as pd

from balance.sample_class import Sample


s_o = Sample.from_frame(
    pd.DataFrame({"o1": (7, 8, 9, 10), "o2": (7, 8, 9, np.nan), "id": (1, 2, 3, 4)}),
    id_column="id",
    outcome_columns=("o1", "o2"),
)

print(s_o.outcomes().relative_response_rates())
    #       o1    o2
    # n    4.0   3.0
    # %  100.0  75.0

s_o.outcomes().relative_response_rates(target = True)
# None

# compared with a larger target

t_o = Sample.from_frame(
    pd.DataFrame(
        {
            "o1": (7, 8, 9, 10, 11, 12, 13, 14),
            "o2": (7, 8, 9, np.nan, np.nan, 12, 13, 14),
            "id": (1, 2, 3, 4, 5, 6, 7, 8),
        }
    ),
    id_column="id",
    outcome_columns=("o1", "o2"),
)
s_o2 = s_o.set_target(t_o)

print(s_o2.outcomes().relative_response_rates(True, per_column = True))
    #     o1    o2
    # n   4.0   3.0
    # %  50.0  50.0

df_target = pd.DataFrame(
        {
            "o1": (7, 8, 9, 10, 11, 12, 13, 14),
            "o2": (7, 8, 9, np.nan, np.nan, 12, 13, 14),
        }
    )

print(s_o2.outcomes().relative_response_rates(target = df_target, per_column = True))
    #     o1    o2
    # n   4.0   3.0
    # %  50.0  50.0
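The counts and percentages for `o2` can be reproduced by hand. The sketch below is plain Python and only mirrors the per-column arithmetic suggested by the outputs (non-null count, divided by either the sample size or the target's non-null count); `response_rate` is a hypothetical helper, not the general_stats function.

```python
from math import isnan


def response_rate(values, denominator=None):
    """Return (n, %) of non-null entries; % is relative to `denominator` if given."""
    n = sum(1 for v in values if not isnan(v))
    base = len(values) if denominator is None else denominator
    return n, 100.0 * n / base


nan = float("nan")
sample_o2 = (7, 8, 9, nan)
target_o2 = (7, 8, 9, nan, nan, 12, 13, 14)

print(response_rate(sample_o2))  # (3, 75.0)
target_n, _ = response_rate(target_o2)
print(response_rate(sample_o2, denominator=target_n))  # (3, 50.0)
```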
summary(on_linked_samples: bool | None = None) str[source]

Produces summary printable string of a BalanceOutcomesDF object.

Parameters:
  • self (BalanceOutcomesDF) – Object.

  • on_linked_samples (Optional[bool]) – Ignored. Only here since summary overrides BalanceDF.summary.

Returns:

A printable string, with mean of outcome variables and response rates.

Return type:

str

Examples

import numpy as np
import pandas as pd

from balance.sample_class import Sample

s_o = Sample.from_frame(
    pd.DataFrame({"o1": (7, 8, 9, 10), "o2": (7, 8, 9, np.nan), "id": (1, 2, 3, 4)}),
    id_column="id",
    outcome_columns=("o1", "o2"),
)

t_o = Sample.from_frame(
    pd.DataFrame(
        {
            "o1": (7, 8, 9, 10, 11, 12, 13, 14),
            "o2": (7, 8, 9, np.nan, np.nan, 12, 13, 14),
            "id": (1, 2, 3, 4, 5, 6, 7, 8),
        }
    ),
    id_column="id",
    outcome_columns=("o1", "o2"),
)
s_o2 = s_o.set_target(t_o)

print(s_o.outcomes().summary())
    # 2 outcomes: ['o1' 'o2']
    # Mean outcomes (with 95% confidence intervals):
    # source            self             self
    # _is_na_o2[False]  0.75   (0.326, 1.174)
    # _is_na_o2[True]   0.25  (-0.174, 0.674)
    # o1                8.50   (7.404, 9.596)
    # o2                6.00   (2.535, 9.465)

    # Response rates (relative to number of respondents in sample):
    #       o1    o2
    # n    4.0   3.0
    # %  100.0  75.0

print(s_o2.outcomes().summary())
    # 2 outcomes: ['o1' 'o2']
    # Mean outcomes (with 95% confidence intervals):
    # source            self  target             self           target
    # _is_na_o2[False]  0.75   0.750   (0.326, 1.174)     (0.45, 1.05)
    # _is_na_o2[True]   0.25   0.250  (-0.174, 0.674)    (-0.05, 0.55)
    # o1                8.50  10.500   (7.404, 9.596)  (8.912, 12.088)
    # o2                6.00   7.875   (2.535, 9.465)  (4.351, 11.399)

    # Response rates (relative to number of respondents in sample):
    #       o1    o2
    # n    4.0   3.0
    # %  100.0  75.0
    # Response rates (relative to notnull rows in the target):
    #            o1    o2
    # n   4.000000   3.0
    # %  66.666667  50.0
    # Response rates (in the target):
    #        o1    o2
    # n    8.0   6.0
    # %  100.0  75.0
target_response_rates() DataFrame | None[source]

Calculates relative_response_rates for the target in a Sample object.

See general_stats.relative_response_rates().

Parameters:

self (BalanceOutcomesDF) – Object (with/without a target set)

Returns:

None if the object doesn’t have a target.

If the object has a target, it returns the output of general_stats.relative_response_rates().

Return type:

Optional[pd.DataFrame]

Examples

import numpy as np
import pandas as pd

from balance.sample_class import Sample


s_o = Sample.from_frame(
    pd.DataFrame({"o1": (7, 8, 9, 10), "o2": (7, 8, 9, np.nan), "id": (1, 2, 3, 4)}),
    id_column="id",
    outcome_columns=("o1", "o2"),
)

t_o = Sample.from_frame(
    pd.DataFrame(
        {
            "o1": (7, 8, 9, 10, 11, 12, 13, 14),
            "o2": (7, 8, 9, np.nan, 11, 12, 13, 14),
            "id": (1, 2, 3, 4, 5, 6, 7, 8),
        }
    ),
    id_column="id",
    outcome_columns=("o1", "o2"),
)
s_o = s_o.set_target(t_o)

print(s_o.outcomes().target_response_rates())
    #       o1    o2
    # n    8.0   7.0
    # %  100.0  87.5
class balance.balancedf_class.BalanceWeightsDF(sample: Sample)[source]
design_effect() float64[source]

Calculates Kish’s design effect (deff) on the BalanceWeightsDF weights.

Extracts the first column to get a pd.Series of the weights.

See weights_stats.design_effect() for details.

Parameters:

self (BalanceWeightsDF) – Object.

Returns:

Deff.

Return type:

np.float64
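Kish's design effect can be written as n * sum(w^2) / (sum(w))^2. The following plain-Python sketch (the function name is illustrative) applies it to the weights (0.5, 2, 1, 1) used in the summary() example.

```python
def kish_design_effect(weights):
    """Kish's deff: n * sum(w^2) / (sum(w))^2."""
    n = len(weights)
    return n * sum(w * w for w in weights) / sum(weights) ** 2


deff = kish_design_effect((0.5, 2, 1, 1))
print(round(deff, 2))      # 1.23
print(round(4 / deff, 2))  # 3.24, the effective sample size n / deff
```

Equal weights give a design effect of exactly 1, so values above 1 indicate how much the variation in weights inflates the variance.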

plot(on_linked_samples: bool = True, **kwargs) List | ndarray[Any, dtype[ScalarType]] | Dict[str, Figure] | None[source]

Plots a kde (kernel density estimation) of the weights in a BalanceWeightsDF object, using seaborn by default.

Other plot types are available through dist_type, with arguments such as “hist”, “kde” (default), “qq”, and “ecdf”. See plot_dist() for more details.

Parameters:
  • self (BalanceWeightsDF) – A BalanceWeightsDF object.

  • on_linked_samples (bool, optional) – Determines if the linked samples should be included in the plot. Defaults to True.

Returns:

If library=”plotly”, returns a dictionary containing plots if return_dict_of_figures is True, and None otherwise. If library=”seaborn”, returns None unless return_axes is True, in which case it returns either a list or an np.array of matplotlib axes.

Return type:

Union[Union[List, np.ndarray], Dict[str, go.Figure], None]

Examples

import numpy as np
import pandas as pd
from numpy import random
from balance.sample_class import Sample

random.seed(96483)

df = pd.DataFrame({
    "id": range(100),
    'v1': random.randint(11111, 11115, size=100).astype(str),  # upper bound is exclusive
    'v2': random.normal(size = 100),
    'v3': random.uniform(size = 100),
    "w": pd.Series(np.ones(99).tolist() + [1000]),
}).sort_values(by=['v2'])

s1 = Sample.from_frame(df,
    id_column="id",
    weight_column="w",
    outcome_columns=["v1", "v2"],
)

s2 = Sample.from_frame(
    df.assign(w = pd.Series(np.ones(100))),
    id_column="id",
    weight_column="w",
    outcome_columns=["v1", "v2"],
)

s3 = s1.set_target(s2)
s3_null = s3.adjust(method="null")
s3_null.set_weights(random.random(size = 100) + 0.5)

# default: seaborn with dist_type = "kde"
s3_null.weights().plot()
summary(on_linked_samples: bool | None = None) DataFrame[source]

Generates a summary of a BalanceWeightsDF object.

This function provides a comprehensive overview of the BalanceWeightsDF object by calculating and returning a range of weight diagnostics.

Parameters:
  • self (BalanceWeightsDF) – The BalanceWeightsDF object to be summarized.

  • on_linked_samples (Optional[bool], optional) – This parameter is ignored. It is only included because summary overrides BalanceDF.summary. Defaults to None.

Returns:

A DataFrame containing various weight diagnostics, such as ‘design_effect’, ‘effective_sample_proportion’, ‘effective_sample_size’, the sum of weights, basic summary statistics from describe, ‘nonparametric_skew’, and ‘weighted_median_breakdown_point’, among others.

Return type:

pd.DataFrame

Note

The weights are normalized to sum to the sample size, n.

Examples

import pandas as pd
from balance.sample_class import Sample

s1 = Sample.from_frame(
    pd.DataFrame(
        {
            "a": (1, 2, 3, 1),
            "b": (-42, 8, 2, -42),
            "o": (7, 8, 9, 10),
            "c": ("x", "y", "z", "v"),
            "id": (1, 2, 3, 4),
            "w": (0.5, 2, 1, 1),
        }
    ),
    id_column="id",
    weight_column="w",
    outcome_columns="o",
)

print(s1.weights().summary().round(2))
    #                                 var   val
    # 0                     design_effect  1.23
    # 1       effective_sample_proportion  0.81
    # 2             effective_sample_size  3.24
    # 3                               sum  4.50
    # 4                    describe_count  4.00
    # 5                     describe_mean  1.00
    # 6                      describe_std  0.56
    # 7                      describe_min  0.44
    # 8                      describe_25%  0.78
    # 9                      describe_50%  0.89
    # 10                     describe_75%  1.11
    # 11                     describe_max  1.78
    # 12                    prop(w < 0.1)  0.00
    # 13                    prop(w < 0.2)  0.00
    # 14                  prop(w < 0.333)  0.00
    # 15                    prop(w < 0.5)  0.25
    # 16                      prop(w < 1)  0.75
    # 17                     prop(w >= 1)  0.25
    # 18                     prop(w >= 2)  0.00
    # 19                     prop(w >= 3)  0.00
    # 20                     prop(w >= 5)  0.00
    # 21                    prop(w >= 10)  0.00
    # 22               nonparametric_skew  0.20
    # 23  weighted_median_breakdown_point  0.25
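The describe_* and prop(...) rows are computed on normalized weights. A short plain-Python sketch of that normalization (rescaling so the weights sum to n) reproduces several of the rows above; the variable names are illustrative.

```python
weights = (0.5, 2, 1, 1)
n = len(weights)

# Rescale so that the weights sum to the sample size n.
normalized = [w * n / sum(weights) for w in weights]

print([round(w, 2) for w in normalized])      # [0.44, 1.78, 0.89, 0.89]
print(sum(w < 0.5 for w in normalized) / n)   # 0.25
print(sum(w < 1 for w in normalized) / n)     # 0.75
print(sum(w >= 1 for w in normalized) / n)    # 0.25
```

This explains why describe_mean is 1.00 and describe_min is 0.44, while the `sum` row still reports the original total of 4.50.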
trim(ratio: float | int | None = None, percentile: float | None = None, keep_sum_of_weights: bool = True) None[source]

Trim weights in the sample object.

Uses adjustments.trim_weights() for the weights trimming.

Parameters:
  • self (BalanceWeightsDF) – Object.

  • ratio (Optional[Union[float, int]], optional) – Maps to weight_trimming_mean_ratio. Defaults to None.

  • percentile (Optional[float], optional) – Maps to weight_trimming_percentile. Defaults to None.

  • keep_sum_of_weights (bool, optional) – Maps to keep_sum_of_weights. Defaults to True.

Returns:

None. This function updates the weights of _sample using set_weights().
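As a loose illustration of what ratio-based trimming can look like, the sketch below caps each weight at mean(weights) * ratio and, optionally, rescales the result so the total weight is unchanged. This is one plausible reading of the parameters, written in plain Python; it is not the adjustments.trim_weights() implementation, and `trim_by_mean_ratio` is a hypothetical name.

```python
def trim_by_mean_ratio(weights, ratio, keep_sum_of_weights=True):
    """Cap weights at mean(weights) * ratio; optionally rescale to the original sum."""
    original_sum = sum(weights)
    cap = ratio * original_sum / len(weights)
    trimmed = [min(w, cap) for w in weights]
    if keep_sum_of_weights:
        scale = original_sum / sum(trimmed)
        trimmed = [w * scale for w in trimmed]
    return trimmed


weights = (0.5, 2.0, 1.0, 1.0)
capped = trim_by_mean_ratio(weights, ratio=1.5, keep_sum_of_weights=False)
print(capped)  # [0.5, 1.6875, 1.0, 1.0]

rescaled = trim_by_mean_ratio(weights, ratio=1.5)
print(round(sum(rescaled), 6))  # 4.5
```

In this sketch, rescaling can push the largest weight slightly back above the cap, which is the trade-off for keeping the sum of weights fixed.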