balance.balancedf_class¶
- class balance.balancedf_class.BalanceCovarsDF(sample: Sample)[source]¶
- from_frame(df: DataFrame, weights=typing.Optional[pandas.core.series.Series]) BalanceCovarsDF [source]¶
A factory function to create a BalanceCovarsDF from a df.
Although generally the main way the object is created is through the __init__ method.
- Parameters:
self (BalanceCovarsDF) – Object
df (pd.DataFrame) – A df.
weights (Optional[pd.Series], optional) – Weights for the rows of df. Defaults to None.
- Returns:
Object.
- Return type:
BalanceCovarsDF
- class balance.balancedf_class.BalanceDF(df: DataFrame, sample: Sample, name: Literal['outcomes', 'weights', 'covars'])[source]¶
Wrapper class around a Sample which provides additional balance-specific functionality
- asmd(on_linked_samples: bool = True, target: BalanceDF | None = None, aggregate_by_main_covar: bool = False, **kwargs) DataFrame [source]¶
ASMD is the absolute difference of the means of two groups (say, P and T), divided by some standard deviation (std). It can be the std of P, of T, or of P and T combined. These are all variations on the absolute value of Cohen’s d (see: https://en.wikipedia.org/wiki/Effect_size#Cohen’s_d).
We can use asmd to compare multiple Samples (with and without adjustment) to a target population.
- Parameters:
self (BalanceDF) – Object from sample (with/without adjustment, but it needs some target)
on_linked_samples (bool, optional) – Whether to also compare against the linked sample objects (specifically: unadjusted). Defaults to True.
target (Optional["BalanceDF"], optional) – A BalanceDF (of the same type as the one used in self) to compare against. If None, a target is looked up among the objects linked to self. Defaults to None.
aggregate_by_main_covar (bool, optional) – If True, the asmd values of the one-hot encoded columns of each categorical variable are averaged into a single column per variable before the DataFrame is returned. See _aggregate_asmd_by_main_covar() for more details. Defaults to False.
- Raises:
ValueError – If self has no target and no target is supplied.
- Returns:
If on_linked_samples is False, then only one row (index name depends on the BalanceDF type, e.g.: covars), with the asmd of self vs the target. If on_linked_samples is True, then one row per source (self, unadjusted), each with the asmd compared to the target, and a third row for the difference (unadjusted - self).
- Return type:
pd.DataFrame
Examples
    import pandas as pd
    from balance.sample_class import Sample
    from copy import deepcopy

    s1 = Sample.from_frame(
        pd.DataFrame(
            {
                "a": (1, 2, 3, 1),
                "b": (-42, 8, 2, -42),
                "o": (7, 8, 9, 10),
                "c": ("x", "y", "z", "v"),
                "id": (1, 2, 3, 4),
                "w": (0.5, 2, 1, 1),
            }
        ),
        id_column="id",
        weight_column="w",
        outcome_columns="o",
    )

    s2 = Sample.from_frame(
        pd.DataFrame(
            {
                "a": (1, 2, 3),
                "b": (4, 6, 8),
                "id": (1, 2, 3),
                "w": (0.5, 1, 2),
                "c": ("x", "y", "z"),
            }
        ),
        id_column="id",
        weight_column="w",
    )

    s3 = s1.set_target(s2)
    s3_null = s3.adjust(method="null")
    s3_null_madeup_weights = deepcopy(s3_null)
    s3_null_madeup_weights.set_weights((1, 2, 3, 1))

    print(s3_null.covars().asmd().round(3))
        #                       a      b  c[v]   c[x]   c[y]   c[z]  mean(asmd)
        # source
        # self               0.56  8.747   NaN  0.069  0.266  0.533       3.175
        # unadjusted         0.56  8.747   NaN  0.069  0.266  0.533       3.175
        # unadjusted - self  0.00  0.000   NaN  0.000  0.000  0.000       0.000

    # show that on_linked_samples = False works:
    print(s3_null.covars().asmd(on_linked_samples = False).round(3))
        #            a      b  c[v]   c[x]   c[y]   c[z]  mean(asmd)
        # index
        # covars  0.56  8.747   NaN  0.069  0.266  0.533       3.175

    # verify this also works when we have some weights
    print(s3_null_madeup_weights.covars().asmd())
        #                           a         b  c[v]  ...      c[y]      c[z]  mean(asmd)
        # source                                       ...
        # self               0.296500  8.153742   NaN  ...  0.000000  0.218218    2.834932
        # unadjusted         0.560055  8.746742   NaN  ...  0.265606  0.533422    3.174566
        # unadjusted - self  0.263555  0.592999   NaN  ...  0.265606  0.315204     0.33963
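The arithmetic behind a single cell of the asmd table can be sketched in plain Python. This is a minimal illustration rather than the library's implementation: the helper name `asmd_single` is hypothetical, and using the target's std as the denominator is an assumption inferred from the example numbers. The inputs are the weighted means and target std of column a, copied from the outputs documented for mean() and std() on the same data.

```python
def asmd_single(sample_mean: float, target_mean: float, std: float) -> float:
    """Absolute difference of two means, scaled by a standard deviation."""
    return abs(sample_mean - target_mean) / std


# Weighted mean of "a" in self and in target, and the target's weighted std
# (assumption: the target's std is the denominator used by default).
asmd_a = asmd_single(sample_mean=1.888889, target_mean=2.428571, std=0.963624)
print(round(asmd_a, 3))  # 0.56
```

The result agrees with the 0.56 reported for column a in the self row of the example output.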
- asmd_improvement(unadjusted: BalanceDF | None = None, target: BalanceDF | None = None) float64 [source]¶
Calculates the improvement in mean(asmd) from before to after applying some weight adjustment.
See weighted_comparisons_stats.asmd_improvement() for details.
- Parameters:
self (BalanceDF) – BalanceDF (e.g.: of self after adjustment)
unadjusted (Optional["BalanceDF"], optional) – BalanceDF (e.g.: of self before adjustment). Defaults to None.
target (Optional["BalanceDF"], optional) – To compare against. Defaults to None.
- Raises:
ValueError – If target is not linked in self and also not provided to the function.
ValueError – If unadjusted is not linked in self and also not provided to the function.
- Returns:
The improvement is (before_mean_asmd - after_mean_asmd) / before_mean_asmd. The asmd is calculated using asmd().
- Return type:
np.float64
Examples
    import pandas as pd
    from balance.sample_class import Sample
    from copy import deepcopy

    s1 = Sample.from_frame(
        pd.DataFrame(
            {
                "a": (1, 2, 3, 1),
                "b": (-42, 8, 2, -42),
                "o": (7, 8, 9, 10),
                "c": ("x", "y", "z", "v"),
                "id": (1, 2, 3, 4),
                "w": (0.5, 2, 1, 1),
            }
        ),
        id_column="id",
        weight_column="w",
        outcome_columns="o",
    )

    s2 = Sample.from_frame(
        pd.DataFrame(
            {
                "a": (1, 2, 3),
                "b": (4, 6, 8),
                "id": (1, 2, 3),
                "w": (0.5, 1, 2),
                "c": ("x", "y", "z"),
            }
        ),
        id_column="id",
        weight_column="w",
    )

    s3 = s1.set_target(s2)
    s3_null = s3.adjust(method="null")
    s3_null_madeup_weights = deepcopy(s3_null)
    s3_null_madeup_weights.set_weights((1, 2, 3, 1))

    s3_null.covars().asmd_improvement()  # 0. since unadjusted is just a copy of self
    s3_null_madeup_weights.covars().asmd_improvement()  # 0.10698596233975825

    asmd_df = s3_null_madeup_weights.covars().asmd()
    print(asmd_df["mean(asmd)"])
        # source
        # self                 2.834932
        # unadjusted           3.174566
        # unadjusted - self    0.339634
        # Name: mean(asmd), dtype: float64

    # Positional access via .iloc: row 0 is self, row 1 is unadjusted.
    mean_asmd = asmd_df["mean(asmd)"]
    (mean_asmd.iloc[1] - mean_asmd.iloc[0]) / mean_asmd.iloc[1]  # 0.10698596233975825
    # just like asmd_improvement
- ci_of_mean(on_linked_samples: bool = True, **kwargs) DataFrame [source]¶
Calculates confidence intervals of the weighted mean on the df of the BalanceDF object.
- Parameters:
self (BalanceDF) – Object.
on_linked_samples (bool, optional) – Should the calculation be on self AND the linked sample objects? Defaults to True. If True, then uses _call_on_linked() with method “ci_of_mean”. If False, then uses _descriptive_stats() with method “ci_of_mean”.
kwargs – Arguments passed on to ci_of_mean, e.g.: conf_level and round_ndigits.
- Returns:
One row per object: self if on_linked_samples=False, and self and others (e.g.: target and unadjusted) if True. There is one column for each column of the relevant df (after applying model_matrix()).
- Return type:
pd.DataFrame
Examples
    import pandas as pd
    from balance.sample_class import Sample
    from balance.stats_and_plots.weighted_stats import ci_of_weighted_mean

    ci_of_weighted_mean(pd.Series((1, 2, 3, 1)), pd.Series((0.5, 2, 1, 1)), round_ndigits = 3)
        # 0    (1.232, 2.545)
        # dtype: object
        # This shows we got the first cell of 'a' as expected.

    s1 = Sample.from_frame(
        pd.DataFrame(
            {
                "a": (1, 2, 3, 1),
                "b": (-42, 8, 2, -42),
                "o": (7, 8, 9, 10),
                "c": ("x", "y", "z", "v"),
                "id": (1, 2, 3, 4),
                "w": (0.5, 2, 1, 1),
            }
        ),
        id_column="id",
        weight_column="w",
        outcome_columns="o",
    )

    s2 = Sample.from_frame(
        pd.DataFrame(
            {
                "a": (1, 2, 3),
                "b": (4, 6, 8),
                "id": (1, 2, 3),
                "w": (0.5, 1, 2),
                "c": ("x", "y", "z"),
            }
        ),
        id_column="id",
        weight_column="w",
    )

    s3 = s1.set_target(s2)
    s3_null = s3.adjust(method="null")

    print(s3_null.covars().ci_of_mean(round_ndigits = 3).T)
        # source               self           target         unadjusted
        # a          (1.232, 2.545)   (1.637, 3.221)     (1.232, 2.545)
        # b       (-32.715, 12.715)   (5.273, 8.441)  (-32.715, 12.715)
        # c[v]      (-0.183, 0.627)              NaN    (-0.183, 0.627)
        # c[x]      (-0.116, 0.338)  (-0.156, 0.442)    (-0.116, 0.338)
        # c[y]       (-0.12, 1.009)  (-0.233, 0.804)     (-0.12, 1.009)
        # c[z]      (-0.183, 0.627)   (-0.027, 1.17)    (-0.183, 0.627)

    s3_2 = s1.set_target(s2)
    s3_null_2 = s3_2.adjust(method="null")
    print(s3_null_2.outcomes().ci_of_mean(round_ndigits = 3))
        #                         o
        # source
        # self        (7.671, 9.44)
        # unadjusted  (7.671, 9.44)
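The first interval in the example can be reproduced from a mean and a variance of the mean. The sketch below assumes a normal-approximation interval (mean ± z · sqrt(variance)); that is an inference from the fact that the numbers match, not a statement about the library's internals. The helper name `normal_ci` is hypothetical, and the inputs are the documented mean() and var_of_mean() values for column a.

```python
from math import sqrt
from statistics import NormalDist


def normal_ci(mean: float, var_of_mean: float, conf_level: float = 0.95):
    """Two-sided normal-approximation confidence interval around a mean."""
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)  # about 1.96 for 95%
    half_width = z * sqrt(var_of_mean)
    return (mean - half_width, mean + half_width)


# Weighted mean and variance of the weighted mean for column "a" of self.
low, high = normal_ci(mean=1.888889, var_of_mean=0.112178)
print(round(low, 3), round(high, 3))  # 1.232 2.545
```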
- property df: DataFrame¶
Get the df of the BalanceDF object.
The df is stored in the BalanceDF.__df object, that is set during the __init__ of the object.
- Parameters:
self (BalanceDF) – The object.
- Returns:
The df (this is __df, with no weights) from the BalanceDF object.
- Return type:
pd.DataFrame
- mean(on_linked_samples: bool = True, **kwargs) DataFrame [source]¶
Calculates a weighted mean on the df of the BalanceDF object.
- Parameters:
self (BalanceDF) – Object.
on_linked_samples (bool, optional) – Should the calculation be on self AND the linked sample objects? Defaults to True. If True, then uses _call_on_linked() with method “mean”. If False, then uses _descriptive_stats() with method “mean”.
- Returns:
One row per object: self if on_linked_samples=False, and self and others (e.g.: target and unadjusted) if True. There is one column for each column of the relevant df (after applying model_matrix()).
- Return type:
pd.DataFrame
Examples
    import pandas as pd
    from balance.sample_class import Sample

    s1 = Sample.from_frame(
        pd.DataFrame(
            {
                "a": (1, 2, 3, 1),
                "b": (-42, 8, 2, -42),
                "o": (7, 8, 9, 10),
                "c": ("x", "y", "z", "v"),
                "id": (1, 2, 3, 4),
                "w": (0.5, 2, 1, 1),
            }
        ),
        id_column="id",
        weight_column="w",
        outcome_columns="o",
    )

    s2 = Sample.from_frame(
        pd.DataFrame(
            {
                "a": (1, 2, 3),
                "b": (4, 6, 8),
                "id": (1, 2, 3),
                "w": (0.5, 1, 2),
                "c": ("x", "y", "z"),
            }
        ),
        id_column="id",
        weight_column="w",
    )

    s3 = s1.set_target(s2)
    s3_null = s3.adjust(method="null")

    print(s3_null.covars().mean())
        #                    a          b      c[v]      c[x]      c[y]      c[z]
        # source
        # self        1.888889 -10.000000  0.222222  0.111111  0.444444  0.222222
        # target      2.428571   6.857143       NaN  0.142857  0.285714  0.571429
        # unadjusted  1.888889 -10.000000  0.222222  0.111111  0.444444  0.222222
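To make the numbers in the self row concrete, here is a small plain-Python sketch of a weighted mean, applied to the numeric column a and to the one-hot column c[x]. The helper `weighted_mean` is a hypothetical name used only for illustration; it is not the function the library calls.

```python
def weighted_mean(values, weights):
    """Sum of value * weight, divided by the sum of the weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


a = (1, 2, 3, 1)
c = ("x", "y", "z", "v")
w = (0.5, 2, 1, 1)

mean_a = weighted_mean(a, w)
# A one-hot column such as c[x] is a 0/1 indicator, so its weighted mean
# is the weighted share of rows with that level.
mean_c_x = weighted_mean([1.0 if level == "x" else 0.0 for level in c], w)
print(round(mean_a, 6), round(mean_c_x, 6))  # 1.888889 0.111111
```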
- mean_with_ci(round_ndigits: int = 3, on_linked_samples: bool = True) DataFrame [source]¶
Returns a table with means and confidence intervals (CIs) for all elements in the BalanceDF object.
This method calculates the mean and CI for each column of the BalanceDF object using the BalanceDF.mean() and BalanceDF.ci_of_mean() methods, respectively. The resulting table contains (for each element such as self, target and unadjusted) two columns for each input column: one for the mean and one for the CI.
- Parameters:
self (BalanceDF) – The BalanceDF object.
round_ndigits (int, optional) – The number of decimal places to round the mean and CI to. Defaults to 3.
on_linked_samples (bool, optional) – A boolean indicating whether to include linked samples when calculating the mean. Defaults to True.
- Returns:
- A table with two rows for each input column: one for the mean and one for the CI.
The columns of the table are labeled with the names of the input columns.
- Return type:
pd.DataFrame
Examples
    import numpy as np
    import pandas as pd
    from balance.sample_class import Sample

    s_o = Sample.from_frame(
        pd.DataFrame({"o1": (7, 8, 9, 10), "o2": (7, 8, 9, np.nan), "id": (1, 2, 3, 4)}),
        id_column="id",
        outcome_columns=("o1", "o2"),
    )

    t_o = Sample.from_frame(
        pd.DataFrame(
            {
                "o1": (7, 8, 9, 10, 11, 12, 13, 14),
                "o2": (7, 8, 9, np.nan, np.nan, 12, 13, 14),
                "id": (1, 2, 3, 4, 5, 6, 7, 8),
            }
        ),
        id_column="id",
        outcome_columns=("o1", "o2"),
    )
    s_o2 = s_o.set_target(t_o)

    print(s_o2.outcomes().mean_with_ci())
        # source            self  target             self           target
        # _is_na_o2[False]  0.75   0.750   (0.326, 1.174)     (0.45, 1.05)
        # _is_na_o2[True]   0.25   0.250  (-0.174, 0.674)    (-0.05, 0.55)
        # o1                8.50  10.500   (7.404, 9.596)  (8.912, 12.088)
        # o2                6.00   7.875   (2.535, 9.465)  (4.351, 11.399)
- model_matrix() DataFrame [source]¶
Returns a model_matrix version of the df inside the BalanceDF object, using balance_util.model_matrix.
This can be used to turn all character columns into one-hot encoded columns.
- Parameters:
self (BalanceDF) – Object
- Returns:
The output from balance_util.model_matrix().
- Return type:
pd.DataFrame
Examples
    import pandas as pd
    from balance.sample_class import Sample

    s1 = Sample.from_frame(
        pd.DataFrame(
            {
                "a": (1, 2, 3, 1),
                "b": (-42, 8, 2, -42),
                "o": (7, 8, 9, 10),
                "c": ("x", "y", "z", "v"),
                "id": (1, 2, 3, 4),
                "w": (0.5, 2, 1, 1),
            }
        ),
        id_column="id",
        weight_column="w",
        outcome_columns="o",
    )

    print(s1.covars().df)
        #    a   b  c
        # 0  1 -42  x
        # 1  2   8  y
        # 2  3   2  z
        # 3  1 -42  v

    print(s1.covars().model_matrix())
        #      a     b  c[v]  c[x]  c[y]  c[z]
        # 0  1.0 -42.0   0.0   1.0   0.0   0.0
        # 1  2.0   8.0   0.0   0.0   1.0   0.0
        # 2  3.0   2.0   0.0   0.0   0.0   1.0
        # 3  1.0 -42.0   1.0   0.0   0.0   0.0
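For readers unfamiliar with one-hot encoding, the following plain-Python sketch reproduces the c[...] columns of the output above. It only mimics the column naming and the sorted level order seen in the example; the helper `one_hot` is hypothetical and is not how balance_util.model_matrix is implemented.

```python
def one_hot(name, values):
    """Map a categorical column to one 0/1 indicator column per level."""
    levels = sorted(set(values))
    return {
        f"{name}[{level}]": [1.0 if v == level else 0.0 for v in values]
        for level in levels
    }


encoded = one_hot("c", ("x", "y", "z", "v"))
print(list(encoded))    # ['c[v]', 'c[x]', 'c[y]', 'c[z]']
print(encoded["c[v]"])  # [0.0, 0.0, 0.0, 1.0]
```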
- names() List [source]¶
Returns the column names of the DataFrame (df) inside a BalanceDF object.
- Parameters:
self (BalanceDF) – The object.
- Returns:
A list of column names.
- Return type:
List
Examples
    import pandas as pd
    from balance.sample_class import Sample

    s1 = Sample.from_frame(
        pd.DataFrame(
            {
                "a": (1, 2, 3, 1),
                "b": (-42, 8, 2, -42),
                "o": (7, 8, 9, 10),
                "c": ("x", "y", "z", "v"),
                "id": (1, 2, 3, 4),
                "w": (0.5, 2, 1, 1),
            }
        ),
        id_column="id",
        weight_column="w",
        outcome_columns="o",
    )

    s1.covars().names()
        # ['a', 'b', 'c']
    s1.weights().names()
        # ['w']
    s1.outcomes().names()
        # ['o']
- plot(on_linked_samples: bool = True, **kwargs) List | ndarray[Any, dtype[_ScalarType_co]] | Dict[str, Figure] | None [source]¶
Plots the variables in the df of the BalanceDF object.
See weighted_comparisons_plots.plot_dist() for details of various arguments that can be passed. The default plotting engine is plotly, but seaborn can be used for static plots.
This function is inherited as is when invoking BalanceCovarsDF.plot, but some modifications are made when preparing the data for BalanceOutcomesDF.plot and BalanceWeightsDF.plot.
- Parameters:
self (BalanceDF) – Object (used in the plots as “sample” or “self”)
on_linked_samples (bool, optional) – Determines if the linked samples should be included in the plot. Defaults to True.
**kwargs – passed to weighted_comparisons_plots.plot_dist().
- Returns:
If library=”plotly” then returns a dictionary containing plots if return_dict_of_figures is True. None otherwise. If library=”seaborn” then returns None, unless return_axes is True. Then either a list or an np.array of matplotlib axis.
- Return type:
Union[Union[List, np.ndarray], Dict[str, go.Figure], None]
Examples
    import numpy as np
    import pandas as pd
    from numpy import random
    from balance.sample_class import Sample

    random.seed(96483)

    df = pd.DataFrame({
        "id": range(100),
        # randint excludes the upper bound (random_integers is deprecated in NumPy)
        'v1': random.randint(11111, 11115, size=100).astype(str),
        'v2': random.normal(size = 100),
        'v3': random.uniform(size = 100),
        "w": pd.Series(np.ones(99).tolist() + [1000]),
    }).sort_values(by=['v2'])

    s1 = Sample.from_frame(
        df,
        id_column="id",
        weight_column="w",
    )

    s2 = Sample.from_frame(
        df.assign(w = pd.Series(np.ones(100))),
        id_column="id",
        weight_column="w",
    )

    s3 = s1.set_target(s2)
    s3_null = s3.adjust(method="null")
    s3_null.set_weights(random.random(size = 100) + 0.5)

    s3_null.covars().plot()
    s3_null.covars().plot(library = "seaborn")

    # Controlling the limits of the y axis using ylim:
    s3_null.covars().plot(ylim = (0,1))
    s3_null.covars().plot(library = "seaborn", ylim = (0,1), dist_type = "hist")

    # Returning plotly qq plots:
    s3_null.covars().plot(dist_type = "qq")
- std(on_linked_samples: bool = True, **kwargs) DataFrame [source]¶
Calculates a weighted std on the df of the BalanceDF object.
- Parameters:
self (BalanceDF) – Object.
on_linked_samples (bool, optional) – Should the calculation be on self AND the linked sample objects? Defaults to True. If True, then uses _call_on_linked() with method “std”. If False, then uses _descriptive_stats() with method “std”.
- Returns:
One row per object: self if on_linked_samples=False, and self and others (e.g.: target and unadjusted) if True. There is one column for each column of the relevant df (after applying model_matrix()).
- Return type:
pd.DataFrame
Examples
    import pandas as pd
    from balance.sample_class import Sample

    s1 = Sample.from_frame(
        pd.DataFrame(
            {
                "a": (1, 2, 3, 1),
                "b": (-42, 8, 2, -42),
                "o": (7, 8, 9, 10),
                "c": ("x", "y", "z", "v"),
                "id": (1, 2, 3, 4),
                "w": (0.5, 2, 1, 1),
            }
        ),
        id_column="id",
        weight_column="w",
        outcome_columns="o",
    )

    s2 = Sample.from_frame(
        pd.DataFrame(
            {
                "a": (1, 2, 3),
                "b": (4, 6, 8),
                "id": (1, 2, 3),
                "w": (0.5, 1, 2),
                "c": ("x", "y", "z"),
            }
        ),
        id_column="id",
        weight_column="w",
    )

    s3 = s1.set_target(s2)
    s3_null = s3.adjust(method="null")

    print(s3_null.covars().std())
        #                    a          b  c[v]      c[x]      c[y]      c[z]
        # source
        # self        0.886405  27.354812   0.5  0.377964  0.597614  0.500000
        # target      0.963624   1.927248   NaN  0.462910  0.597614  0.654654
        # unadjusted  0.886405  27.354812   0.5  0.377964  0.597614  0.500000
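The values for column a in the example are consistent with a weighted variance that uses the "reliability weights" bias correction, where the denominator is V1 - V2/V1, with V1 the sum of the weights and V2 the sum of their squares. The sketch below illustrates that formula in plain Python. The choice of estimator is an assumption inferred from the numbers, and `weighted_std` is a hypothetical helper, not the library function.

```python
from math import sqrt


def weighted_std(values, weights):
    """Weighted std with the reliability-weights correction (assumed estimator)."""
    v1 = sum(weights)
    v2 = sum(w * w for w in weights)
    mean = sum(x * w for x, w in zip(values, weights)) / v1
    sum_sq = sum(w * (x - mean) ** 2 for x, w in zip(values, weights))
    return sqrt(sum_sq / (v1 - v2 / v1))


std_self_a = weighted_std((1, 2, 3, 1), (0.5, 2, 1, 1))
std_target_a = weighted_std((1, 2, 3), (0.5, 1, 2))
print(round(std_self_a, 6), round(std_target_a, 6))  # 0.886405 0.963624
```

With equal weights the denominator reduces to n - 1, so the result coincides with the usual sample standard deviation.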
- summary(on_linked_samples: bool = True) DataFrame | str [source]¶
Returns a summary of the BalanceDF object.
This method currently calculates the mean and confidence interval (CI) for each column of the object using the BalanceDF.mean_with_ci() method. In the future, this method may be extended to include additional summary statistics.
- Parameters:
self (BalanceDF) – The BalanceDF object.
on_linked_samples (bool, optional) – A boolean indicating whether to include linked samples when calculating the mean and CI. Defaults to True.
- Returns:
- A table with two rows for each input column: one for the mean and one for the CI.
The columns of the table are labeled with the names of the input columns.
- Return type:
Union[pd.DataFrame, str]
- to_csv(path_or_buf: str | Path | IO | None = None, *args, **kwargs) str | None [source]¶
Write df with ids from BalanceDF to a comma-separated values (csv) file.
Uses pd.DataFrame.to_csv().
If an ‘index’ argument is not provided then it defaults to False.
- Parameters:
self (BalanceDF) – Object.
path_or_buf (Optional[FilePathOrBuffer], optional) – location where to save the csv.
- Returns:
If path_or_buf is None, returns the resulting csv format as a string. Otherwise returns None.
- Return type:
Optional[str]
- to_download(tempdir: str | None = None) FileLink [source]¶
Creates a downloadable link of the DataFrame, with ids, of the BalanceDF object.
The file name starts with tmp_balance_out_, followed by a random name (generated using uuid.uuid4()).
- Parameters:
self (BalanceDF) – Object.
tempdir (Optional[str], optional) – Defaults to None (in which case the temporary folder returned by tempfile.gettempdir() is used).
- Returns:
A link to the local file, for embedding in an IPython session (created using FileLink).
- Return type:
FileLink
- var_of_mean(on_linked_samples: bool = True, **kwargs) DataFrame [source]¶
Calculates a variance of the weighted mean on the df of the BalanceDF object.
- Parameters:
self (BalanceDF) – Object.
on_linked_samples (bool, optional) – Should the calculation be on self AND the linked sample objects? Defaults to True. If True, then uses _call_on_linked() with method “var_of_mean”. If False, then uses _descriptive_stats() with method “var_of_mean”.
- Returns:
One row per object: self if on_linked_samples=False, and self and others (e.g.: target and unadjusted) if True. There is one column for each column of the relevant df (after applying model_matrix()).
- Return type:
pd.DataFrame
Examples
    import pandas as pd
    from balance.sample_class import Sample
    from balance.stats_and_plots.weighted_stats import var_of_weighted_mean

    var_of_weighted_mean(pd.Series((1, 2, 3, 1)), pd.Series((0.5, 2, 1, 1)))
        # 0    0.112178
        # dtype: float64
        # This shows we got the first cell of 'a' as expected.

    s1 = Sample.from_frame(
        pd.DataFrame(
            {
                "a": (1, 2, 3, 1),
                "b": (-42, 8, 2, -42),
                "o": (7, 8, 9, 10),
                "c": ("x", "y", "z", "v"),
                "id": (1, 2, 3, 4),
                "w": (0.5, 2, 1, 1),
            }
        ),
        id_column="id",
        weight_column="w",
        outcome_columns="o",
    )

    s2 = Sample.from_frame(
        pd.DataFrame(
            {
                "a": (1, 2, 3),
                "b": (4, 6, 8),
                "id": (1, 2, 3),
                "w": (0.5, 1, 2),
                "c": ("x", "y", "z"),
            }
        ),
        id_column="id",
        weight_column="w",
    )

    s3 = s1.set_target(s2)
    s3_null = s3.adjust(method="null")

    print(s3_null.covars().var_of_mean())
        #                    a           b      c[v]      c[x]      c[y]      c[z]
        # source
        # self        0.112178  134.320988  0.042676  0.013413  0.082914  0.042676
        # target      0.163265    0.653061       NaN  0.023324  0.069971  0.093294
        # unadjusted  0.112178  134.320988  0.042676  0.013413  0.082914  0.042676
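The values for column a above are consistent with the linearization formula sum(w_i² · (y_i - ȳ)²) / (sum(w_i))², where ȳ is the weighted mean. The plain-Python sketch below shows that arithmetic. It is an illustration inferred from the example numbers, and `var_of_weighted_mean_sketch` is a hypothetical name, not the library function imported in the example.

```python
def var_of_weighted_mean_sketch(values, weights):
    """Variance of a weighted mean: sum((w * (y - ybar))^2) / (sum(w))^2."""
    total = sum(weights)
    ybar = sum(y * w for y, w in zip(values, weights)) / total
    return sum((w * (y - ybar)) ** 2 for y, w in zip(values, weights)) / total ** 2


var_self_a = var_of_weighted_mean_sketch((1, 2, 3, 1), (0.5, 2, 1, 1))
var_target_a = var_of_weighted_mean_sketch((1, 2, 3), (0.5, 1, 2))
print(round(var_self_a, 6), round(var_target_a, 6))  # 0.112178 0.163265
```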
- class balance.balancedf_class.BalanceOutcomesDF(sample: Sample)[source]¶
- relative_response_rates(target: bool | DataFrame = False, per_column: bool = False) DataFrame | None [source]¶
Produces a summary table of number of responses and proportion of completed responses.
See general_stats.relative_response_rates().
- Parameters:
self (BalanceOutcomesDF) – Object
target (Union[bool, pd.DataFrame], optional) – Determines what is passed to df_target in general_stats.relative_response_rates(). Defaults to False.
If False: passes None.
If True: passes the df from the target of sample (notice, it’s the df of target, NOT target.outcomes().df). This means only rows in which all columns are notnull are counted (so if the target has covars and outcomes, both need to be notnull for a row to be counted). To control this in a more specific way, pass a pd.DataFrame instead.
If pd.DataFrame: passes it as is.
per_column (bool, optional) – Default is False. See general_stats.relative_response_rates().
- Returns:
A column per outcome, and two rows: one row with the number of non-null observations, and a second row with the proportion of non-null observations. If ‘target’ is set to True but there is no target, the function returns None.
- Return type:
Optional[pd.DataFrame]
Examples
    import numpy as np
    import pandas as pd
    from balance.sample_class import Sample

    s_o = Sample.from_frame(
        pd.DataFrame({"o1": (7, 8, 9, 10), "o2": (7, 8, 9, np.nan), "id": (1, 2, 3, 4)}),
        id_column="id",
        outcome_columns=("o1", "o2"),
    )

    print(s_o.outcomes().relative_response_rates())
        #       o1    o2
        # n    4.0   3.0
        # %  100.0  75.0

    s_o.outcomes().relative_response_rates(target = True)
        # None

    # compared with a larger target
    t_o = Sample.from_frame(
        pd.DataFrame(
            {
                "o1": (7, 8, 9, 10, 11, 12, 13, 14),
                "o2": (7, 8, 9, np.nan, np.nan, 12, 13, 14),
                "id": (1, 2, 3, 4, 5, 6, 7, 8),
            }
        ),
        id_column="id",
        outcome_columns=("o1", "o2"),
    )
    s_o2 = s_o.set_target(t_o)
    print(s_o2.outcomes().relative_response_rates(True, per_column = True))
        #      o1    o2
        # n   4.0   3.0
        # %  50.0  50.0

    df_target = pd.DataFrame(
        {
            "o1": (7, 8, 9, 10, 11, 12, 13, 14),
            "o2": (7, 8, 9, np.nan, np.nan, 12, 13, 14),
        }
    )
    print(s_o2.outcomes().relative_response_rates(target = df_target, per_column = True))
        #      o1    o2
        # n   4.0   3.0
        # %  50.0  50.0
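The three denominators at play (rows in the sample, per-column non-null counts in the target, and fully non-null rows in the target) can be sketched in plain Python. This is a simplified illustration under the assumption that those are the only cases; `response_rates` and `count_notnull` are hypothetical helpers and do not reproduce general_stats.relative_response_rates().

```python
import math


def count_notnull(values):
    """Count entries that are not NaN."""
    return sum(1 for v in values if not (isinstance(v, float) and math.isnan(v)))


def response_rates(sample, target=None, per_column=False):
    """Return {column: (n_notnull, percent)} for dicts of column -> values."""
    n = {col: count_notnull(vals) for col, vals in sample.items()}
    if target is None:
        # Relative to the number of rows in the sample itself.
        n_rows = len(next(iter(sample.values())))
        denom = {col: n_rows for col in sample}
    elif per_column:
        # Relative to the non-null count of the same column in the target.
        denom = {col: count_notnull(target[col]) for col in sample}
    else:
        # Relative to target rows in which every column is non-null.
        complete = sum(
            1 for row in zip(*target.values()) if count_notnull(row) == len(row)
        )
        denom = {col: complete for col in sample}
    return {col: (n[col], 100 * n[col] / denom[col]) for col in sample}


nan = float("nan")
sample = {"o1": (7, 8, 9, 10), "o2": (7, 8, 9, nan)}
target = {"o1": (7, 8, 9, 10, 11, 12, 13, 14), "o2": (7, 8, 9, nan, nan, 12, 13, 14)}

print(response_rates(sample))                           # {'o1': (4, 100.0), 'o2': (3, 75.0)}
print(response_rates(sample, target, per_column=True))  # {'o1': (4, 50.0), 'o2': (3, 50.0)}
```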
- summary(on_linked_samples: bool | None = None) str [source]¶
Produces summary printable string of a BalanceOutcomesDF object.
- Parameters:
self (BalanceOutcomesDF) – Object.
on_linked_samples (Optional[bool]) – Ignored. Only here since summary overrides BalanceDF.summary.
- Returns:
A printable string, with mean of outcome variables and response rates.
- Return type:
str
Examples
    import numpy as np
    import pandas as pd
    from balance.sample_class import Sample

    s_o = Sample.from_frame(
        pd.DataFrame({"o1": (7, 8, 9, 10), "o2": (7, 8, 9, np.nan), "id": (1, 2, 3, 4)}),
        id_column="id",
        outcome_columns=("o1", "o2"),
    )

    t_o = Sample.from_frame(
        pd.DataFrame(
            {
                "o1": (7, 8, 9, 10, 11, 12, 13, 14),
                "o2": (7, 8, 9, np.nan, np.nan, 12, 13, 14),
                "id": (1, 2, 3, 4, 5, 6, 7, 8),
            }
        ),
        id_column="id",
        outcome_columns=("o1", "o2"),
    )
    s_o2 = s_o.set_target(t_o)

    print(s_o.outcomes().summary())
        # 2 outcomes: ['o1' 'o2']
        # Mean outcomes (with 95% confidence intervals):
        # source            self             self
        # _is_na_o2[False]  0.75   (0.326, 1.174)
        # _is_na_o2[True]   0.25  (-0.174, 0.674)
        # o1                8.50   (7.404, 9.596)
        # o2                6.00   (2.535, 9.465)
        #
        # Response rates (relative to number of respondents in sample):
        #       o1    o2
        # n    4.0   3.0
        # %  100.0  75.0

    print(s_o2.outcomes().summary())
        # 2 outcomes: ['o1' 'o2']
        # Mean outcomes (with 95% confidence intervals):
        # source            self  target             self           target
        # _is_na_o2[False]  0.75   0.750   (0.326, 1.174)     (0.45, 1.05)
        # _is_na_o2[True]   0.25   0.250  (-0.174, 0.674)    (-0.05, 0.55)
        # o1                8.50  10.500   (7.404, 9.596)  (8.912, 12.088)
        # o2                6.00   7.875   (2.535, 9.465)  (4.351, 11.399)
        #
        # Response rates (relative to number of respondents in sample):
        #       o1    o2
        # n    4.0   3.0
        # %  100.0  75.0
        # Response rates (relative to notnull rows in the target):
        #           o1    o2
        # n   4.000000   3.0
        # %  66.666667  50.0
        # Response rates (in the target):
        #       o1    o2
        # n    8.0   6.0
        # %  100.0  75.0
- target_response_rates() DataFrame | None [source]¶
Calculates relative_response_rates for the target in a Sample object.
See general_stats.relative_response_rates().
- Parameters:
self (BalanceOutcomesDF) – Object (with/without a target set)
- Returns:
None if the object doesn’t have a target. If the object has a target, it returns the output of general_stats.relative_response_rates().
- Return type:
Optional[pd.DataFrame]
Examples
    import numpy as np
    import pandas as pd
    from balance.sample_class import Sample

    s_o = Sample.from_frame(
        pd.DataFrame({"o1": (7, 8, 9, 10), "o2": (7, 8, 9, np.nan), "id": (1, 2, 3, 4)}),
        id_column="id",
        outcome_columns=("o1", "o2"),
    )

    t_o = Sample.from_frame(
        pd.DataFrame(
            {
                "o1": (7, 8, 9, 10, 11, 12, 13, 14),
                "o2": (7, 8, 9, np.nan, 11, 12, 13, 14),
                "id": (1, 2, 3, 4, 5, 6, 7, 8),
            }
        ),
        id_column="id",
        outcome_columns=("o1", "o2"),
    )
    s_o = s_o.set_target(t_o)

    print(s_o.outcomes().target_response_rates())
        #       o1    o2
        # n    8.0   7.0
        # %  100.0  87.5
- class balance.balancedf_class.BalanceWeightsDF(sample: Sample)[source]¶
- design_effect() float64 [source]¶
Calculates Kish’s design effect (deff) on the BalanceWeightsDF weights.
Extract the first column to get a pd.Series of the weights.
See weights_stats.design_effect() for details.
- Parameters:
self (BalanceWeightsDF) – Object.
- Returns:
Deff.
- Return type:
np.float64
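As a quick reference, Kish's design effect is n · sum(w²) / (sum(w))². The plain-Python sketch below applies it to the weights (0.5, 2, 1, 1) used in the summary() example of this class, and also derives the effective sample size as n / deff. The helper name `kish_design_effect` is hypothetical; the library's own routine is weights_stats.design_effect().

```python
def kish_design_effect(weights):
    """Kish's design effect: n * sum(w^2) / (sum(w))^2."""
    n = len(weights)
    return n * sum(w * w for w in weights) / sum(weights) ** 2


w = (0.5, 2, 1, 1)
deff = kish_design_effect(w)
effective_sample_size = len(w) / deff
effective_sample_proportion = effective_sample_size / len(w)
print(round(deff, 2), round(effective_sample_size, 2), round(effective_sample_proportion, 2))
# 1.23 3.24 0.81
```

Equal weights give a design effect of exactly 1, so any value above 1 reflects the variance cost of unequal weighting.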
- plot(on_linked_samples: bool = True, **kwargs) List | ndarray[Any, dtype[_ScalarType_co]] | Dict[str, Figure] | None [source]¶
Plots kde (kernel density estimation) of the weights in a BalanceWeightsDF object using seaborn (as default).
It’s possible to use other plots using dist_type with arguments such as “hist”, “kde” (default), “qq”, and “ecdf”. Look at plot_dist() for more details.
- Parameters:
self (BalanceWeightsDF) – a BalanceWeightsDF object.
on_linked_samples (bool, optional) – Determines if the linked samples should be included in the plot. Defaults to True.
- Returns:
If library=”plotly” then returns a dictionary containing plots if return_dict_of_figures is True. None otherwise. If library=”seaborn” then returns None, unless return_axes is True. Then either a list or an np.array of matplotlib axis.
- Return type:
Union[Union[List, np.ndarray], Dict[str, go.Figure], None]
Examples
    import numpy as np
    import pandas as pd
    from numpy import random
    from balance.sample_class import Sample

    random.seed(96483)

    df = pd.DataFrame({
        "id": range(100),
        # randint excludes the upper bound (random_integers is deprecated in NumPy)
        'v1': random.randint(11111, 11115, size=100).astype(str),
        'v2': random.normal(size = 100),
        'v3': random.uniform(size = 100),
        "w": pd.Series(np.ones(99).tolist() + [1000]),
    }).sort_values(by=['v2'])

    s1 = Sample.from_frame(
        df,
        id_column="id",
        weight_column="w",
        outcome_columns=["v1", "v2"],
    )

    s2 = Sample.from_frame(
        df.assign(w = pd.Series(np.ones(100))),
        id_column="id",
        weight_column="w",
        outcome_columns=["v1", "v2"],
    )

    s3 = s1.set_target(s2)
    s3_null = s3.adjust(method="null")
    s3_null.set_weights(random.random(size = 100) + 0.5)

    # default: seaborn with dist_type = "kde"
    s3_null.weights().plot()
- summary(on_linked_samples: bool | None = None) DataFrame [source]¶
Generates a summary of a BalanceWeightsDF object.
This function provides a comprehensive overview of the BalanceWeightsDF object by calculating and returning a range of weight diagnostics.
- Parameters:
self (BalanceWeightsDF) – The BalanceWeightsDF object to be summarized.
on_linked_samples (Optional[bool], optional) – This parameter is ignored. It is only included because summary overrides BalanceDF.summary. Defaults to None.
- Returns:
A DataFrame containing various weight diagnostics such as ‘design_effect’, ‘effective_sample_proportion’, ‘effective_sample_size’, sum of weights, basic summary statistics from describe, ‘nonparametric_skew’, and ‘weighted_median_breakdown_point’, among others.
- Return type:
pd.DataFrame
Note
The weights are normalized to sum to the sample size, n.
Examples
    import pandas as pd
    from balance.sample_class import Sample

    s1 = Sample.from_frame(
        pd.DataFrame(
            {
                "a": (1, 2, 3, 1),
                "b": (-42, 8, 2, -42),
                "o": (7, 8, 9, 10),
                "c": ("x", "y", "z", "v"),
                "id": (1, 2, 3, 4),
                "w": (0.5, 2, 1, 1),
            }
        ),
        id_column="id",
        weight_column="w",
        outcome_columns="o",
    )

    print(s1.weights().summary().round(2))
        #                                 var   val
        # 0                     design_effect  1.23
        # 1       effective_sample_proportion  0.81
        # 2             effective_sample_size  3.24
        # 3                               sum  4.50
        # 4                    describe_count  4.00
        # 5                     describe_mean  1.00
        # 6                      describe_std  0.56
        # 7                      describe_min  0.44
        # 8                      describe_25%  0.78
        # 9                      describe_50%  0.89
        # 10                     describe_75%  1.11
        # 11                     describe_max  1.78
        # 12                    prop(w < 0.1)  0.00
        # 13                    prop(w < 0.2)  0.00
        # 14                  prop(w < 0.333)  0.00
        # 15                    prop(w < 0.5)  0.25
        # 16                      prop(w < 1)  0.75
        # 17                     prop(w >= 1)  0.25
        # 18                     prop(w >= 2)  0.00
        # 19                     prop(w >= 3)  0.00
        # 20                     prop(w >= 5)  0.00
        # 21                    prop(w >= 10)  0.00
        # 22               nonparametric_skew  0.20
        # 23  weighted_median_breakdown_point  0.25
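Several rows of this table follow directly from the note that weights are normalized to sum to n. The sketch below recomputes a few of them in plain Python. It assumes that the describe and prop rows are computed on the normalized weights, and that nonparametric skew is (mean - median) / std with the sample (n - 1) standard deviation; both assumptions are inferred from the numbers, not taken from the library source.

```python
from statistics import mean, median, stdev

w = (0.5, 2, 1, 1)
n = len(w)

# Rescale so that the weights sum to the sample size n.
normalized = [x * n / sum(w) for x in w]

prop_below_half = sum(x < 0.5 for x in normalized) / n
prop_below_one = sum(x < 1 for x in normalized) / n
skew = (mean(normalized) - median(normalized)) / stdev(normalized)

print([round(x, 2) for x in normalized])                # [0.44, 1.78, 0.89, 0.89]
print(prop_below_half, prop_below_one, round(skew, 2))  # 0.25 0.75 0.2
```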
- trim(ratio: float | int | None = None, percentile: float | None = None, keep_sum_of_weights: bool = True) None [source]¶
Trim weights in the sample object.
Uses adjustments.trim_weights() for the weights trimming.
- Parameters:
self (BalanceWeightsDF) – Object.
ratio (Optional[Union[float, int]], optional) – Maps to weight_trimming_mean_ratio. Defaults to None.
percentile (Optional[float], optional) – Maps to weight_trimming_percentile. Defaults to None.
keep_sum_of_weights (bool, optional) – Maps to keep_sum_of_weights. Defaults to True.
- Returns:
None. This function updates the _sample using set_weights().
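To give a feel for what trimming by a mean ratio does, here is a plain-Python sketch. It assumes that ratio caps each weight at ratio × mean(weights) and that keep_sum_of_weights rescales the trimmed weights back to the original total. The helper `trim_by_mean_ratio` is hypothetical and is not adjustments.trim_weights(); check that function for the actual behaviour.

```python
def trim_by_mean_ratio(weights, ratio, keep_sum_of_weights=True):
    """Cap weights at ratio * mean, optionally rescaling to the original sum."""
    original_sum = sum(weights)
    cap = ratio * original_sum / len(weights)
    trimmed = [min(w, cap) for w in weights]
    if keep_sum_of_weights:
        scale = original_sum / sum(trimmed)
        trimmed = [w * scale for w in trimmed]
    return trimmed


# The mean weight is 2, so ratio=2 caps at 4 before rescaling.
trimmed = trim_by_mean_ratio([1, 1, 1, 1, 6], ratio=2)
print(trimmed)  # [1.25, 1.25, 1.25, 1.25, 5.0]
```

Note that under these assumptions the rescaling step can push the largest weight back above the cap (5.0 here versus a cap of 4), which is the price of preserving the sum of weights.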