balance.sample_class¶
- class balance.sample_class.Sample[source]¶
A class used to represent a sample.
Sample is the main object of balance. It contains a dataframe of the units’ observations, associated with an id and a weight.
- id_column¶
a column representing the ids of the units in sample
- Type:
pd.Series
- weight_column¶
a column representing the weights of the units in sample
- Type:
pd.Series
- adjust(target: Sample | None = None, method: Literal['cbps', 'ipw', 'null', 'poststratify', 'rake'] | Callable = 'ipw', *args, **kwargs) Sample [source]¶
Perform adjustment of one sample to match another. This function returns a new sample.
- Parameters:
target (Optional["Sample"]) – Second sample object which should be matched. If None, the set target of the object is used for matching.
method (str) – method for adjustment: cbps, ipw, null, poststratify, rake
- Returns:
an adjusted Sample object
- Return type:
Sample
- covar_means() DataFrame [source]¶
Compare the means of covariates (after using BalanceDF.model_matrix()) before and after adjustment, as compared with the target.
- Parameters:
self (Sample) – A Sample object produced after running Sample.adjust(). It should include 3 components: “unadjusted”, “adjusted”, “target”.
- Returns:
A DataFrame with 3 columns (“unadjusted”, “adjusted”, “target”), and a row for each feature of the covariates. The cells show the mean value. For categorical features, they are first transformed into the one-hot encoding. For these columns, since they are all either 0 or 1, their means should be interpreted as proportions.
- Return type:
pd.DataFrame
Examples
from balance import Sample
import pandas as pd

s = Sample.from_frame(
    pd.DataFrame(
        {"a": (0, 1, 2), "c": ("a", "b", "c"), "o": (1, 3, 5), "id": (1, 2, 3)}
    ),
    outcome_columns=("o"),
)
s_adjusted = s.set_target(s).adjust(method="null")
print(s_adjusted.covar_means())
# source  unadjusted  adjusted    target
# a         1.000000  1.000000  1.000000
# c[a]      0.333333  0.333333  0.333333
# c[b]      0.333333  0.333333  0.333333
# c[c]      0.333333  0.333333  0.333333
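The proportion reading of one-hot means can be checked by hand. The following is a minimal sketch in plain Python, not balance's implementation: the helper name `one_hot_means` is hypothetical, and unit weights are assumed.

```python
from collections import Counter


def one_hot_means(values, name="c"):
    """Mean of each one-hot indicator column of a categorical feature.

    Hypothetical helper for illustration only; assumes every unit has weight 1.
    """
    n = len(values)
    counts = Counter(values)
    # The mean of a 0/1 indicator is (number of ones) / n, i.e. a proportion.
    return {f"{name}[{level}]": counts[level] / n for level in sorted(counts)}


means = one_hot_means(["a", "b", "c"])
print(means)
```

Each value is the share of units at that level, so the values for one categorical feature sum to 1.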
- covars()[source]¶
Produce a BalanceCovarsDF from a Sample object. See the BalanceCovarsDF class.
- Parameters:
self (Sample) – Sample object.
- Returns:
BalanceCovarsDF
- design_effect() float64 [source]¶
Return the design effect of the weights of Sample. Uses weights_stats.design_effect().
- Parameters:
self (Sample) – A Sample object
- Returns:
Design effect
- Return type:
np.float64
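This page does not state the formula behind the design effect. As a rough sketch, assuming Kish's approximation (an assumption here, not confirmed by this page), it can be computed from the weights alone. The function name `kish_design_effect` is hypothetical.

```python
def kish_design_effect(weights):
    """Kish's design effect: mean(w^2) / mean(w)^2.

    Assumption: this mirrors the usual Kish formula; it is not taken
    from balance's source code.
    """
    n = len(weights)
    mean_w = sum(weights) / n
    mean_w_squared = sum(w * w for w in weights) / n
    return mean_w_squared / mean_w ** 2


deff_equal = kish_design_effect([1.0, 1.0, 1.0, 1.0])    # equal weights
deff_unequal = kish_design_effect([1.0, 1.0, 1.0, 3.0])  # one dominant unit
print(deff_equal, deff_unequal)
```

Under this formula equal weights give a design effect of exactly 1, and the value grows as the weights become more unequal.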
- design_effect_prop() float64 [source]¶
Return the relative difference in design effect of the weights of the unadjusted sample and the adjusted sample. I.e. (Deff of adjusted - Deff of unadjusted) / Deff of unadjusted. Uses weights_stats.design_effect().
- Parameters:
self (Sample) – A Sample object produced after running Sample.adjust(). It should include 3 components: “unadjusted”, “adjusted”, “target”.
- Returns:
relative difference in design effect.
- Return type:
np.float64
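The relative difference described above is simple arithmetic on two design effects. A small sketch, where the function name and the input values are made up for illustration:

```python
def design_effect_relative_change(deff_adjusted, deff_unadjusted):
    """(Deff of adjusted - Deff of unadjusted) / Deff of unadjusted."""
    return (deff_adjusted - deff_unadjusted) / deff_unadjusted


# Hypothetical values: adjustment raised the design effect from 1.0 to 1.5.
prop = design_effect_relative_change(deff_adjusted=1.5, deff_unadjusted=1.0)
print(prop)
```

A positive value means the adjusted weights are more variable than the unadjusted ones.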
- property df: DataFrame¶
Produce a DataFrame (of the self) from a Sample object.
- Parameters:
self (Sample) – Sample object.
- Returns:
A DataFrame with the id_column, and the df values of covars(), outcomes() and weights() of the Sample object.
- Return type:
pd.DataFrame
- diagnostics() DataFrame [source]¶
Output a table of diagnostics about adjusted Sample object.
size¶
All values in the “size” metrics are AFTER any rows/columns were filtered. So, for example, if we use respondents from previous days but filter them for diagnostics purposes, then sample_obs and target_obs will NOT include them in the counting. The same is true for sample_covars and target_covars. In the “size” metrics we have the following ‘var’s:
- sample_obs - number of respondents
- sample_covars - number of covariates (main covars, before any transformations were used)
- target_obs - number of users used to represent the target pop
- target_covars - like sample_covars, but for target.
weights_diagnostics¶
In the “weights_diagnostics” metric we have the following ‘var’s:
- design effect (de), effective sample size (n/de), effective sample ratio (1/de)
- sum
- describe of the (normalized to sample size) weights (mean, median, std, etc.)
- prop of the (normalized to sample size) weights that are below or above some numbers (1/2, 1, 2, etc.)
- nonparametric_skew and weighted_median_breakdown_point
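The effective sample size (n/de) and effective sample ratio (1/de) follow directly from the design effect. A minimal sketch with hypothetical names and made-up inputs:

```python
def effective_sample_metrics(n, design_effect):
    """Derive n/de and 1/de from a sample size and a design effect."""
    return {
        "effective_sample_size": n / design_effect,
        "effective_sample_ratio": 1 / design_effect,
    }


# Hypothetical survey: 1000 respondents and a design effect of 1.25.
metrics = effective_sample_metrics(n=1000, design_effect=1.25)
print(metrics)
```

Here the weighted sample carries roughly the information of 800 equally weighted respondents.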
Why is the diagnostics focused on weights normalized to sample size¶
There are 3 well-known normalizations of weights:
1. to sum to 1
2. to sum to target population
3. to sum to n (sample size)
Each one has its own merits:
1. is good when wanting to easily calculate the average of some response variable (then we just use sum(w*y), with no need to divide by sum(w)).
2. is good for sums. For example, how many people in the US use android? For this we’d like the weight of each person to represent their share of the population, and then we just sum the weights of the people who use android in the survey.
3. is good for understanding the relative “importance” of a respondent as compared to the weights of others in the survey. So if someone has a weight that is >1, it means that this respondent (conditional on their covariates) was ‘rare’ in the survey, so the model we used decided to give them a larger weight to account for all the people like them who didn’t answer.
For diagnostics purposes, option 3 is the most useful for discussing the distribution of the weights (e.g.: how many respondents got a weight >2 or <0.5). This is a method, standardized across surveys, that helps us identify how many of the respondents are “dominating” and have a large influence on the conclusions we draw from the survey.
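The three normalizations differ only in the total the weights are rescaled to. The sketch below uses plain Python with made-up weights and a made-up population size. It illustrates the idea and is not balance's own code.

```python
def normalize_weights(weights, total):
    """Rescale weights so that they sum to `total`."""
    current_sum = sum(weights)
    return [w * total / current_sum for w in weights]


raw = [0.5, 1.0, 1.5, 5.0]  # hypothetical raw weights
n = len(raw)

to_one = normalize_weights(raw, 1.0)               # option 1: sum to 1
to_population = normalize_weights(raw, 1_000_000)  # option 2: hypothetical population size
to_n = normalize_weights(raw, n)                   # option 3: sum to sample size

# With option 3, thresholds such as 2 and 1/2 are comparable across surveys.
prop_above_2 = sum(w > 2 for w in to_n) / n
prop_below_half = sum(w < 0.5 for w in to_n) / n
print(to_n, prop_above_2, prop_below_half)
```

With these inputs the weights normalized to n are 0.25, 0.5, 0.75 and 2.5, so one respondent in four exceeds 2 and one in four falls below 1/2.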
model_glance¶
Properties of the model fitted, depends on the model used for weighting.
covariates ASMD¶
Includes covariates ASMD before and after adjustment (per level of covariate and aggregated) and the ASMD improvement.
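ASMD stands for absolute standardized mean difference. As a rough sketch, assuming the difference in means is standardized by the target's standard deviation and that improvement is measured relative to the unadjusted ASMD (both are assumptions, not stated on this page), the quantities can be computed as follows. All names and numbers are hypothetical.

```python
def asmd(sample_mean, target_mean, target_sd):
    """Absolute standardized mean difference for one covariate.

    Assumption: standardizes by the target's standard deviation.
    """
    return abs(sample_mean - target_mean) / target_sd


def asmd_improvement(asmd_unadjusted, asmd_adjusted):
    """Share of the unadjusted ASMD removed by the adjustment (assumed definition)."""
    return (asmd_unadjusted - asmd_adjusted) / asmd_unadjusted


# Hypothetical covariate (e.g. age): the target mean is 45 with sd 10.
before = asmd(sample_mean=40.0, target_mean=45.0, target_sd=10.0)
after = asmd(sample_mean=44.0, target_mean=45.0, target_sd=10.0)
improvement = asmd_improvement(before, after)
print(before, after, improvement)
```

In this made-up case the adjustment moves the ASMD from 0.5 to 0.1, an improvement of about 80%.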
- Parameters:
self (Sample) – a Sample object, only after running an adjustment with Sample.adjust().
- Returns:
A DataFrame with 3 columns: (“metric”, “val”, “var”), indicating various tracking metrics on the model.
- Return type:
pd.DataFrame
- classmethod from_frame(df: DataFrame, id_column: str | None = None, outcome_columns: list | tuple | str | None = None, weight_column: str | None = None, check_id_uniqueness: bool = True, standardize_types: bool = True, use_deepcopy: bool = True) Sample [source]¶
Create a new Sample object.
NOTE that all integer columns will be converted by default into floats. This behavior can be turned off by setting the standardize_types argument to False. This is done by default because of missing value handling, combined with balance’s current lack of support for pandas Integer types:
1. Native numpy integers do not support missing values (NA), while pandas Integers do, as do numpy floats.
2. Various functions in balance do not support pandas Integers, while they do support numpy floats.
3. Hence, since some columns might have missing values, the safest solution is to just convert all integers into numpy floats.
The id_column is stored as a string, even if the input is an integer.
- Parameters:
df (pd.DataFrame) – containing the sample’s data
id_column (Optional, Optional[str]) – the column of the df which contains the respondent’s id (should be unique). Defaults to None.
outcome_columns (Optional, Optional[Union[list, tuple, str]]) – names of columns to treat as outcome
weight_column (Optional, Optional[str]) – name of column to treat as weight. If not specified, will be guessed (either “weight” or “weights”). If not found, a new column will be created (“weight”) and filled with 1.0.
check_id_uniqueness (Optional, bool) – Whether to check if ids are unique. Defaults to True.
standardize_types (Optional, bool) – Whether to standardize types. Defaults to True. The conversions are:
- Int64/int64 -> float64
- Int32/int32 -> float64
- string -> object
- pandas.NA -> numpy.nan (within each cell)
This is slightly memory intensive (since it copies the data twice), but helps keep various functions working for both Int64 and Int32 input columns.
use_deepcopy (Optional, bool) – Whether to have a new df copy inside the sample object. If False, then when the sample methods update the internal df then the original df will also be updated. Defaults to True.
- Returns:
a sample object
- Return type:
Sample
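The weight_column fallback described in the parameters can be pictured with a small sketch. This is not balance's code: the helper `resolve_weight_column` is hypothetical, and checking “weight” before “weights” is an assumption.

```python
def resolve_weight_column(columns, weight_column=None):
    """Return (column_name, needs_creation) for the weight column.

    Hypothetical helper; the order in which the candidates are checked
    is an assumption.
    """
    if weight_column is not None:
        return weight_column, False
    for candidate in ("weight", "weights"):
        if candidate in columns:
            return candidate, False
    # Nothing found: a new "weight" column would be created and filled with 1.0.
    return "weight", True


print(resolve_weight_column(["id", "a", "weights"]))
print(resolve_weight_column(["id", "a"]))
```

Passing weight_column explicitly avoids relying on the guess.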
- has_target() bool [source]¶
Check if a Sample object has target attached.
- Returns:
whether the Sample has target attached
- Return type:
bool
- is_adjusted() bool [source]¶
Check if a Sample object is adjusted and has target attached
- Returns:
whether the Sample is adjusted or not.
- Return type:
bool
- keep_only_some_rows_columns(rows_to_keep: str | None = None, columns_to_keep: List[str] | None = None) Sample [source]¶
This function returns a copy of the sample object after keeping only the requested rows and columns in the _df and _links objects (the latter include the unadjusted and target objects).
This function is useful when wanting to calculate metrics, such as ASMD, but only on some of the features, or part of the observations.
- Parameters:
self (Sample) – a sample object (preferably after adjustment)
rows_to_keep (Optional[str], optional) – A string with a condition to eval (on some of the columns). This will run df.eval(rows_to_keep) which will return a pd.Series of bool by which we will filter the Sample object. This affects both the df of covars AND the weights column (weight_column) AND the outcome column (_outcome_columns), AND the id_column column. Input should be a boolean feature, or a condition such as: ‘gender == “Female” & age >= 18’. Defaults to None.
columns_to_keep (Optional[List[str]], optional) – the covariates of interest. Defaults to None, which returns all columns.
- Returns:
A copy of the original object. If both rows and columns to keep are None, returns the copied object unchanged. If some are not None, will update first the rows, then the columns. This performs the transformation on both the sample’s df and its linked dfs (unadjusted, target).
- Return type:
Sample
- model() Dict | None [source]¶
Returns the model used to adjust the Sample, if adjusted. Otherwise returns None.
- Parameters:
self (Sample) – Sample object.
- Returns:
the model used for adjusting the Sample
- Return type:
Dict or None
- model_matrix() DataFrame [source]¶
Returns the model matrix of sample using model_matrix(), while adding an NA indicator for null values (see add_na_indicator()).
- Returns:
model matrix of sample
- Return type:
pd.DataFrame
- outcome_sd_prop() Series [source]¶
Return the difference in outcome weighted standard deviation (sd) of the unadjusted sample and the adjusted sample, relative to the unadjusted weighted sd. I.e. (weighted sd of adjusted - weighted sd of unadjusted) / weighted sd of unadjusted. Uses BalanceDF.weighted_stats.weighted_sd().
- Parameters:
self (Sample) – A Sample object produced after running Sample.adjust(). It should include 3 components: “unadjusted”, “adjusted”, “target”.
- Returns:
(np.float64) relative difference in outcome weighted standard deviation.
- Return type:
pd.Series
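The relative difference in weighted sd can be worked through on a toy outcome. This sketch uses a population-style weighted variance (dividing by the sum of weights). That denominator is an assumption, and balance's weighted_sd() may apply a different correction. Names and data are hypothetical.

```python
import math


def weighted_sd(values, weights):
    """Weighted standard deviation, dividing by the sum of the weights (assumed form)."""
    total = sum(weights)
    mean = sum(v * w for v, w in zip(values, weights)) / total
    variance = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / total
    return math.sqrt(variance)


def outcome_sd_relative_change(values, unadjusted_weights, adjusted_weights):
    """(weighted sd of adjusted - weighted sd of unadjusted) / weighted sd of unadjusted."""
    sd_unadjusted = weighted_sd(values, unadjusted_weights)
    sd_adjusted = weighted_sd(values, adjusted_weights)
    return (sd_adjusted - sd_unadjusted) / sd_unadjusted


outcome = [1.0, 3.0, 5.0]
prop_same = outcome_sd_relative_change(outcome, [1, 1, 1], [1, 1, 1])
prop_concentrated = outcome_sd_relative_change(outcome, [1, 1, 1], [1, 4, 1])
print(prop_same, prop_concentrated)
```

Putting more weight on the middle value shrinks the weighted sd, so the second result is negative.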
- outcome_variance_ratio() Series [source]¶
The empirical ratio of variance of the outcomes before and after weighting.
See outcome_variance_ratio() for details.
- Parameters:
self (Sample) – A Sample object produced after running Sample.adjust(). It should include 3 components: “unadjusted”, “adjusted”, “target”.
- Returns:
(np.float64) A series of calculated ratio of variances for each outcome.
- Return type:
pd.Series
- outcomes()[source]¶
Produce a BalanceOutcomesDF from a Sample object. See the BalanceOutcomesDF class.
- Parameters:
self (Sample) – Sample object.
- Returns:
BalanceOutcomesDF or None
- plot_weight_density() None [source]¶
Plot the density of weights of Sample.
Examples
import numpy as np
import pandas as pd
from balance.sample_class import Sample

np.random.seed(123)
df = pd.DataFrame(
    {
        "a": np.random.uniform(size=100),
        "c": np.random.choice(
            ["a", "b", "c", "d"],
            size=100,
            replace=True,
            p=[0.01, 0.04, 0.5, 0.45],
        ),
        "id": range(100),
        "weight": np.random.uniform(size=100) + 0.5,
    }
)

sample = Sample.from_frame(df)
sample.weights().plot()
# The same as:
sample.plot_weight_density()
- set_unadjusted(second_sample: Sample) Sample [source]¶
Used to set the unadjusted link to Sample. This is useful in case one wants to compare two samples.
- set_weights(weights: Series | float | None) None [source]¶
Adjust the weights of a Sample object. This will overwrite the weight_column of the Sample. Note that if weights is a pd.Series, the weights are assigned by index (matching the index of Sample.df with the index of the weights series).
- Parameters:
weights (Optional[Union[pd.Series, float]]) – Series of weights to add to sample. If None or float values, the same weight (or None) will be assigned to all units.
- Returns:
None. The weight column of the Sample is overwritten with the given weights.
- summary() str [source]¶
Provides a summary of covariate balance, design effect and model properties (if applicable) of a sample.
For more details see: BalanceDF.asmd(), BalanceDF.asmd_improvement() and weights_stats.design_effect().
- Returns:
a summary description of properties of an adjusted sample.
- Return type:
str
- to_csv(path_or_buf: str | Path | IO | None = None, **kwargs) str | None [source]¶
Write df with ids from BalanceDF to a comma-separated values (csv) file.
Uses pd.DataFrame.to_csv().
If an ‘index’ argument is not provided then it defaults to False.
- Parameters:
self – Object.
path_or_buf (Optional[FilePathOrBuffer], optional) – location where to save the csv.
- Returns:
If path_or_buf is None, returns the resulting csv format as a string. Otherwise returns None.
- Return type:
Optional[str]
- to_download(tempdir: str | None = None) FileLink [source]¶
Creates a downloadable link of the DataFrame of the Sample object.
File name starts with tmp_balance_out_, followed by a random file name (using uuid.uuid4()).
- Parameters:
self (Sample) – Object.
tempdir (Optional[str], optional) – Defaults to None (which then uses a temporary folder using tempfile.gettempdir()).
- Returns:
A local file link embedded in an IPython session, based on path, using FileLink.
- Return type:
FileLink