balance.sample_class

class balance.sample_class.Sample[source]

A class used to represent a sample.

Sample is the main object of balance. It contains a DataFrame of the units’ observations, each associated with an id and a weight.

id_column

a column representing the ids of the units in sample

Type:

pd.Series

weight_column

a column representing the weights of the units in sample

Type:

pd.Series

adjust(target: Sample | None = None, method: Literal['cbps', 'ipw', 'null', 'poststratify', 'rake'] | Callable = 'ipw', *args, **kwargs) Sample[source]

Perform adjustment of one sample to match another. This function returns a new sample.

Parameters:
  • target (Optional["Sample"]) – Second sample object which should be matched. If None, the set target of the object is used for matching.

  • method (str or Callable) – method for adjustment: cbps, ipw, null, poststratify, or rake. Defaults to ipw.

Returns:

an adjusted Sample object

Return type:

Sample

covar_means() DataFrame[source]

Compare the means of covariates (after using BalanceDF.model_matrix()) before and after adjustment as compared with target.

Parameters:

self (Sample) – A Sample object produced by running Sample.adjust(). It should include 3 components: “unadjusted”, “adjusted”, “target”.

Returns:

A DataFrame with 3 columns (“unadjusted”, “adjusted”, “target”), and a row for each feature of the covariates. The cells show the mean value. For categorical features, they are first transformed into the one-hot encoding. For these columns, since they are all either 0 or 1, their means should be interpreted as proportions.

Return type:

pd.DataFrame
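
The note about one-hot columns can be illustrated without balance itself. The following is a minimal sketch in plain Python, not the library's implementation: the helper name one_hot_means and the "c[level]" labels are illustrative assumptions. It shows why the mean of a 0/1 indicator column is the (weighted) proportion of units at that level.

```python
def one_hot_means(values, weights, name="c"):
    """Weighted mean of each 0/1 indicator column, i.e. the weighted share of each level."""
    total = sum(weights)
    means = {}
    for level in sorted(set(values)):
        # 1.0 where the unit has this level, 0.0 otherwise (one-hot encoding).
        indicator = [1.0 if v == level else 0.0 for v in values]
        means[f"{name}[{level}]"] = sum(w * x for w, x in zip(weights, indicator)) / total
    return means


# A categorical column with one unit per level and equal weights.
proportions = one_hot_means(["a", "b", "c"], [1.0, 1.0, 1.0])
print(proportions)
```

With equal weights and one unit per level, each indicator mean is one third, and the means across all levels of one variable sum to 1.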

Examples

from balance import Sample
import pandas as pd

s = Sample.from_frame(
    pd.DataFrame(
        {"a": (0, 1, 2), "c": ("a", "b", "c"), "o": (1, 3, 5), "id": (1, 2, 3)}
    ),
    outcome_columns=("o",),
)
s_adjusted = s.set_target(s).adjust(method="null")
print(s_adjusted.covar_means())

    # source  unadjusted  adjusted    target
    # a         1.000000  1.000000  1.000000
    # c[a]      0.333333  0.333333  0.333333
    # c[b]      0.333333  0.333333  0.333333
    # c[c]      0.333333  0.333333  0.333333

covars()[source]

Produce a BalanceCovarsDF from a Sample object. See BalanceCovarsDF.

Parameters:

self (Sample) – Sample object.

Returns:

BalanceCovarsDF

design_effect() float64[source]

Return the design effect of the weights of Sample. Uses weights_stats.design_effect().

Parameters:

self (Sample) – A Sample object

Returns:

Design effect

Return type:

np.float64
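
As a rough illustration of what a design effect measures, here is a minimal sketch that assumes Kish's formula, n * sum(w^2) / (sum(w))^2. The helper name kish_design_effect is hypothetical, and this is not a copy of weights_stats.design_effect().

```python
def kish_design_effect(weights):
    """Kish's design effect: n * sum(w^2) / (sum(w))^2 (assumed formula)."""
    n = len(weights)
    return n * sum(w * w for w in weights) / sum(weights) ** 2


# Equal weights carry no variance inflation, so the design effect is 1.
equal = kish_design_effect([1.0, 1.0, 1.0, 1.0])

# One unit with a larger weight raises the design effect above 1.
unequal = kish_design_effect([1.0, 1.0, 1.0, 3.0])

print(equal, unequal)
```

Under this formula the value is invariant to rescaling the weights, so it does not matter which normalization the weights use.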

design_effect_prop() float64[source]

Return the relative difference in design effect of the weights of the unadjusted sample and the adjusted sample. I.e. (Deff of adjusted - Deff of unadjusted) / Deff of unadjusted. Uses weights_stats.design_effect().

Parameters:

self (Sample) – A Sample object produced by running Sample.adjust(). It should include 3 components: “unadjusted”, “adjusted”, “target”.

Returns:

relative difference in design effect.

Return type:

np.float64
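
The relative difference described above is simple arithmetic on two design effects. A minimal sketch, with a hypothetical helper name and made-up input values:

```python
def relative_design_effect_change(deff_adjusted, deff_unadjusted):
    """(Deff of adjusted - Deff of unadjusted) / Deff of unadjusted."""
    return (deff_adjusted - deff_unadjusted) / deff_unadjusted


# Hypothetical values: the adjustment raised the design effect from 1.0 to 1.5.
change = relative_design_effect_change(deff_adjusted=1.5, deff_unadjusted=1.0)
print(change)
```

A positive value means the adjusted weights have a larger design effect than the unadjusted ones.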

property df: DataFrame

Produce a DataFrame (of the self) from a Sample object.

Parameters:

self (Sample) – Sample object.

Returns:

A DataFrame with the id_column, and the df values of covars(), outcomes() and weights() of the Sample object.

Return type:

pd.DataFrame

diagnostics() DataFrame[source]

Output a table of diagnostics about an adjusted Sample object.

size

All values in the “size” metrics are AFTER any rows/columns were filtered. So, for example, if we use respondents from previous days but filter them for diagnostics purposes, then sample_obs and target_obs will NOT include them in the counting. The same is true for sample_covars and target_covars. In the “size” metrics we have the following ‘var’s:

  • sample_obs - number of respondents

  • sample_covars - number of covariates (main covars, before any transformations were used)

  • target_obs - number of users used to represent the target pop

  • target_covars - like sample_covars, but for target.

weights_diagnostics

In the “weights_diagnostics” metric we have the following ‘var’s:

  • design effect (de), effective sample size (n/de), effective sample ratio (1/de)

  • sum

  • describe of the (normalized to sample size) weights (mean, median, std, etc.)

  • prop of the (normalized to sample size) weights that are below or above some numbers (1/2, 1, 2, etc.)

  • nonparametric_skew and weighted_median_breakdown_point
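
The first item in this list relates three quantities through the design effect. A minimal sketch of that arithmetic, with hypothetical helper names and made-up inputs:

```python
def effective_sample_size(n, design_effect):
    """Effective sample size: n / de."""
    return n / design_effect


def effective_sample_ratio(design_effect):
    """Effective sample ratio: 1 / de."""
    return 1.0 / design_effect


n = 100    # hypothetical number of respondents
de = 1.25  # hypothetical design effect of the weights

ess = effective_sample_size(n, de)
ratio = effective_sample_ratio(de)
print(ess, ratio)
```

Here 100 weighted respondents carry roughly the information of 80 equally weighted ones.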

Why are the diagnostics focused on weights normalized to sample size

There are 3 well-known normalizations of weights:

  1. to sum to 1

  2. to sum to the target population size

  3. to sum to n (the sample size)

Each one has its own merits:

  1. is good for easily calculating the average of some response variable (we just use sum(w*y), with no need to divide by sum(w)).

  2. is good for totals. For example, how many people in the US use Android? For this we’d like the weight of each person to represent their share of the population, and then we just sum the weights of the people in the survey who use Android.

  3. is good for understanding the relative “importance” of a respondent as compared to the weights of others in the survey. So if someone has a weight that is >1, it means that this respondent (conditional on their covariates) was ‘rare’ in the survey, so the model we used decided to give them a larger weight to account for all the people like them who didn’t answer.

For diagnostics purposes, option 3 is most useful for discussing the distribution of the weights (e.g., how many respondents got a weight >2 or <0.5). This is a method (standardized across surveys) that helps us identify how many of the respondents are “dominating” and have a large influence on the conclusions we draw from the survey.
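
The three normalizations can be compared side by side. This is a minimal sketch in plain Python with made-up weights, outcomes and population size; the helper normalize is hypothetical and not part of balance.

```python
raw_weights = [0.5, 1.0, 1.5, 5.0]  # hypothetical raw weights
outcome = [1, 0, 1, 1]              # hypothetical 0/1 response
population_size = 1000              # hypothetical target population size


def normalize(weights, total):
    """Rescale weights so that they sum to `total`."""
    scale = total / sum(weights)
    return [w * scale for w in weights]


sum_to_one = normalize(raw_weights, 1.0)
sum_to_population = normalize(raw_weights, population_size)
sum_to_n = normalize(raw_weights, len(raw_weights))

# Option 1: the weighted average is just sum(w*y), no division by sum(w) needed.
weighted_mean = sum(w * y for w, y in zip(sum_to_one, outcome))

# Option 2: summing the weights of units with y == 1 estimates a population total.
population_total = sum(w for w, y in zip(sum_to_population, outcome) if y == 1)

# Option 3: weights are comparable to 1, so thresholds such as 2 and 0.5 are meaningful.
share_above_2 = sum(w > 2 for w in sum_to_n) / len(sum_to_n)
share_below_half = sum(w < 0.5 for w in sum_to_n) / len(sum_to_n)

print(weighted_mean, population_total, share_above_2, share_below_half)
```

All three versions describe the same relative weighting; only the scale changes, which is why the threshold-based diagnostics are computed on the sum-to-n version.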

model_glance

Properties of the model fitted, depends on the model used for weighting.

covariates ASMD

Includes covariates ASMD before and after adjustment (per level of covariate and aggregated) and the ASMD improvement.
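
To make the ASMD and its improvement concrete, here is a minimal sketch for a single numeric covariate. It assumes one common definition (absolute difference of weighted means, divided by the weighted standard deviation of the target) and defines improvement as the relative drop in ASMD; the exact choices made by BalanceDF.asmd() and BalanceDF.asmd_improvement() may differ. All names and data are hypothetical.

```python
import math


def weighted_mean(values, weights):
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)


def asmd(sample, sample_weights, target, target_weights):
    """|weighted mean of sample - weighted mean of target| / weighted sd of target."""
    target_mean = weighted_mean(target, target_weights)
    target_var = weighted_mean([(v - target_mean) ** 2 for v in target], target_weights)
    return abs(weighted_mean(sample, sample_weights) - target_mean) / math.sqrt(target_var)


sample = [1.0, 2.0, 3.0]
target = [2.0, 3.0, 4.0, 5.0]
target_weights = [1.0, 1.0, 1.0, 1.0]

# Before adjustment all sample units have weight 1; afterwards the last unit is up-weighted.
asmd_before = asmd(sample, [1.0, 1.0, 1.0], target, target_weights)
asmd_after = asmd(sample, [1.0, 1.0, 4.0], target, target_weights)

# Assumed definition of improvement: relative reduction in ASMD.
improvement = (asmd_before - asmd_after) / asmd_before
print(asmd_before, asmd_after, improvement)
```

Up-weighting the unit closest to the target moves the sample mean from 2.0 to 2.5, so the ASMD shrinks by one third.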

Parameters:

self (Sample) – only after running an adjustment with Sample.adjust.

Returns:

with 3 columns: (“metric”, “val”, “var”), indicating various tracking metrics on the model.

Return type:

pd.DataFrame

classmethod from_frame(df: DataFrame, id_column: str | None = None, outcome_columns: list | tuple | str | None = None, weight_column: str | None = None, check_id_uniqueness: bool = True, standardize_types: bool = True, use_deepcopy: bool = True) Sample[source]

Create a new Sample object.

NOTE that all integer columns will be converted by default into floats. This behavior can be turned off by setting the standardize_types argument to False. The reason this is done by default is missing value handling, combined with balance’s current lack of support for pandas Integer types:

  1. Native numpy integers do not support missing values (NA), while pandas Integers do, as do numpy floats.

  2. Various functions in balance do not support pandas Integers, while they do support numpy floats.

  3. Hence, since some columns might have missing values, the safest solution is to just convert all integers into numpy floats.

The id_column is stored as a string, even if the input is an integer.

Parameters:
  • df (pd.DataFrame) – containing the sample’s data

  • id_column (Optional, Optional[str]) – the column of the df which contains the respondent’s id (should be unique). Defaults to None.

  • outcome_columns (Optional, Optional[Union[list, tuple, str]]) – names of columns to treat as outcome

  • weight_column (Optional, Optional[str]) – name of column to treat as weight. If not specified, will be guessed (either “weight” or “weights”). If not found, a new column will be created (“weight”) and filled with 1.0.

  • check_id_uniqueness (Optional, bool) – Whether to check if ids are unique. Defaults to True.

  • standardize_types (Optional, bool) – Whether to standardize types. Defaults to True. The conversions are: Int64/int64 -> float64; Int32/int32 -> float64; string -> object; pandas.NA -> numpy.nan (within each cell). This is slightly memory intensive (since it copies the data twice), but helps keep various functions working for both Int64 and Int32 input columns.

  • use_deepcopy (Optional, bool) – Whether to have a new df copy inside the sample object. If False, then when the sample methods update the internal df then the original df will also be updated. Defaults to True.

Returns:

a sample object

Return type:

Sample
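
The weight_column fallback described above can be sketched on a plain dict of columns. This is an illustration only: the helper with_weight_column is hypothetical, and the order in which “weight” and “weights” are tried is an assumption, not something the documentation states.

```python
def with_weight_column(data, weight_column=None):
    """Return (copy of data, name of the weight column), creating the column if needed."""
    data = {name: list(values) for name, values in data.items()}  # work on a copy
    if weight_column is None:
        # Guess the weight column; the candidate order is an assumption.
        for candidate in ("weight", "weights"):
            if candidate in data:
                weight_column = candidate
                break
    if weight_column is None:
        # Nothing found: create a "weight" column filled with 1.0.
        weight_column = "weight"
        n_rows = len(next(iter(data.values()), []))
        data[weight_column] = [1.0] * n_rows
    return data, weight_column


created, created_name = with_weight_column({"id": [1, 2], "a": [0, 1]})
guessed, guessed_name = with_weight_column({"id": [1], "weights": [2.0]})
print(created_name, created["weight"], guessed_name)
```

The first call has no weight-like column, so a unit-weight column is added; the second call picks up the existing “weights” column unchanged.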

has_target() bool[source]

Check if a Sample object has target attached.

Returns:

whether the Sample has target attached

Return type:

bool

is_adjusted() bool[source]

Check if a Sample object is adjusted and has target attached

Returns:

whether the Sample is adjusted or not.

Return type:

bool

keep_only_some_rows_columns(rows_to_keep: str | None = None, columns_to_keep: List[str] | None = None) Sample[source]

This function returns a copy of the sample object after filtering the rows and columns of its _df and _links objects (which include the unadjusted and target objects).

This function is useful when wanting to calculate metrics, such as ASMD, but only on some of the features, or part of the observations.

Parameters:
  • self (Sample) – a sample object (preferably after adjustment)

  • rows_to_keep (Optional[str], optional) – A string with a condition to eval (on some of the columns). This will run df.eval(rows_to_keep), which returns a pd.Series of bool by which we filter the Sample object. This affects the df of covars, the weights column (weight_column), the outcome columns (_outcome_columns), and the id_column. Input should be a boolean feature, or a condition such as: ‘gender == “Female” & age >= 18’. Defaults to None.

  • columns_to_keep (Optional[List[str]], optional) – the covariates of interest. Defaults to None, which returns all columns.

Returns:

A copy of the original object. If both rows and columns to keep are None, returns the copied object unchanged. If either is not None, the rows are filtered first and then the columns. This performs the transformation on both the sample’s df and its linked dfs (unadjusted, target).

Return type:

Sample

model() Dict | None[source]

Returns the model (as a dict) used to adjust Sample if adjusted. Otherwise returns None.

Parameters:

self (Sample) – Sample object.

Returns:

the model used for adjusting Sample

Return type:

Dict or None

model_matrix() DataFrame[source]

Returns the model matrix of sample using model_matrix(), while adding na indicator for null values (see add_na_indicator()).

Returns:

model matrix of sample

Return type:

pd.DataFrame

outcome_sd_prop() Series[source]

Return the difference in outcome weighted standard deviation (sd) of the unadjusted sample and the adjusted sample, relative to the unadjusted weighted sd. I.e. (weighted sd of adjusted - weighted sd of unadjusted) / weighted sd of unadjusted. Uses BalanceDF.weighted_stats.weighted_sd().

Parameters:

self (Sample) – A Sample object produced by running Sample.adjust(). It should include 3 components: “unadjusted”, “adjusted”, “target”.

Returns:

(np.float64) relative difference in outcome weighted standard deviation.

Return type:

pd.Series
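
The relative difference in weighted standard deviation can be worked through by hand. The sketch below assumes a population-style weighted variance, sum(w*(y-m)^2)/sum(w); BalanceDF.weighted_stats.weighted_sd() may use a different denominator. The helper names and data are hypothetical.

```python
import math


def weighted_sd(values, weights):
    """Weighted standard deviation with a sum(w) denominator (an assumption)."""
    total = sum(weights)
    mean = sum(w * v for v, w in zip(values, weights)) / total
    variance = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / total
    return math.sqrt(variance)


outcome = [1.0, 3.0, 5.0]
unadjusted_weights = [1.0, 1.0, 1.0]
adjusted_weights = [1.0, 2.0, 1.0]  # hypothetical weights after adjustment

sd_unadjusted = weighted_sd(outcome, unadjusted_weights)
sd_adjusted = weighted_sd(outcome, adjusted_weights)

# (weighted sd of adjusted - weighted sd of unadjusted) / weighted sd of unadjusted
sd_prop = (sd_adjusted - sd_unadjusted) / sd_unadjusted
print(sd_prop)
```

A negative value, as here, means the adjusted weights concentrate mass near the mean and so reduce the weighted spread of the outcome.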

outcome_variance_ratio() Series[source]

The empirical ratio of variance of the outcomes before and after weighting.

See outcome_variance_ratio() for details.

Parameters:

self (Sample) – A Sample object produced by running Sample.adjust(). It should include 3 components: “unadjusted”, “adjusted”, “target”.

Returns:

(np.float64) A series of calculated ratio of variances for each outcome.

Return type:

pd.Series

outcomes()[source]

Produce a BalanceOutcomesDF from a Sample object. See BalanceOutcomesDF.

Parameters:

self (Sample) – Sample object.

Returns:

BalanceOutcomesDF or None

plot_weight_density() None[source]

Plot the density of weights of Sample.

Examples

import numpy as np
import pandas as pd
from balance.sample_class import Sample


np.random.seed(123)
df = pd.DataFrame(
    {
        "a": np.random.uniform(size=100),
        "c": np.random.choice(
            ["a", "b", "c", "d"],
            size=100,
            replace=True,
            p=[0.01, 0.04, 0.5, 0.45],
        ),
        "id": range(100),
        "weight": np.random.uniform(size=100) + 0.5,
    }
)

sample = Sample.from_frame(df)
sample.weights().plot()
# The same as:
sample.plot_weight_density()

set_target(target: Sample) Sample[source]

Used to set the target linked to Sample.

Parameters:

target (Sample) – A Sample object to be linked as target

Returns:

new copy of Sample with target link attached

Return type:

Sample

set_unadjusted(second_sample: Sample) Sample[source]

Used to set the unadjusted link to Sample. This is useful in case one wants to compare two samples.

Parameters:

second_sample (Sample) – A second Sample to be set as unadjusted of Sample.

Returns:

a new copy of Sample with unadjusted link attached to the self object.

Return type:

Sample

set_weights(weights: Series | float | None) None[source]

Adjust the weights of a Sample object. This will overwrite the weight_column of the Sample. Note that if weights is a pd.Series, the weights are assigned by index (matching the index of Sample.df with the index of the weights series).

Parameters:

weights (Optional[Union[pd.Series, float]]) – Series of weights to add to sample. If None or float values, the same weight (or None) will be assigned to all units.

Returns:

None, but adapting the Sample weight column to weights

summary() str[source]

Provides a summary of covariate balance, design effect and model properties (if applicable) of a sample.

For more details see: BalanceDF.asmd(), BalanceDF.asmd_improvement() and weights_stats.design_effect()

Returns:

a summary description of properties of an adjusted sample.

Return type:

str

to_csv(path_or_buf: str | Path | IO | None = None, **kwargs) str | None[source]

Write df with ids from BalanceDF to a comma-separated values (csv) file.

Uses pd.DataFrame.to_csv().

If an ‘index’ argument is not provided then it defaults to False.

Parameters:
  • self – Object.

  • path_or_buf (Optional[FilePathOrBuffer], optional) – location where to save the csv.

Returns:

If path_or_buf is None, returns the resulting csv format as a string. Otherwise returns None.

Return type:

Optional[str]
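
The return contract (a string when path_or_buf is None, otherwise None) can be sketched with the standard library alone. The function write_csv below is hypothetical, handles only in-memory buffers rather than file paths, and is not how balance or pandas implement to_csv.

```python
import csv
import io


def write_csv(header, rows, path_or_buf=None):
    """Write rows as csv; return the text if no buffer is given, else None."""
    target = io.StringIO() if path_or_buf is None else path_or_buf
    writer = csv.writer(target, lineterminator="\n")
    writer.writerow(header)  # no index column is written, mirroring index=False
    writer.writerows(rows)
    return target.getvalue() if path_or_buf is None else None


header = ("id", "a", "weight")
rows = [("1", 0.0, 1.0), ("2", 1.0, 1.0)]

as_text = write_csv(header, rows)            # no buffer: the csv comes back as a string

buffer = io.StringIO()
result = write_csv(header, rows, buffer)     # buffer given: nothing is returned
print(as_text)
```

The same content ends up in the caller's buffer in the second call, which is why no return value is needed there.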

to_download(tempdir: str | None = None) FileLink[source]

Creates a downloadable link of the DataFrame of the Sample object.

The file name starts with tmp_balance_out_, followed by a random suffix (generated with uuid.uuid4()).

Parameters:
  • self (Sample) – Object.

  • tempdir (Optional[str], optional) – Defaults to None (which then uses a temporary folder using tempfile.gettempdir()).

Returns:

A local file link embedded in an IPython session, based on path, created using FileLink.

Return type:

FileLink
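
The file naming scheme described above can be sketched with the standard library. The helper download_path is hypothetical, and the “.csv” extension is an assumption that the documentation does not state.

```python
import os
import tempfile
import uuid


def download_path(tempdir=None):
    """Build a path like <tempdir>/tmp_balance_out_<uuid4>.csv (extension assumed)."""
    if tempdir is None:
        # Fall back to the system temporary folder, as the tempdir parameter describes.
        tempdir = tempfile.gettempdir()
    return os.path.join(tempdir, f"tmp_balance_out_{uuid.uuid4()}.csv")


path = download_path()
print(path)
```

Because uuid.uuid4() is random, repeated calls yield different file names, so earlier downloads are not overwritten.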

weights()[source]

Produce a BalanceWeightsDF from a Sample object. See BalanceWeightsDF.

Parameters:

self (Sample) – Sample object.

Returns:

BalanceWeightsDF