balance.util

class balance.util.TruncationFormatter(*args, **kwargs)[source]

Logging formatter which truncates the logged message to 500 characters.

This is useful in the cases where the logging message includes objects — like DataFrames — whose string representation is very long.

format(record: LogRecord)[source]

Format the specified record as text.

The record’s attribute dictionary is used as the operand to a string formatting operation which yields the returned string. Before formatting the dictionary, a couple of preparatory steps are carried out. The message attribute of the record is computed using LogRecord.getMessage(). If the formatting string uses the time (as determined by a call to usesTime()), formatTime() is called to format the event time. If there is exception information, it is formatted using formatException() and appended to the message.
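A minimal sketch of the described behavior, using only the standard logging module (TruncatingFormatter below is an illustrative stand-in, not the balance implementation):

```python
import logging

# Illustrative sketch: format the record normally, then keep only the
# first 500 characters, as described above.
class TruncatingFormatter(logging.Formatter):
    def format(self, record):
        message = super().format(record)
        return message[:500]

handler = logging.StreamHandler()
handler.setFormatter(TruncatingFormatter("%(levelname)s: %(message)s"))
logger = logging.getLogger("truncation_demo")
logger.addHandler(handler)
logger.warning("x" * 1000)  # emitted message is cut to 500 characters
```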

balance.util.add_na_indicator(df: DataFrame, replace_val_obj: str = '_NA', replace_val_num: int = 0) DataFrame[source]

If a column in the DataFrame contains NAs, replace these with 0 for numerical columns or “_NA” for non-numerical columns, and add another column of an indicator variable for which rows were NA.

Parameters:
  • df (pd.DataFrame) – The input DataFrame

  • replace_val_obj (str, optional) – The value to put instead of nulls for object columns. Defaults to “_NA”.

  • replace_val_num (int, optional) – The value to put instead of nulls for numeric columns. Defaults to 0.

Raises:
  • Exception – Can’t add NA indicator to DataFrame which contains columns which start with ‘_is_na_’

  • Exception – Can’t add NA indicator to columns containing NAs and the value ‘{replace_val_obj}’,

Returns:

New dataframe with additional columns

Return type:

pd.DataFrame
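As an illustration of the described behavior, here is a minimal pandas sketch that mimics it (add_na_indicator_sketch is a hypothetical stand-in, not the balance implementation; the ‘_is_na_’ prefix follows the error messages above):

```python
import numpy as np
import pandas as pd

# Sketch: fill NAs per column dtype, and add an '_is_na_<col>' indicator
# column for every column that contained NAs.
def add_na_indicator_sketch(df, replace_val_obj="_NA", replace_val_num=0):
    out = df.copy()
    for col in df.columns:
        is_na = df[col].isna()
        if is_na.any():
            fill = (replace_val_num
                    if pd.api.types.is_numeric_dtype(df[col])
                    else replace_val_obj)
            out[col] = df[col].fillna(fill)
            out["_is_na_" + col] = is_na
    return out

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", None, "z"]})
print(add_na_indicator_sketch(df))
    #      a    b  _is_na_a  _is_na_b
    # 0  1.0    x     False     False
    # 1  0.0  _NA      True      True
    # 2  3.0    z     False     False
```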

balance.util.auto_spread(data: DataFrame, features: list | None = None, id_: str = 'id') DataFrame[source]

Automatically transform a ‘long’ DataFrame into a ‘wide’ DataFrame by guessing which column should be used as a key, treating all other columns as values. At the moment, this will only find a single key column.

Parameters:
  • data (pd.DataFrame)

  • features (Optional[list], optional) – Defaults to None.

  • id (str, optional) – Defaults to “id”.

Returns:

pd.DataFrame
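The long-to-wide reshaping described above can be sketched with plain pandas when the key column is already known (auto_spread’s extra step is guessing that column; the column names ‘id’, ‘key’, ‘value’ here are illustrative):

```python
import pandas as pd

# A 'long' table: one row per (id, key) pair.
long_df = pd.DataFrame({
    "id": [1, 1, 2, 2],
    "key": ["height", "weight", "height", "weight"],
    "value": [170, 60, 180, 75],
})

# Spread it into a 'wide' table: one row per id, one column per key.
wide_df = (long_df.pivot(index="id", columns="key", values="value")
           .rename_axis(None, axis=1)
           .reset_index())
print(wide_df)
    #    id  height  weight
    # 0   1     170      60
    # 1   2     180      75
```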

balance.util.build_model_matrix(df: DataFrame, formula: str = '.', factor_variables: List | None = None, return_sparse: bool = False) Dict[str, Any][source]

Build a model matrix from a formula (using patsy.dmatrix)

Parameters:
  • df (pd.DataFrame) – The data from which to create the model matrix (pandas dataframe)

  • formula (str, optional) – a string representing the formula to use for building the model matrix. Default is additive formula with all variables in df. Defaults to “.”.

  • factor_variables (Optional[List], optional) – list of names of factor variables that we use one_hot_encoding_greater_2 for. Default is None, in which case no special contrasts are applied (uses patsy defaults). one_hot_encoding_greater_2 creates one-hot-encoding for all categorical variables with more than 2 categories (i.e. the number of columns will be equal to the number of categories), and only 1 column for variables with 2 levels (treatment contrast).

  • return_sparse (bool, optional) – whether to return a sparse matrix using scipy.sparse.csc_matrix. Defaults to False.

Raises:
  • Exception – “Variable names cannot contain characters ‘[’ or ‘]’”

  • Exception – “Not all factor variables are contained in df”

Returns:

A dictionary of 2 elements:
  1. model_matrix - this is a pd dataframe or a csc_matrix (depends on return_sparse), ordered by column names

  2. model_matrix_columns - A list of the column names of model_matrix (we include model_matrix_columns as a separate element since a sparse X_matrix has no column-names attribute, so these need to be kept separately; see: https://stackoverflow.com/questions/35086940/how-can-i-give-row-and-column-names-to-scipys-csr-matrix.)

Return type:

Dict[str, Any]

Examples

import pandas as pd
d = {'a': ['a1','a2','a1','a1'], 'b': ['b1','b2','b3','b3']}
df = pd.DataFrame(data=d)

print(build_model_matrix(df, 'a'))
    # {'model_matrix':    a[a1]  a[a2]
    # 0    1.0    0.0
    # 1    0.0    1.0
    # 2    1.0    0.0
    # 3    1.0    0.0,
    # 'model_matrix_columns': ['a[a1]', 'a[a2]']}


print(build_model_matrix(df, '.'))
    # {'model_matrix':    a[a1]  a[a2]  b[T.b2]  b[T.b3]
    # 0    1.0    0.0      0.0      0.0
    # 1    0.0    1.0      1.0      0.0
    # 2    1.0    0.0      0.0      1.0
    # 3    1.0    0.0      0.0      1.0,
    # 'model_matrix_columns': ['a[a1]', 'a[a2]', 'b[T.b2]', 'b[T.b3]']}


print(build_model_matrix(df, '.', factor_variables=['a']))
    # {'model_matrix':    C(a, one_hot_encoding_greater_2)[a2]  b[T.b2]  b[T.b3]
    # 0                                0.0      0.0      0.0
    # 1                                1.0      1.0      0.0
    # 2                                0.0      0.0      1.0
    # 3                                0.0      0.0      1.0,
    # 'model_matrix_columns': ['C(a, one_hot_encoding_greater_2)[a2]', 'b[T.b2]', 'b[T.b3]']}


print(build_model_matrix(df, 'a', return_sparse=True))
    # {'model_matrix': <4x2 sparse matrix of type '<class 'numpy.float64'>'
    # with 4 stored elements in Compressed Sparse Column format>, 'model_matrix_columns': ['a[a1]', 'a[a2]']}
print(build_model_matrix(df, 'a', return_sparse=True)["model_matrix"].toarray())
    # [[1. 0.]
    # [0. 1.]
    # [1. 0.]
    # [1. 0.]]
balance.util.choose_variables(*dfs: DataFrame | Any, variables: List | set | None = None, df_for_var_order: int = 0) List[str][source]

Returns a list of the joint variables (the intersection) present in all the input dataframes, further restricted to the variables set or list if one is provided. The order of the returned variables is conditional on the input:

  • If a variables argument is supplied as a list - the order will be based on the order in the variables list.

  • If variables is not a list (e.g.: a set or None), the order is determined by the order of the columns in the dataframes supplied. The dataframe chosen for the order is determined by the df_for_var_order argument. 0 means the order from the first df, 1 means the order from the second df, etc.

Parameters:
  • *dfs (Union[pd.DataFrame, Any]) – One or more pandas.DataFrames or balance.Samples.

  • variables (Optional[Union[List, set]]) – The variables to choose from. If None, returns all joint variables found in the input dataframes. Defaults to None.

  • df_for_var_order – Index of the dataframe used to determine the order of the variables in the output list. Defaults to 0. This is used only if the variables argument is not a list (e.g.: a set or None).
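The ordering rules above can be sketched as follows (choose_variables_sketch is an illustrative stand-in for DataFrame inputs only, not the balance implementation, which also accepts Sample objects):

```python
import pandas as pd

# Sketch: intersect the columns of several DataFrames; order by the
# variables list if given, else by the columns of dfs[df_for_var_order].
def choose_variables_sketch(*dfs, variables=None, df_for_var_order=0):
    joint = set(dfs[0].columns)
    for df in dfs[1:]:
        joint &= set(df.columns)
    if isinstance(variables, list):
        return [v for v in variables if v in joint]
    if variables is not None:
        joint &= set(variables)
    return [c for c in dfs[df_for_var_order].columns if c in joint]

df1 = pd.DataFrame(columns=["b", "a", "c"])
df2 = pd.DataFrame(columns=["c", "b", "d"])
print(choose_variables_sketch(df1, df2))                        # ['b', 'c']
print(choose_variables_sketch(df1, df2, variables=["c", "b"]))  # ['c', 'b']
```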

balance.util.dot_expansion(formula, variables: List)[source]

Build a formula string by replacing “.” with a sum of all the variables. If no dot appears, returns the formula as is.

This function is named for the ‘dot’ operator in R, where a formula given as ‘~ .’ means “use all variables in the dataframe”.

Parameters:
  • formula – The formula to expand.

  • variables (List) – List of all variables in the dataframe we build the formula for.

Raises:
  • Exception – “Variables should not be empty. Please provide a list of strings.”

  • Exception – “Variables should be a list of strings and have to be included.”

Returns:

A string formula in which the ‘.’ is replaced with all the variables in variables. If no ‘.’ is present, then the original formula is returned as is.

Examples

dot_expansion('.', ['a','b','c','d']) # (a+b+c+d)
dot_expansion('b:(. - a)', ['a','b','c','d']) # b:((a+b+c+d) - a)
dot_expansion('a*b', ['a','b','c','d']) # a*b
dot_expansion('.', None) # Raise error

import pandas as pd
d = {'a': ['a1','a2','a1','a1'], 'b': ['b1','b2','b3','b3'],
            'c': ['c1','c1','c2','c1'], 'd':['d1','d1','d2','d3']}
df = pd.DataFrame(data=d)
dot_expansion('.', df) # Raise error
dot_expansion('.', list(df.columns)) # (a+b+c+d)
balance.util.drop_na_rows(sample_df: DataFrame, sample_weights: Series, name: str = 'sample object') Tuple[DataFrame, Series][source]

Drop rows with missing values in sample_df along with their corresponding weights in sample_weights. The same function can be applied to a target dataframe and its weights.

Parameters:
  • sample_df (pd.DataFrame) – a dataframe representing the sample or target

  • sample_weights (pd.Series) – design weights for sample or target

  • name (str, optional) – name of object checked (used for warnings prints). Defaults to “sample object”.

Raises:

ValueError – Dropping rows led to empty {name}. Maybe try na_action=’add_indicator’?

Returns:

sample_df, sample_weights without NAs rows

Return type:

Tuple[pd.DataFrame, pd.Series]
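The core of the row-dropping behavior can be sketched in plain pandas (illustrative only; balance.util.drop_na_rows additionally warns and raises as documented above):

```python
import numpy as np
import pandas as pd

sample_df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", None]})
sample_weights = pd.Series([0.5, 1.0, 2.0])

# Keep only rows with no missing values, and the matching weights by index.
keep = sample_df.notna().all(axis=1)
sample_df, sample_weights = sample_df[keep], sample_weights[keep]
print(sample_df)
    #      a  b
    # 0  1.0  x
print(sample_weights)
    # 0    0.5
    # dtype: float64
```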

balance.util.fct_lump(s: Series, prop: float = 0.05) Series[source]

Lumps infrequent levels into ‘_lumped_other’. Note that all levels whose proportion is less than prop are mapped to the same output value, ‘_lumped_other’.

Parameters:
  • s (pd.Series) – pd.series to lump, with dtype of integer, numeric, object, or category (category will be converted to object)

  • prop (float, optional) – the proportion of infrequent levels to lump. Defaults to 0.05.

Returns:

pd.series (with category dtype converted to object, if applicable)

Return type:

pd.Series

Examples

from balance.util import fct_lump
import pandas as pd

s = pd.Series(['a','a','b','b','c','a','b'], dtype = 'category')
fct_lump(s, 0.25)
    # 0                a
    # 1                a
    # 2                b
    # 3                b
    # 4    _lumped_other
    # 5                a
    # 6                b
    # dtype: object
balance.util.fct_lump_by(s: Series, by: Series, prop: float = 0.05) Series[source]

Lumps infrequent levels into ‘_lumped_other’, but does so separately per value of the grouping variable by. Useful, for example, for keeping the most important interactions in a model.

Parameters:
  • s (pd.Series) – pd.series to lump

  • by (pd.Series) – pd.series according to which to group the data

  • prop (float, optional) – the proportion of infrequent levels to lump. Defaults to 0.05.

Returns:

pd.series, we keep the index of s as the index of the result.

Return type:

pd.Series

Examples

s = pd.Series([1,1,1,2,3,1,2])
by = pd.Series(['a','a','a','a','a','b','b'])
fct_lump_by(s, by, 0.5)
    # 0                1
    # 1                1
    # 2                1
    # 3    _lumped_other
    # 4    _lumped_other
    # 5                1
    # 6                2
    # dtype: object
balance.util.find_items_index_in_list(a_list: List[Any], items: List[Any]) List[int][source]

Finds the index locations of the given items in a list.

Parameters:
  • a_list (List[Any]) – the list in which to search for the items’ indices

  • items (List[Any]) – a list of items to search for

Returns:

a list of indices of the items in a_list that appear in the items list.

Return type:

List[int]

Examples

l = [1,2,3,4,5,6,7]
items = [2,7]
find_items_index_in_list(l, items)
    # [1, 6]

items = [1000]
find_items_index_in_list(l, items)
    # []

l = ["a", "b", "c"]
items = ["c", "c", "a"]
find_items_index_in_list(l, items)
    # [2, 2, 0]

type(find_items_index_in_list(l, items)[0])
    # int
balance.util.formula_generator(variables, formula_type: str = 'additive') str[source]

Create a formula to build the model matrix

Default is additive formula.

Parameters:
  • variables – list with names of variables (as strings) to combine into a formula

  • formula_type (str, optional) – how to construct the formula. Currently only “additive” is supported. Defaults to “additive”.

Raises:

Exception – “This formula type is not supported. Please provide a string formula.”

Returns:

A string representing the formula

Return type:

str

Examples

formula_generator(['a','b','c'])
# returns 'c + b + a'
balance.util.get_items_from_list_via_indices(a_list: List[Any], indices: List[int]) List[Any][source]

Gets a subset of items from a list via indices

Source code (there doesn’t seem to be a better solution): https://stackoverflow.com/a/6632209

Parameters:
  • a_list (List[Any]) – a list of items to extract a list from

  • indices (List[int]) – a list of indexes of items to get

Returns:

a list of extracted items

Return type:

List[Any]

Examples

l = ["a", "b", "c", "d"]
get_items_from_list_via_indices(l, [2, 0])
    # ['c', 'a']

get_items_from_list_via_indices(l, [100])
    # IndexError
balance.util.guess_id_column(dataset: DataFrame, column_name: str | None = None)[source]

Guess the id column of a given dataset. Currently the only possible guess is a column named ‘id’.

Parameters:
  • dataset (pd.DataFrame) – the dataset whose id column to guess

  • column_name (str, optional) – a given id column name. Defaults to None, in which case the id column is guessed or an exception is raised.

Returns:

name of guessed id column

Return type:

str
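The guessing rule described above can be sketched as follows (guess_id_column_sketch is an illustrative stand-in; the exact error messages of the real function may differ):

```python
import pandas as pd

# Sketch: use the given column name if supplied and present,
# else fall back to a column literally named 'id', else raise.
def guess_id_column_sketch(dataset, column_name=None):
    if column_name is not None:
        if column_name in dataset.columns:
            return column_name
        raise ValueError(f"Dataset does not have a column named {column_name}")
    if "id" in dataset.columns:
        return "id"
    raise ValueError("Cannot guess the id column of this dataset.")

df = pd.DataFrame({"id": [1, 2], "x": [3, 4]})
print(guess_id_column_sketch(df))       # id
print(guess_id_column_sketch(df, "x"))  # x
```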

balance.util.model_matrix(sample: DataFrame | Any, target: DataFrame | Any | None = None, variables: List | None = None, add_na: bool = True, return_type: str = 'two', return_var_type: str = 'dataframe', formula: List[str] | None = None, penalty_factor: List[float] | None = None, one_hot_encoding: bool = False) Dict[str, List[str] | ndarray[Any, dtype[ScalarType]] | DataFrame | csc_matrix | None][source]

Create a model matrix from a sample (and target). The default is to use an additive formula for all variables (or the ones specified). Can also create a custom model matrix if a formula is provided.

Parameters:
  • sample (Union[pd.DataFrame, Any]) – The Samples from which to create the model matrix. This can either be a DataFrame or a Sample object.

  • target (Union[pd.DataFrame, Any, None], optional) – See sample. Defaults to None. This can either be a DataFrame or a Sample object.

  • variables (Optional[List]) – the names of the variables to include (when ‘None’ then all joint variables to target and sample are used). Defaults to None.

  • add_na (bool, optional) – whether to call add_na_indicator on the data before constructing the matrix. If add_na = True, then the function add_na_indicator is applied, i.e. if a column in the DataFrame contains NAs, these are replaced with 0 or “_NA”, and another column is added as an indicator variable for which rows were NA. If add_na is False, observations with any missing data will be omitted from the model. Defaults to True.

  • return_type (str, optional) – whether to return a single matrix (‘one’), or a dict of sample and target matrices. Defaults to “two”.

  • return_var_type (str, optional) – whether to return a “dataframe” (pd.dataframe) a “matrix” (np.ndarray) (i.e. only values of the output dataframe), or a “sparse” matrix. Defaults to “dataframe”.

  • formula (Optional[List[str]], optional) – the formula according to which to construct the matrix. If no formula is provided, an additive formula is applied. This may be a string or a list of strings representing different parts of the formula that will be concatenated together. Defaults to None, which will create an additive formula from the available variables.

  • penalty_factor (Optional[List[float]], optional) – the penalty used in the glmnet function in ipw. The penalty should have the same length as the formula list. If not provided, assume the same penalty for all variables. Defaults to None.

  • one_hot_encoding (bool, optional) – whether to encode all factor variables in the model matrix with one_hot_encoding_greater_2. This is recommended in case of using LASSO on the data (Default: False). one_hot_encoding_greater_2 creates one-hot-encoding for all categorical variables with more than 2 categories (i.e. the number of columns will be equal to the number of categories), and only 1 column for variables with 2 levels (treatment contrast). Defaults to False.

Returns:

a dict of:
  1. ”model_matrix_columns_names”: column names of the model matrix

  2. ”penalty_factor”: a penalty factor for each column in the model matrix

  3. ”model_matrix” (or: “sample” and “target”): the DataFrames for the sample and target (one or two, according to return_type). If return_sparse=True, sparse matrices (csc_matrix) are returned instead.

Return type:

Dict[ str, Union[List[str], np.ndarray, Union[pd.DataFrame, np.ndarray, csc_matrix], None] ]

Examples

import pandas as pd
d = {'a': ['a1','a2','a1','a1'], 'b': ['b1','b2','b3','b3']}
df = pd.DataFrame(data=d)

model_matrix(df)
    # {'model_matrix_columns_names': ['b[b1]', 'b[b2]', 'b[b3]', 'a[T.a2]'],
    #  'penalty_factor': array([1, 1, 1, 1]),
    #  'sample':    b[b1]  b[b2]  b[b3]  a[T.a2]
    #  0    1.0    0.0    0.0      0.0
    #  1    0.0    1.0    0.0      1.0
    #  2    0.0    0.0    1.0      0.0
    #  3    0.0    0.0    1.0      0.0,
    #  'target': None}

model_matrix(df, formula = 'a*b')
    # {'model_matrix_columns_names': ['a[a1]',
    #   'a[a2]',
    #   'b[T.b2]',
    #   'b[T.b3]',
    #   'a[T.a2]:b[T.b2]',
    #   'a[T.a2]:b[T.b3]'],
    #  'penalty_factor': array([1, 1, 1, 1, 1, 1]),
    #  'sample':    a[a1]  a[a2]  b[T.b2]  b[T.b3]  a[T.a2]:b[T.b2]  a[T.a2]:b[T.b3]
    #  0    1.0    0.0      0.0      0.0              0.0              0.0
    #  1    0.0    1.0      1.0      0.0              1.0              0.0
    #  2    1.0    0.0      0.0      1.0              0.0              0.0
    #  3    1.0    0.0      0.0      1.0              0.0              0.0,
    #  'target': None}

model_matrix(df, formula = ['a','b'], penalty_factor=[1,2])
    # {'model_matrix_columns_names': ['a[a1]', 'a[a2]', 'b[b1]', 'b[b2]', 'b[b3]'],
    #  'penalty_factor': array([1, 1, 2, 2, 2]),
    #  'sample':    a[a1]  a[a2]  b[b1]  b[b2]  b[b3]
    #  0    1.0    0.0    1.0    0.0    0.0
    #  1    0.0    1.0    0.0    1.0    0.0
    #  2    1.0    0.0    0.0    0.0    1.0
    #  3    1.0    0.0    0.0    0.0    1.0,
    #  'target': None}

model_matrix(df, formula = ['a','b'], penalty_factor=[1,2], one_hot_encoding=True)
    # {'model_matrix_columns_names': ['C(a, one_hot_encoding_greater_2)[a2]',
    #   'C(b, one_hot_encoding_greater_2)[b1]',
    #   'C(b, one_hot_encoding_greater_2)[b2]',
    #   'C(b, one_hot_encoding_greater_2)[b3]'],
    #  'penalty_factor': array([1, 2, 2, 2]),
    #  'sample':    C(a, one_hot_encoding_greater_2)[a2]  ...  C(b, one_hot_encoding_greater_2)[b3]
    #  0                                0.0  ...                                0.0
    #  1                                1.0  ...                                0.0
    #  2                                0.0  ...                                1.0
    #  3                                0.0  ...                                1.0
    # [4 rows x 4 columns],
    # 'target': None}

model_matrix(df, formula = ['a','b'], penalty_factor=[1,2], return_sparse = True)
    # {'model_matrix_columns_names': ['a[a1]', 'a[a2]', 'b[b1]', 'b[b2]', 'b[b3]'],
    #  'penalty_factor': array([1, 1, 2, 2, 2]),
    #  'sample': <4x5 sparse matrix of type '<class 'numpy.float64'>'
    #       with 8 stored elements in Compressed Sparse Column format>,
    #  'target': None}

model_matrix(df, target = df)
    # {'model_matrix_columns_names': ['b[b1]', 'b[b2]', 'b[b3]', 'a[T.a2]'],
    #  'penalty_factor': array([1, 1, 1, 1]),
    #  'sample':    b[b1]  b[b2]  b[b3]  a[T.a2]
    #  0    1.0    0.0    0.0      0.0
    #  1    0.0    1.0    0.0      1.0
    #  2    0.0    0.0    1.0      0.0
    #  3    0.0    0.0    1.0      0.0,
    #  'target':    b[b1]  b[b2]  b[b3]  a[T.a2]
    #  0    1.0    0.0    0.0      0.0
    #  1    0.0    1.0    0.0      1.0
    #  2    0.0    0.0    1.0      0.0
    #  3    0.0    0.0    1.0      0.0}

model_matrix(df, target = df, return_type = "one")
    # {'model_matrix_columns_names': ['b[b1]', 'b[b2]', 'b[b3]', 'a[T.a2]'],
    #  'penalty_factor': array([1, 1, 1, 1]),
    #  'model_matrix':    b[b1]  b[b2]  b[b3]  a[T.a2]
    #  0    1.0    0.0    0.0      0.0
    #  1    0.0    1.0    0.0      1.0
    #  2    0.0    0.0    1.0      0.0
    #  3    0.0    0.0    1.0      0.0
    #  0    1.0    0.0    0.0      0.0
    #  1    0.0    1.0    0.0      1.0
    #  2    0.0    0.0    1.0      0.0
    #  3    0.0    0.0    1.0      0.0}

model_matrix(df, target = df, formula=['a','b'],return_type = "one")
    # {'model_matrix_columns_names': ['a[a1]', 'a[a2]', 'b[b1]', 'b[b2]', 'b[b3]'],
    #  'penalty_factor': array([1, 1, 1, 1, 1]),
    #  'model_matrix':    a[a1]  a[a2]  b[b1]  b[b2]  b[b3]
    #  0    1.0    0.0    1.0    0.0    0.0
    #  1    0.0    1.0    0.0    1.0    0.0
    #  2    1.0    0.0    0.0    0.0    1.0
    #  3    1.0    0.0    0.0    0.0    1.0
    #  0    1.0    0.0    1.0    0.0    0.0
    #  1    0.0    1.0    0.0    1.0    0.0
    #  2    1.0    0.0    0.0    0.0    1.0
    #  3    1.0    0.0    0.0    0.0    1.0}
class balance.util.one_hot_encoding_greater_2(reference: int = 0)[source]

This class creates a special encoding for a factor variable to be used in a LASSO model. For variables with exactly two levels, using this in dmatrix will keep only one level, i.e. it will create one column with a 0/1 indicator for one of the levels. The level kept will be the second one, based on the lexicographical order of the levels. For variables with more than 2 levels, using this in dmatrix will keep all levels as columns of the matrix.

References:
  1. More about this encoding: https://stats.stackexchange.com/questions/69804/group-categorical-variables-in-glmnet/107958#107958

  2. Source code: adaptation of https://patsy.readthedocs.io/en/latest/categorical-coding.html

Examples

import pandas as pd
d = {'a': ['a1','a2','a1','a1'], 'b': ['b1','b2','b3','b3'],
            'c': ['c1','c1','c2','c1'], 'd': ['d1','d1','d2','d3']}
df = pd.DataFrame(data=d)

print(dmatrix('C(a, one_hot_encoding_greater_2)', df, return_type='dataframe'))
    #    Intercept  C(a, one_hot_encoding_greater_2)[a2]
    # 0        1.0                                   0.0
    # 1        1.0                                   1.0
    # 2        1.0                                   0.0
    # 3        1.0                                   0.0

print(dmatrix('C(a, one_hot_encoding_greater_2)-1', df, return_type='dataframe'))
    #    C(a, one_hot_encoding_greater_2)[a2]
    # 0                                   0.0
    # 1                                   1.0
    # 2                                   0.0
    # 3                                   0.0

print(dmatrix('C(b, one_hot_encoding_greater_2)', df, return_type='dataframe'))
    #    Intercept  C(b, one_hot_encoding_greater_2)[b1]
    # 0        1.0                                   1.0
    # 1        1.0                                   0.0
    # 2        1.0                                   0.0
    # 3        1.0                                   0.0
    #
    #    C(b, one_hot_encoding_greater_2)[b2]  C(b, one_hot_encoding_greater_2)[b3]
    # 0                                   0.0                                   0.0
    # 1                                   1.0                                   0.0
    # 2                                   0.0                                   1.0
    # 3                                   0.0                                   1.0

print(dmatrix('C(b, one_hot_encoding_greater_2)-1', df, return_type='dataframe'))
    #    C(b, one_hot_encoding_greater_2)[b1]  C(b, one_hot_encoding_greater_2)[b2]
    # 0                                   1.0                                   0.0
    # 1                                   0.0                                   1.0
    # 2                                   0.0                                   0.0
    # 3                                   0.0                                   0.0
    #
    #    C(b, one_hot_encoding_greater_2)[b3]
    # 0                                   0.0
    # 1                                   0.0
    # 2                                   1.0
    # 3                                   1.0

d = {'a': ['a1','a1','a1','a1'], 'b': ['b1','b2','b3','b3']}
df = pd.DataFrame(data=d)

print(dmatrix('C(a, one_hot_encoding_greater_2)-1', df, return_type='dataframe'))
    #    C(a, one_hot_encoding_greater_2)[a1]
    # 0                                   1.0
    # 1                                   1.0
    # 2                                   1.0
    # 3                                   1.0

print(dmatrix('C(a, one_hot_encoding_greater_2):C(b, one_hot_encoding_greater_2)-1', df, return_type='dataframe'))
    #    C(a, one_hot_encoding_greater_2)[a1]:C(b, one_hot_encoding_greater_2)[b1]
    # 0                                                                        1.0
    # 1                                                                        0.0
    # 2                                                                        0.0
    # 3                                                                        0.0
    #
    #    C(a, one_hot_encoding_greater_2)[a1]:C(b, one_hot_encoding_greater_2)[b2]
    # 0                                                                        0.0
    # 1                                                                        1.0
    # 2                                                                        0.0
    # 3                                                                        0.0

balance.util.process_formula(formula, variables: List, factor_variables=None)[source]

Process a formula string:
  1. Expand . notation using the dot_expansion function

  2. Remove the intercept (if using ipw, it will be added automatically by cvglmnet)

  3. If factor_variables is not None, apply one_hot_encoding_greater_2 to factor_variables

Parameters:
  • formula – A string representing the formula

  • variables (List) – list of all variables to include (usually all variables in data)

  • factor_variables – list of names of factor variables that we use one_hot_encoding_greater_2 for. Note that these should be also part of variables. Default is None, in which case no special contrasts are applied (using patsy defaults). one_hot_encoding_greater_2 creates one-hot-encoding for all categorical variables with more than 2 categories (i.e. the number of columns will be equal to the number of categories), and only 1 column for variables with 2 levels (treatment contrast).

Raises:

Exception – “Not all factor variables are contained in variables”

Returns:

a ModelDesc object to build a model matrix using patsy.dmatrix.

Examples

f1 = process_formula('a:(b+aab)', ['a','b','aab'])
print(f1)
    # ModelDesc(lhs_termlist=[],
    #       rhs_termlist=[Term([EvalFactor('a'), EvalFactor('b')]),
    #                     Term([EvalFactor('a'), EvalFactor('aab')])])
f2 = process_formula('a:(b+aab)', ['a','b','aab'], ['a','b'])
print(f2)
    # ModelDesc(lhs_termlist=[],
    #       rhs_termlist=[Term([EvalFactor('C(a, one_hot_encoding_greater_2)'),
    #                           EvalFactor('C(b, one_hot_encoding_greater_2)')]),
    #                     Term([EvalFactor('C(a, one_hot_encoding_greater_2)'),
    #                           EvalFactor('aab')])])
balance.util.qcut(s, q, duplicates: str = 'drop', **kwargs)[source]

Discretize a variable into equal-sized buckets based on quantiles. This is a wrapper around the pandas qcut function.

Parameters:
  • s (np.ndarray or pd.Series) – 1d ndarray or Series.

  • q (int or float) – Number of quantiles.

  • duplicates (str, optional) – whether to drop non unique bin edges or raise error (“raise” or “drop”). Defaults to “drop”.

Returns:

Series of type object with intervals.
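Since this is a wrapper around pandas.qcut, the effect of duplicates=”drop” can be seen with pandas directly (a plain pandas sketch, not a call to the balance wrapper):

```python
import pandas as pd

# Half the values are identical, so one quartile edge is duplicated;
# duplicates="drop" removes it instead of raising.
s = pd.Series([1, 1, 1, 1, 2, 3, 4, 5])
binned = pd.qcut(s, q=4, duplicates="drop")
print(binned.nunique())  # 3 -- one of the four quartile bins was dropped
```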

balance.util.quantize(df: DataFrame | Series, q: int = 10, variables=None) DataFrame[source]

Cut the numeric variables of a DataFrame into quantile buckets

Parameters:
  • df (Union[pd.DataFrame, pd.Series]) – a DataFrame to transform

  • q (int, optional) – Number of buckets to create for each variable. Defaults to 10.

  • variables (optional) – variables to transform. If None, all numeric variables are transformed. Defaults to None.

Returns:

DataFrame after quantization. numpy.nan values are kept as is.

Return type:

pd.DataFrame

Examples

from balance.util import quantize
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1,1,2,20,22,23,np.nan], "b": range(7), "c": range(7), "d": [1,1,np.nan,20,5,23,np.nan]})
print(quantize(df, q = 3))

    #             b               d              c                a
    # 0  (-0.001, 2.0]  (0.999, 2.333]  (-0.001, 2.0]   (0.999, 1.667]
    # 1  (-0.001, 2.0]  (0.999, 2.333]  (-0.001, 2.0]   (0.999, 1.667]
    # 2  (-0.001, 2.0]             NaN  (-0.001, 2.0]  (1.667, 20.667]
    # 3     (2.0, 4.0]    (15.0, 23.0]     (2.0, 4.0]  (1.667, 20.667]
    # 4     (2.0, 4.0]   (2.333, 15.0]     (2.0, 4.0]   (20.667, 23.0]
    # 5     (4.0, 6.0]    (15.0, 23.0]     (4.0, 6.0]   (20.667, 23.0]
    # 6     (4.0, 6.0]             NaN     (4.0, 6.0]              NaN
balance.util.rm_mutual_nas(*args) List[source]

Remove entries at any position that is NA or infinite in any of the arguments.

Ignores args which are None.

Can accept multiple array-like arguments or a single array-like argument. Handles pandas and numpy arrays.

Raises:
  • ValueError – If any argument is not array-like. (see: _is_arraylike())

  • ValueError – If arguments include arrays of different lengths.

Returns:

A list containing the original input arrays, after removing elements that have a missing or infinite value in the same position as any of the other arrays.

Return type:

List

Examples

import pandas as pd
import numpy as np

x1 = pd.array([1,2, None, np.nan, pd.NA, 3])
x2 = pd.array([1.1,2,3, None, np.nan, pd.NA])
x3 = pd.array([1.1,2,3, 4,5,6])
x4 = pd.array(["1.1",2,3, None, np.nan, pd.NA])
x5 = pd.array(["1.1","2","3", None, np.nan, pd.NA], dtype = "string")
x6 = np.array([1,2,3.3,4,5,6])
x7 = np.array([1,2,3.3,4,"5","6"])
x8 = [1,2,3.3,4,"5","6"]
(x1,x2, x3, x4, x5, x6, x7, x8)
    # (<IntegerArray>
    # [1, 2, <NA>, <NA>, <NA>, 3]
    # Length: 6, dtype: Int64,
    # <PandasArray>
    # [1.1, 2, 3, None, nan, <NA>]
    # Length: 6, dtype: object,
    # <PandasArray>
    # [1.1, 2.0, 3.0, 4.0, 5.0, 6.0]
    # Length: 6, dtype: float64,
    # <PandasArray>
    # ['1.1', 2, 3, None, nan, <NA>]
    # Length: 6, dtype: object,
    # <StringArray>
    # ['1.1', '2', '3', <NA>, <NA>, <NA>]
    # Length: 6, dtype: string,
    # array([1. , 2. , 3.3, 4. , 5. , 6. ]),
    # array(['1', '2', '3.3', '4', '5', '6'], dtype='<U32'),
    # [1, 2, 3.3, 4, '5', '6'])

from balance.util import rm_mutual_nas
rm_mutual_nas(x1,x2, x3, x4, x5,x6,x7,x8)
    # [<IntegerArray>
    #  [1, 2]
    #  Length: 2, dtype: Int64,
    #  <PandasArray>
    #  [1.1, 2]
    #  Length: 2, dtype: object,
    #  <PandasArray>
    #  [1.1, 2.0]
    #  Length: 2, dtype: float64,
    #  <PandasArray>
    #  ['1.1', 2]
    #  Length: 2, dtype: object,
    #  <StringArray>
    #  ['1.1', '2']
    #  Length: 2, dtype: string,
    #  array([1., 2.]),
    #  array(['1', '2'], dtype='<U32'),
    #  [1, 2]]

# Preserve the index values in the resulting pd.Series:
x1 = pd.Series([1, 2, 3, 4])
x2 = pd.Series([np.nan, 2, 3, 4])
x3 = np.array([1, 2, 3, 4])
print(rm_mutual_nas(x1, x2)[0])
print(rm_mutual_nas(x1.sort_values(ascending=False), x2)[0])
print(rm_mutual_nas(x1, x3)[0])
    # 1    2
    # 2    3
    # 3    4
    # dtype: int64
    # 3    4
    # 2    3
    # 1    2
    # dtype: int64
    # 0    1
    # 1    2
    # 2    3
    # 3    4
    # dtype: int64
balance.util.row_pairwise_diffs(df: DataFrame) DataFrame[source]

Produce the differences between every pair of rows of df

Parameters:

df (pd.DataFrame) – DataFrame

Returns:

DataFrame with differences between all combinations of rows

Return type:

pd.DataFrame

Examples

d = pd.DataFrame({"a": (1, 2, 3), "b": (-42, 8, 2)})
row_pairwise_diffs(d)
    #        a   b
    # 0      1 -42
    # 1      2   8
    # 2      3   2
    # 1 - 0  1  50
    # 2 - 0  2  44
    # 2 - 1  1  -6