balance.stats_and_plots.weighted_comparisons_plots

balance.stats_and_plots.weighted_comparisons_plots.naming_legend(object_name: str, names_of_dfs: List[str]) → str

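Returns the legend name to use for a plotted object, given the names of all dfs being plotted. If one of the dfs is “unadjusted”, the Sample object has been adjusted, so “self” is labeled “adjusted” and “unadjusted” is labeled “sample”. Otherwise, “self” is the original sample and is labeled “sample”. Any other name is returned unchanged.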

Parameters:
  • object_name (str) – the name of the object to plot.

  • names_of_dfs (List[str]) – the names of the other dfs to plot.

Returns:

a string with the desired name

Return type:

str

Examples

naming_legend('self', ['self', 'target', 'unadjusted']) #'adjusted'
naming_legend('unadjusted', ['self', 'target', 'unadjusted']) #'sample'
naming_legend('self', ['self', 'target']) #'sample'
naming_legend('other_name', ['self', 'target']) #'other_name'
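The naming rule can be restated as a small stand-alone function. This is only a sketch inferred from the description and the four examples above; `naming_legend_sketch` is a hypothetical name and not the library's implementation.

```python
from typing import List


def naming_legend_sketch(object_name: str, names_of_dfs: List[str]) -> str:
    """Hypothetical restatement of the legend-naming rule (not the library code)."""
    has_unadjusted = "unadjusted" in names_of_dfs
    if object_name == "self":
        # With an "unadjusted" df present, "self" holds the adjusted sample.
        return "adjusted" if has_unadjusted else "sample"
    if object_name == "unadjusted":
        # The unadjusted df is the original sample.
        return "sample"
    # Any other name (e.g. "target") is used as-is.
    return object_name
```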
balance.stats_and_plots.weighted_comparisons_plots.plot_bar(dfs: List[Dict[str, DataFrame | Series]], names: List[str], column: str, axis: Axes | None = None, weighted: bool = True, title: str | None = None, ylim: Tuple[float, float] | None = None) → None

Shows a (weighted) sns.barplot using a relative frequency table of several DataFrames (with optional control over the y-axis limits).

If weighted is True, then mutual NA values are removed using rm_mutual_nas().

Parameters:
  • dfs (List[Dict[str, Union[pd.DataFrame, pd.Series]]]) –

    a list (of length 1 or more) of dictionaries which describe the DataFrames and weights. The structure is as follows:

    [{"df": pd.DataFrame(…), "weight": pd.Series(…)}, …]

    The ‘df’ is a DataFrame which includes the column name that was supplied through ‘column’. The “weight” is a pd.Series of weights that are used when aggregating the variable using relative_frequency_table().

  • names (List[str]) – a list of the names of the DataFrames that are plotted. E.g.: [‘adjusted’, ‘unadjusted’, ‘target’]

  • column (str) – The column to be used to aggregate using relative_frequency_table().

  • axis (Optional[plt.Axes], optional) – matplotlib Axes object to draw the plot onto, otherwise uses the current Axes. Defaults to None.

  • weighted (bool, optional) – Whether to use the weights from the dicts inside dfs. Defaults to True.

  • title (str, optional) – Title of the plot. Defaults to “barplot of covar ‘{column}’”.

  • ylim (Optional[Tuple[float, float]], optional) – A tuple with two float values representing the lower and upper limits of the y-axis. If not provided, the y-axis range is determined automatically. Defaults to None.

Examples

from balance.stats_and_plots.weighted_comparisons_plots import plot_bar
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'group': ('a', 'b', 'c', 'c'),
    'v1': (1, 2, 3, 4),
})

plot_bar(
    [{"df": df, "weight": pd.Series((1, 1, 1, 1))}, {"df": df, "weight": pd.Series((2, 1, 1, 1))}],
    names = ["self", "target"],
    column = "group",
    axis = None,
    weighted = True)

# The same as above just with ylim set to (0, 1).
plot_bar(
    [{"df": df, "weight": pd.Series((1, 1, 1, 1))}, {"df": df, "weight": pd.Series((2, 1, 1, 1))}],
    names = ["self", "target"],
    column = "group",
    axis = None,
    weighted = True,
    ylim = (0, 1))

# Also deals with np.nan weights
a = plot_bar(
    [{"df": df, "weight": pd.Series((1, 1, 1, np.nan))}, {"df": df, "weight": pd.Series((2, 1, 1, np.nan))}],
    names = ["self", "target"],
    column = "group",
    axis = None,
    weighted = True)
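The quantity behind each bar is a weighted relative frequency: the share of total weight that falls in each category. The following plain-Python sketch shows that arithmetic on the data from the examples above; `weighted_relative_frequency` is a hypothetical helper written for illustration, and the library's relative_frequency_table() may differ in details such as NA handling.

```python
from collections import defaultdict
from typing import Dict, Hashable, Optional, Sequence


def weighted_relative_frequency(
    values: Sequence[Hashable],
    weights: Optional[Sequence[float]] = None,
) -> Dict[Hashable, float]:
    """Hypothetical helper: share of total weight that falls in each category."""
    if weights is None:
        weights = [1.0] * len(values)  # unweighted case: every row counts once
    totals: Dict[Hashable, float] = defaultdict(float)
    for value, weight in zip(values, weights):
        totals[value] += weight
    grand_total = sum(totals.values())
    return {category: total / grand_total for category, total in totals.items()}


group = ("a", "b", "c", "c")
print(weighted_relative_frequency(group, (1, 1, 1, 1)))  # "self" in the first example
print(weighted_relative_frequency(group, (2, 1, 1, 1)))  # "target" in the first example
```

With the "target" weights, category 'a' carries 2 of the 5 weight units, so its bar sits at 0.4 rather than at the unweighted 0.25.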
balance.stats_and_plots.weighted_comparisons_plots.plot_dist(dfs: List[Dict[str, DataFrame | Series]], names: List[str] | None = None, variables: List[str] | None = None, numeric_n_values_threshold: int = 15, weighted: bool = True, dist_type: Literal['qq', 'hist', 'kde', 'ecdf'] | None = None, library: Literal['plotly', 'seaborn'] = 'plotly', ylim: Tuple[float, float] | None = None, **kwargs) → List | ndarray[Any, dtype[_ScalarType_co]] | Dict[str, Figure] | None

Plots the variables of a DataFrame by using either seaborn or plotly.

If using plotly, numeric variables are plotted as kde (or qq) plots and categorical variables as bar plots, via plotly_plot_dist(). If using seaborn, several plot types are available (see dist_type for details), via seaborn_plot_dist().

Parameters:
  • dfs (List[Dict[str, Union[pd.DataFrame, pd.Series]]]) –

    a list (of length 1 or more) of dictionaries which describe the DataFrames and weights. The structure is as follows:

    [{"df": pd.DataFrame(…), "weight": pd.Series(…)}, …]

    The ‘df’ is a DataFrame which includes the column name that was supplied through ‘column’. The “weight” is a pd.Series of weights that are used when aggregating the variable using relative_frequency_table().

  • names (List[str]) – a list of the names of the DataFrames that are plotted. E.g.: [‘adjusted’, ‘unadjusted’, ‘target’]. If None, then all DataFrames will be plotted, but only if library == “seaborn”. (TODO: to remove this restriction)

  • variables (Optional[List[str]], optional) – a list of variables to use for plotting. Default (i.e.: if None) is to use the list of all variables.

  • numeric_n_values_threshold (int, optional) – The maximum number of unique values a numeric column may have and still be treated as a “category”. Defaults to 15.

  • weighted (bool, optional) – Whether to use the weights with the plots. Defaults to True.

  • dist_type (Literal["kde", "hist", "qq", "ecdf"], optional) – The type of plot to draw. The ‘qq’ and ‘kde’ options are available for library="plotly", while all options are available with library="seaborn". Defaults to None, which falls back to “kde”.

  • library (Literal["plotly", "seaborn"], optional) – Whichever library to use for the plot. Defaults to “plotly”.

  • ylim (Optional[Tuple[float, float]], optional) – A tuple with two float values representing the lower and upper limits of the y-axis. If not provided, the y-axis range is determined automatically. Defaults to None. Passed to bar plots only.

  • **kwargs – Additional keyword arguments to pass to plotly_plot_dist or seaborn_plot_dist.

Raises:

ValueError – if library is not in (“plotly”, “seaborn”).

Returns:

If library="plotly", returns a dictionary containing plots if return_dict_of_figures is True, and None otherwise. If library="seaborn", returns None unless return_axes is True, in which case it returns either a list or an np.array of matplotlib Axes.

Return type:

Union[Union[List, np.ndarray], Dict[str, go.Figure], None]

Examples

import numpy as np
import pandas as pd
from numpy import random
from balance.stats_and_plots.weighted_comparisons_plots import plot_dist

random.seed(96483)

df = pd.DataFrame({
    'v1': random.randint(11111, 11115, size=100).astype(str),  # integers 11111..11114 (upper bound is exclusive)
    'v2': random.normal(size = 100),
    'v3': random.uniform(size = 100),
}).sort_values(by=['v2'])

dfs1 = [
    {"df": df, "weight": pd.Series(random.random(size = 100) + 0.5)},
    {"df": df, "weight": pd.Series(np.ones(99).tolist() + [1000])},
    {"df": df, "weight": pd.Series(np.ones(100))},
]

# defaults to plotly with bar and kde plots. Returns None.
plot_dist(dfs1, names=["self", "unadjusted", "target"])

# Using seaborn, defaults to kde plots
plot_dist(dfs1, names=["self", "unadjusted", "target"], library="seaborn") # like using dist_type = "kde"
plot_dist(dfs1, names=["self", "unadjusted", "target"], library="seaborn", dist_type = "hist")
plot_dist(dfs1, names=["self", "unadjusted", "target"], library="seaborn", dist_type = "qq")
plot_dist(dfs1, names=["self", "unadjusted", "target"], library="seaborn", dist_type = "ecdf")

plot_dist(dfs1, names=["self", "unadjusted", "target"], ylim = (0,1))
plot_dist(dfs1, names=["self", "unadjusted", "target"], library="seaborn", dist_type = "qq", ylim = (0,1))
balance.stats_and_plots.weighted_comparisons_plots.plot_hist_kde(dfs: List[Dict[str, DataFrame | Series]], names: List[str], column: str, axis: Axes | None = None, weighted: bool = True, dist_type: Literal['hist', 'kde', 'ecdf'] = 'hist', title: str | None = None) → None

Shows a (weighted) distribution plot (sns.displot()) of data from several DataFrame objects.

Options include histogram (hist), kernel density estimate (kde), and empirical cumulative distribution function (ecdf).

Parameters:
  • dfs (List[Dict[str, Union[pd.DataFrame, pd.Series]]]) –

    a list (of length 1 or more) of dictionaries which describe the DataFrames and weights. The structure is as follows:

    [{"df": pd.DataFrame(…), "weight": pd.Series(…)}, …]

    The ‘df’ is a DataFrame which includes the column name that was supplied through ‘column’. The “weight” is a pd.Series of weights that are used when aggregating the variable using relative_frequency_table().

  • names (List[str]) – a list of the names of the DataFrames that are plotted. E.g.: [‘adjusted’, ‘unadjusted’, ‘target’]

  • column (str) – The column to be used to aggregate using relative_frequency_table().

  • axis (Optional[plt.Axes], optional) – matplotlib Axes object to draw the plot onto, otherwise uses the current Axes. Defaults to None.

  • weighted (bool, optional) – Whether to use the weights from the dicts inside dfs. Defaults to True.

  • dist_type (Literal["hist", "kde", "ecdf"], optional) – The type of plot to draw. Defaults to “hist”.

  • title (str, optional) – Title of the plot. Defaults to “distribution plot of covar ‘{column}’”.

Examples

from balance.stats_and_plots.weighted_comparisons_plots import plot_hist_kde
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'group': ('a', 'b', 'c', 'c'),
    'v1': (1, 2, 3, 4),
})

dfs1 = [{"df": pd.DataFrame(pd.Series([1,2,2,2,3,4,5,5,7,8,9,9,9,9,5,2,5,4,4,4], name = "v1")), "weight": None}, {"df": df, "weight": pd.Series((200, 1, 0, 20))}]

plt.figure(1)

# kde: no weights
plot_hist_kde(
    dfs1,
    names = ["self", "target"],
    column = "v1",
    axis = None,
    weighted = False, dist_type = "kde")

plt.figure(2)

# kde: with weights
plot_hist_kde(
    dfs1,
    names = ["self", "target"],
    column = "v1",
    axis = None,
    weighted = True, dist_type = "kde")

plt.figure(3)

# hist
plot_hist_kde(
    dfs1,
    names = ["self", "target"],
    column = "v1",
    axis = None,
    weighted = True, dist_type = "hist")

plt.figure(4)

# ecdf
plot_hist_kde(
    dfs1,
    names = ["self", "target"],
    column = "v1",
    axis = None,
    weighted = True, dist_type = "ecdf")


# can work nicely with plt.subplots:
f, axes = plt.subplots(1, 2, figsize=(7, 7 * 1))
plot_hist_kde(
    dfs1,
    names = ["self", "target"],
    column = "v1",
    axis = axes[0],
    weighted = False, dist_type = "kde")
plot_hist_kde(
    dfs1,
    names = ["self", "target"],
    column = "v1",
    axis = axes[1],
    weighted = False, dist_type = "kde")
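To make the effect of weights on the ecdf option concrete, the sketch below computes a weighted empirical cumulative distribution function by hand: the fraction of total weight carried by values at or below x. `weighted_ecdf` is a hypothetical helper for illustration only; the plotting itself is delegated to seaborn, whose internals are not reproduced here.

```python
from typing import Optional, Sequence


def weighted_ecdf(
    values: Sequence[float],
    weights: Optional[Sequence[float]],
    x: float,
) -> float:
    """Hypothetical helper: fraction of total weight carried by values <= x."""
    if weights is None:
        weights = [1.0] * len(values)  # unweighted case
    total = sum(weights)
    return sum(w for v, w in zip(values, weights) if v <= x) / total


# The second entry of dfs1 above: v1 = (1, 2, 3, 4) with weights (200, 1, 0, 20).
print(weighted_ecdf((1, 2, 3, 4), (200, 1, 0, 20), x=1))
print(weighted_ecdf((1, 2, 3, 4), None, x=1))
```

The weighted call gives 200/221 (about 0.905) because nearly all the weight sits on the value 1, while the unweighted call gives 0.25.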
balance.stats_and_plots.weighted_comparisons_plots.plot_qq(dfs: List[Dict[str, DataFrame | Series]], names: List[str], column: str, axis: Axes | None = None, weighted: bool = True) → None

Plots a qq plot of the weighted data from a DataFrame object against some target.

See: https://en.wikipedia.org/wiki/Q-Q_plot

Parameters:
  • dfs (List[Dict[str, Union[pd.DataFrame, pd.Series]]]) –

    a list (of length 1 or more) of dictionaries which describe the DataFrames and weights. The structure is as follows:

    [{"df": pd.DataFrame(…), "weight": pd.Series(…)}, …]

    The ‘df’ is a DataFrame which includes the column name that was supplied through ‘column’. The “weight” is a pd.Series of weights that are used when aggregating the variable using weighted_quantile(). Uses the last df item in the list as the target.

  • names (List[str]) – a list of the names of the DataFrames that are plotted. E.g.: [‘adjusted’, ‘unadjusted’, ‘target’]

  • column (str) – The column to be used to aggregate using weighted_quantile().

  • axis (Optional[plt.Axes], optional) – matplotlib Axes object to draw the plot onto, otherwise uses the current Axes. Defaults to None.

  • weighted (bool, optional) – Whether to use the weights from the dicts inside dfs. Defaults to True.

Examples

import numpy as np
import pandas as pd
from balance.stats_and_plots.weighted_comparisons_plots import plot_qq
from numpy import random

df = pd.DataFrame({
    'v1': random.uniform(size=100),
}).sort_values(by=['v1'])

dfs1 = [
    {"df": df, "weight": pd.Series(np.ones(100))},
    {"df": df, "weight": pd.Series(range(100))},
    {"df": df, "weight": pd.Series(np.ones(100))},
]

# plot_qq(dfs1, names=["self", "unadjusted", "target"], column="v1", axis=None, weighted=False)
plot_qq(dfs1, names=["self", "unadjusted", "target"], column="v1", axis=None, weighted=True)
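A qq plot pairs quantiles of one distribution with the same quantiles of another, so the key ingredient is a weighted quantile. The sketch below uses one simple definition (the smallest value whose cumulative weight share reaches q). `weighted_quantile_sketch` and `qq_points` are hypothetical names, and the library's weighted_quantile() may use a different interpolation rule.

```python
from typing import List, Sequence, Tuple


def weighted_quantile_sketch(
    values: Sequence[float],
    weights: Sequence[float],
    q: float,
) -> float:
    """Hypothetical helper: smallest value whose cumulative weight share reaches q."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    pairs = sorted(zip(values, weights))
    total = sum(weight for _, weight in pairs)
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative / total >= q:
            return value
    return pairs[-1][0]  # guards against floating-point round-off at q == 1


def qq_points(
    sample: Sequence[float],
    sample_w: Sequence[float],
    target: Sequence[float],
    target_w: Sequence[float],
    probs: Sequence[float],
) -> List[Tuple[float, float]]:
    """One (sample quantile, target quantile) pair per probability."""
    return [
        (
            weighted_quantile_sketch(sample, sample_w, q),
            weighted_quantile_sketch(target, target_w, q),
        )
        for q in probs
    ]


# Same values on both sides, but the sample puts most of its weight on 4.
print(qq_points([1, 2, 3, 4], [1, 1, 1, 5], [1, 2, 3, 4], [1, 1, 1, 1], [0.25, 0.5, 0.75]))
```

Points that fall on the diagonal indicate matching quantiles; here the heavy weight on 4 pulls the sample quantiles away from the target's.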
balance.stats_and_plots.weighted_comparisons_plots.plot_qq_categorical(dfs: List[Dict[str, DataFrame | Series]], names: List[str], column: str, axis: Axes | None = None, weighted: bool = True, label_threshold: int = 30) → None

A scatter plot of weighted relative frequencies of categories from each df.

Notice that this is not a “real” qq-plot, but rather a scatter plot of (estimated, weighted) probabilities for each category.

X-axis is the sample (adjusted, unadjusted) and Y-axis is the target.

Parameters:
  • dfs (List[Dict[str, Union[pd.DataFrame, pd.Series]]]) –

    a list (of length 1 or more) of dictionaries which describe the DataFrames and weights. The structure is as follows:

    [{"df": pd.DataFrame(…), "weight": pd.Series(…)}, …]

    The ‘df’ is a DataFrame which includes the column name that was supplied through ‘column’. The “weight” is a pd.Series of weights that are used when aggregating the variable using weighted_quantile(). Uses the last df item in the list as the target.

  • names (List[str]) – a list of the names of the DataFrames that are plotted. E.g.: [‘adjusted’, ‘unadjusted’, ‘target’]

  • column (str) – The column to be used to aggregate using relative_frequency_table().

  • axis (Optional[plt.Axes], optional) – matplotlib Axes object to draw the plot onto, otherwise uses the current Axes. Defaults to None.

  • weighted (bool, optional) – Whether to use the weights from the dicts inside dfs. Defaults to True.

  • label_threshold (int, optional) – All labels that are larger than the threshold will be omitted from the scatter plot (to reduce clutter). Defaults to 30.

Examples

import numpy as np
import pandas as pd
from balance.stats_and_plots.weighted_comparisons_plots import plot_qq_categorical
from numpy import random

df = pd.DataFrame({
    'v1': random.randint(11111, 11115, size=100),  # integers 11111..11114 (upper bound is exclusive)
}).sort_values(by=['v1'])

dfs1 = [
    {"df": df, "weight": pd.Series(np.ones(100))},
    {"df": df, "weight": pd.Series(np.ones(99).tolist() + [1000])},
    {"df": df, "weight": pd.Series(np.ones(100))},
]

import matplotlib.pyplot as plt

plt.rcParams["figure.figsize"] = (20, 6) # (w, h)

fig, axs = plt.subplots(1,3)

# Without using weights
plot_qq_categorical(dfs1, names=["self", "unadjusted", "target"], column="v1", axis=axs[0], weighted=False)
# With weights
plot_qq_categorical(dfs1, names=["self", "unadjusted", "target"], column="v1", axis=axs[1], weighted=True)
# With label trimming if the text is longer than 3.
plot_qq_categorical(dfs1, names=["self", "unadjusted", "target"], column="v1", axis=axs[2], weighted=True, label_threshold=3)
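Each point in this scatter plot corresponds to one category, with the sample's weighted share on the x-axis and the target's weighted share on the y-axis. The sketch below builds those coordinates in plain Python. Both function names are hypothetical, and treating a category that is missing from one side as having share 0.0 is an assumption made for the illustration.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence, Tuple


def category_shares(values: Sequence[Hashable], weights: Sequence[float]) -> Dict[Hashable, float]:
    """Share of total weight per category."""
    totals: Counter = Counter()
    for value, weight in zip(values, weights):
        totals[value] += weight
    grand_total = sum(totals.values())
    return {category: total / grand_total for category, total in totals.items()}


def qq_categorical_points(
    sample: Sequence[Hashable],
    sample_w: Sequence[float],
    target: Sequence[Hashable],
    target_w: Sequence[float],
) -> Dict[Hashable, Tuple[float, float]]:
    """Map each category to (sample share, target share)."""
    s = category_shares(sample, sample_w)
    t = category_shares(target, target_w)
    # Assumption: a category missing from one side is treated as share 0.0.
    return {c: (s.get(c, 0.0), t.get(c, 0.0)) for c in sorted(set(s) | set(t))}


print(qq_categorical_points([11111, 11112], [1, 3], [11111, 11112], [1, 1]))
```

A category lying on the diagonal has the same weighted share in the sample and in the target.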
balance.stats_and_plots.weighted_comparisons_plots.plotly_plot_bar(dict_of_dfs: Dict[str, DataFrame], variables: List[str], plot_it: bool = True, return_dict_of_figures: bool = False, ylim: Tuple[float, float] | None = None, **kwargs) → Dict[str, Figure] | None

Plots interactive bar plots of the given variables (with optional control over the y-axis limits).

Parameters:
  • dict_of_dfs (Dict[str, pd.DataFrame]) – A dictionary with keys as names of the DataFrame (e.g., ‘self’, ‘unadjusted’, ‘target’), and values as the DataFrames containing the variables to plot.

  • variables (List[str]) – A list of variables to use for plotting.

  • plot_it (bool, optional) – If True, plots the graphs interactively instead of returning a dictionary. Defaults to True.

  • return_dict_of_figures (bool, optional) – If True, returns the dictionary containing the plots rather than just returning None. Defaults to False.

  • ylim (Optional[Tuple[float, float]], optional) – A tuple with two float values representing the lower and upper limits of the y-axis. If not provided, the y-axis range is determined automatically. Defaults to None.

  • **kwargs – Additional keyword arguments to pass to the update_layout method of the plotly figure object. (e.g.: width and height are 700 and 450, and could be set using the kwargs).

Returns:

Dictionary containing plots if return_dict_of_figures is True. None otherwise.

Return type:

Optional[Dict[str, go.Figure]]

Examples

import numpy as np
import pandas as pd
from numpy import random
from balance.stats_and_plots.weighted_comparisons_plots import plotly_plot_bar

random.seed(96483)

df = pd.DataFrame({
    'v1': random.randint(11111, 11115, size=100).astype(str),  # integers 11111..11114 (upper bound is exclusive)
    'v2': random.normal(size = 100),
    'v3': random.uniform(size = 100),
}).sort_values(by=['v2'])

dict_of_dfs = {
    "self": pd.concat([df, pd.Series(random.random(size = 100) + 0.5, name = "weight")], axis = 1),
    "unadjusted": pd.concat([df, pd.Series(np.ones(99).tolist() + [1000], name = "weight")], axis = 1),
    "target": pd.concat([df, pd.Series(np.ones(100), name = "weight")], axis = 1),
}

# It can work with "v2" and "v3", but it would be very sparse
plotly_plot_bar(dict_of_dfs, variables= ["v1"])

# Plots the same as above, but this time the range of the yaxis is from 0 to 1.
plotly_plot_bar(dict_of_dfs, variables= ["v1"], ylim = (0,1))
balance.stats_and_plots.weighted_comparisons_plots.plotly_plot_density(dict_of_dfs: Dict[str, DataFrame], variables: List[str], plot_it: bool = True, return_dict_of_figures: bool = False, plot_width: int = 800, **kwargs) → Dict[str, Figure] | None

Plots interactive density plots of the given variables using kernel density estimation.

Creates a plotly plot of the kernel density estimate for each variable in the given list across multiple DataFrames. The function assumes there is a DataFrame with the key ‘target’. The density plot shows the distribution of the variable for each DataFrame in the dictionary. It looks for a weights column and uses it to normalize the data. If no weight column is found, it assumes all weights are equal to 1. It relies on the seaborn library to create the KDE (sns.kdeplot).

Parameters:
  • dict_of_dfs (Dict[str, pd.DataFrame]) – A dictionary where each key is a name for the DataFrame and the value is the DataFrame that contains the variables to plot.

  • variables (List[str]) – A list of variables to plot.

  • plot_it (bool, optional) – Whether to plot the figures interactively using plotly. Defaults to True.

  • return_dict_of_figures (bool, optional) – Whether to return a dictionary of plotly figures. Defaults to False.

  • plot_width (int, optional) – The width of the plot in pixels. Defaults to 800.

  • **kwargs – Additional keyword arguments to pass to the update_layout method of the plotly figure object. (e.g.: width and height are 700 and 450, and could be set using the kwargs).

Returns:

A dictionary containing plotly figures for each variable in the given list if return_dict_of_figures is True. Otherwise, returns None.

Return type:

Optional[Dict[str, go.Figure]]

Examples

import numpy as np
import pandas as pd
from numpy import random
from balance.stats_and_plots.weighted_comparisons_plots import plotly_plot_density, plot_dist

random.seed(96483)

df = pd.DataFrame({
    'v1': random.randint(11111, 11115, size=100).astype(str),  # integers 11111..11114 (upper bound is exclusive)
    'v2': random.normal(size = 100),
    'v3': random.uniform(size = 100),
}).sort_values(by=['v2'])

dict_of_dfs = {
    "self": pd.concat([df, pd.Series(random.random(size = 100) + 0.5, name = "weight")], axis = 1),
    "unadjusted": pd.concat([df, pd.Series(np.ones(99).tolist() + [1000], name = "weight")], axis = 1),
    "target": pd.concat([df, pd.Series(np.ones(100), name = "weight")], axis = 1),
}

# It won't work with "v1" since it is not numeric.
plotly_plot_density(dict_of_dfs, variables= ["v2", "v3"], plot_width = 550)

# The above gives the same results as:
dfs1 = [
    {"df": df, "weight": dict_of_dfs['self']["weight"]},
    {"df": df, "weight": dict_of_dfs['unadjusted']["weight"]},
    {"df": df, "weight": dict_of_dfs['target']["weight"]},
]
plot_dist(dfs1, names=["self", "unadjusted", "target"], library="seaborn", dist_type = "kde", variables= ["v2", "v3"])

# This gives the same shape of plots (notice how we must have the column "weight" for the plots to work)
df = pd.DataFrame({
    'group': ('a', 'b', 'c', 'c'),
    'v1': (1, 2, 3, 4),
})

dfs1 = [{"df": pd.DataFrame(pd.Series([1,2,2,2,3,4,5,5,7,8,9,9,9,9,5,2,5,4,4,4], name = "v1")), "weight": None}, {"df": df, "weight": pd.Series((200, 1, 0, 200000))}]

dict_of_dfs = {
    "self": dfs1[0]['df'],  # pd.concat([df, pd.Series(random.random(size = 100) + 0.5, name = "weight")], axis = 1),
    "target": pd.concat([dfs1[1]['df'], pd.Series(dfs1[1]["weight"], name = "weight")], axis = 1),
}

plotly_plot_density(dict_of_dfs, variables= ["v1"], plot_width = 550)

plot_dist(dfs1, names=["self", "target"], library="seaborn", dist_type = "kde", variables= ["v1"],numeric_n_values_threshold = 1)
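To illustrate how weights enter a kernel density estimate, here is a minimal weighted Gaussian KDE in plain Python. It is a sketch of the general technique only: `weighted_gaussian_kde` is a hypothetical helper, the bandwidth is passed explicitly as an assumption (seaborn selects one automatically), and the code does not reproduce what sns.kdeplot does internally.

```python
import math
from typing import Optional, Sequence


def weighted_gaussian_kde(
    values: Sequence[float],
    weights: Optional[Sequence[float]],
    x: float,
    bandwidth: float,
) -> float:
    """Hypothetical helper: weighted Gaussian kernel density estimate at x."""
    if weights is None:
        weights = [1.0] * len(values)  # no weights given: every row counts equally
    total = sum(weights)
    norm = bandwidth * math.sqrt(2.0 * math.pi)
    density = 0.0
    for value, weight in zip(values, weights):
        z = (x - value) / bandwidth
        # Each observation contributes a Gaussian bump scaled by its weight share.
        density += (weight / total) * math.exp(-0.5 * z * z) / norm
    return density


# A zero-weight observation does not affect the estimate.
print(weighted_gaussian_kde([0.0, 10.0], [1.0, 0.0], x=0.0, bandwidth=1.0))
print(weighted_gaussian_kde([0.0], None, x=0.0, bandwidth=1.0))
```

Because the weights are divided by their sum, the estimate still integrates to 1 regardless of the scale of the weights.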

balance.stats_and_plots.weighted_comparisons_plots.plotly_plot_dist(dict_of_dfs: Dict[str, DataFrame], variables: List[str] | None = None, numeric_n_values_threshold: int = 15, weighted: bool = True, dist_type: Literal['kde', 'qq'] | None = None, plot_it: bool = True, return_dict_of_figures: bool = False, ylim: Tuple[float, float] | None = None, **kwargs) → Dict[str, Figure] | None

Plots interactive distribution plots (kde or qq plots for numeric variables, bar plots for categorical variables).

The plots compare the weighted distributions of an arbitrary number of variables from an arbitrary number of DataFrames. Numeric variables are plotted either as qq plots using plotly_plot_qq(), or as kde density plots using plotly_plot_density(). Categorical variables are plotted as bar plots using plotly_plot_bar().

Parameters:
  • dict_of_dfs (Dict[str, pd.DataFrame]) – The key is the name of the DataFrame (E.g.: self, unadjusted, target), and the value is the DataFrame that contains the variables that we want to plot.

  • variables (Optional[List[str]], optional) – a list of variables to use for plotting. Default (i.e.: if None) is to use the list of all variables.

  • numeric_n_values_threshold (int, optional) – The maximum number of unique values a numeric column may have and still be treated as a “category”. Defaults to 15.

  • weighted (bool, optional) – Whether to use the weights with the plots. Defaults to True.

  • dist_type (Optional[Literal["kde", "qq"]], optional) – The type of plot to draw (relevant only for numerical variables). Defaults to None (which falls back to “kde”).

  • plot_it (bool, optional) – Whether to plot the plots interactively instead of returning a dictionary. Defaults to True.

  • return_dict_of_figures (bool, optional) –

    Whether to return the dictionary containing the plots rather than just returning None. Defaults to False. If returned, the dictionary maps each variable name to its plotly figure. A figure can be displayed with:

    offline.iplot(dict_of_all_plots[‘age’])

    Or simply:

    dict_of_all_plots[‘age’]

  • ylim (Optional[Tuple[float, float]], optional) – A tuple with two float values representing the lower and upper limits of the y-axis. If not provided, the y-axis range is determined automatically. Defaults to None. Passed to bar plots only.

  • **kwargs – Additional keyword arguments to pass to the update_layout method of the plotly figure object. (e.g.: width and height are 700 and 450, and could be set using the kwargs).

Returns:

Dictionary containing plots if return_dict_of_figures is True. None otherwise.

Return type:

Optional[Dict[str, go.Figure]]

Examples

import numpy as np
import pandas as pd
from numpy import random
from balance.stats_and_plots.weighted_comparisons_plots import plotly_plot_dist

random.seed(96483)

df = pd.DataFrame({
    'v1': random.randint(11111, 11115, size=100).astype(str),  # integers 11111..11114 (upper bound is exclusive)
    'v2': random.normal(size = 100),
    'v3': random.uniform(size = 100),
}).sort_values(by=['v2'])

dict_of_dfs = {
    "self": pd.concat([df, pd.Series(random.random(size = 100) + 0.5, name = "weight")], axis = 1),
    "unadjusted": pd.concat([df, pd.Series(np.ones(99).tolist() + [1000], name = "weight")], axis = 1),
    "target": pd.concat([df, pd.Series(np.ones(100), name = "weight")], axis = 1),
}

plotly_plot_dist(dict_of_dfs)

# Make sure the bar plot is plotted with y in the range of 0 to 1.
plotly_plot_dist(dict_of_dfs, ylim = (0,1))

# See the qqplots version
plotly_plot_dist(dict_of_dfs, dist_type="qq")
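The per-variable dispatch described for this function (bar plots for categorical variables; kde or qq plots for numeric ones, with None falling back to kde) can be summarized in a few lines. `choose_plotly_plot` is a hypothetical helper that only names the function that would be used, and raising ValueError for an unknown dist_type is an assumption rather than documented behavior.

```python
from typing import Literal, Optional


def choose_plotly_plot(
    is_categorical: bool,
    dist_type: Optional[Literal["kde", "qq"]] = None,
) -> str:
    """Hypothetical helper: name the plotting function used for one variable."""
    if is_categorical:
        return "plotly_plot_bar"  # dist_type is only relevant for numeric variables
    if dist_type is None:
        dist_type = "kde"  # documented fallback
    if dist_type == "kde":
        return "plotly_plot_density"
    if dist_type == "qq":
        return "plotly_plot_qq"
    # Assumption: unknown values are rejected.
    raise ValueError(f"Unsupported dist_type: {dist_type!r}")


print(choose_plotly_plot(is_categorical=False))
print(choose_plotly_plot(is_categorical=False, dist_type="qq"))
print(choose_plotly_plot(is_categorical=True, dist_type="qq"))
```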
balance.stats_and_plots.weighted_comparisons_plots.plotly_plot_qq(dict_of_dfs: Dict[str, DataFrame], variables: List[str], plot_it: bool = True, return_dict_of_figures: bool = False, **kwargs) → Dict[str, Figure] | None

Plots interactive QQ plots of the given variables.

Creates a plotly qq plot of the given variables from multiple DataFrames. The function assumes there is a DataFrame with the key ‘target’.

Parameters:
  • dict_of_dfs (Dict[str, pd.DataFrame]) – The key is the name of the DataFrame (E.g.: self, unadjusted, target), and the value is the DataFrame that contains the variables that we want to plot.

  • variables (List[str]) – a list of variables to use for plotting.

  • plot_it (bool, optional) – Whether to plot the plots interactively instead of returning a dictionary. Defaults to True.

  • return_dict_of_figures (bool, optional) – Whether to return the dictionary containing the plots rather than just returning None. Defaults to False.

  • **kwargs – Additional keyword arguments to pass to the update_layout method of the plotly figure object. (e.g.: width and height are 700 and 450, and could be set using the kwargs).

Returns:

Dictionary containing plots if return_dict_of_figures is True. None otherwise.

Return type:

Optional[Dict[str, go.Figure]]

Examples

import numpy as np
import pandas as pd
from numpy import random
from balance.stats_and_plots.weighted_comparisons_plots import plotly_plot_qq

random.seed(96483)

df = pd.DataFrame({
    'v1': random.randint(11111, 11115, size=100).astype(str),  # integers 11111..11114 (upper bound is exclusive)
    'v2': random.normal(size = 100),
    'v3': random.uniform(size = 100),
}).sort_values(by=['v2'])

dict_of_dfs = {
    "self": pd.concat([df, pd.Series(random.random(size = 100) + 0.5, name = "weight")], axis = 1),
    "unadjusted": pd.concat([df, pd.Series(np.ones(99).tolist() + [1000], name = "weight")], axis = 1),
    "target": pd.concat([df, pd.Series(np.ones(100), name = "weight")], axis = 1),
}

# It won't work with "v1" since it is not numeric.
plotly_plot_qq(dict_of_dfs, variables= ["v2", "v3"])
balance.stats_and_plots.weighted_comparisons_plots.seaborn_plot_dist(dfs: List[Dict[str, DataFrame | Series]], names: List[str] | None = None, variables: List | None = None, numeric_n_values_threshold: int = 15, weighted: bool = True, dist_type: Literal['qq', 'hist', 'kde', 'ecdf'] | None = None, return_axes: bool = False, ylim: Tuple[float, float] | None = None) → List[Axes] | ndarray[Any, dtype[_ScalarType_co]] | None

Plots to compare the weighted distributions of an arbitrary number of variables from an arbitrary number of DataFrames.

Uses: plot_qq_categorical(), plot_qq(), plot_hist_kde(), plot_bar().

Parameters:
  • dfs (List[Dict[str, Union[pd.DataFrame, pd.Series]]]) –

    a list (of length 1 or more) of dictionaries which describe the DataFrames and weights. The structure is as follows:

    [{"df": pd.DataFrame(…), "weight": pd.Series(…)}, …]

    The ‘df’ is a DataFrame which includes the column name that was supplied through ‘column’. The “weight” is a pd.Series of weights that are used when aggregating by the column variable.

  • names (List[str]) – a list of the names of the DataFrames that are plotted. E.g.: [‘adjusted’, ‘unadjusted’, ‘target’]

  • variables (Optional[List], optional) – The list of variables to use, by default (None) will plot all of them.

  • numeric_n_values_threshold (int, optional) – How many unique values (or less) should be in a column so that it is considered to be a “category”? Defaults to 15. This is compared against the maximum number of distinct values (for each of the variables) across all DataFrames. Setting this value to 0 will disable this check.

  • weighted (bool, optional) – Whether to use the weights from the dicts inside dfs. Defaults to True.

  • dist_type (Optional[str], optional) – can be “hist”, “kde”, “qq”, or “ecdf”. Defaults to None.

  • return_axes (bool, optional) – Whether to return the axes instead of None. Defaults to False.

  • ylim (Optional[Tuple[float, float]], optional) – A tuple with two float values representing the lower and upper limits of the y-axis. If not provided, the y-axis range is determined automatically. Defaults to None. Passed only for categorical variables and when dist_type is not ‘qq’ (i.e.: for bar plots).

Returns:

Returns None. However, if return_axes is True then either it returns a list or an np.array of matplotlib AxesSubplot (plt.Subplot). NOTE: There is no AxesSubplot class until one is invoked and created on the fly.

Return type:

Union[List[plt.Axes], np.ndarray, None]

Examples

import numpy as np
import pandas as pd
from balance.stats_and_plots.weighted_comparisons_plots import seaborn_plot_dist
from numpy import random

df = pd.DataFrame({
    'v1': random.randint(11111, 11115, size=100).astype(str),  # integers 11111..11114 (upper bound is exclusive)
    'v2': random.normal(size = 100),
    'v3': random.uniform(size = 100),
}).sort_values(by=['v2'])

dfs1 = [
    {"df": df, "weight": pd.Series(np.ones(100))},
    {"df": df, "weight": pd.Series(np.ones(99).tolist() + [1000])},
    {"df": df, "weight": pd.Series(np.random.uniform(size=100))},
]

seaborn_plot_dist(dfs1, names=["self", "unadjusted", "target"], dist_type = "qq")  # default
seaborn_plot_dist(dfs1, names=["self", "unadjusted", "target"], dist_type = "hist")
seaborn_plot_dist(dfs1, names=["self", "unadjusted", "target"], dist_type = "kde")
seaborn_plot_dist(dfs1, names=["self", "unadjusted", "target"], dist_type = "ecdf")

# With limiting the y axis range to (0,1)
seaborn_plot_dist(dfs1, names=["self", "unadjusted", "target"], dist_type = "kde", ylim = (0,1))
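The role of numeric_n_values_threshold can be sketched as a small decision function over the same variable taken from each DataFrame. `is_categorical_sketch` is a hypothetical helper; treating any non-numeric value as categorical and using "less than or equal to" at the boundary are assumptions based on the parameter description, not a copy of the library logic.

```python
from typing import Hashable, Sequence


def is_categorical_sketch(
    columns: Sequence[Sequence[Hashable]],
    numeric_n_values_threshold: int = 15,
) -> bool:
    """Hypothetical helper: decide whether one variable is plotted as categorical.

    `columns` holds the same variable taken from each DataFrame.
    """
    all_values = [v for column in columns for v in column]
    # Assumption: any non-numeric value makes the variable categorical.
    if any(not isinstance(v, (int, float)) for v in all_values):
        return True
    if numeric_n_values_threshold <= 0:
        return False  # a threshold of 0 disables the distinct-values check
    # Compare against the largest number of distinct values across all DataFrames.
    max_distinct = max(len(set(column)) for column in columns)
    return max_distinct <= numeric_n_values_threshold


print(is_categorical_sketch([[1, 2, 3, 1], [2, 2]]))                                 # few distinct values
print(is_categorical_sketch([[1, 2, 3, 1], [2, 2]], numeric_n_values_threshold=0))   # check disabled
print(is_categorical_sketch([["11111", "11112"], ["11113"]]))                        # strings
```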
balance.stats_and_plots.weighted_comparisons_plots.set_xy_axes_to_use_the_same_lim(ax: Axes) → None

Set the x and y axes limits to be the same.

Done by taking the min and max from xlim and ylim and using these global min/max on both x and y axes.

Parameters:

ax (plt.Axes) – matplotlib Axes object to draw the plot onto.

Examples

import matplotlib.pyplot as plt
from balance.stats_and_plots.weighted_comparisons_plots import set_xy_axes_to_use_the_same_lim
plt.figure(1)
plt.scatter(x= [1,2,3], y = [3,4,5])

plt.figure(2)
fig, ax = plt.subplots(1, 1, figsize=(7.2, 7.2))
plt.scatter(x= [1,2,3], y = [3,4,5])
set_xy_axes_to_use_the_same_lim(ax)
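The min/max rule described above amounts to the following arithmetic on the two current limit pairs. `shared_limits` is a hypothetical helper shown for illustration; the commented lines indicate how the result would typically be applied to a matplotlib Axes.

```python
from typing import Tuple


def shared_limits(
    xlim: Tuple[float, float],
    ylim: Tuple[float, float],
) -> Tuple[float, float]:
    """Hypothetical helper: one (min, max) pair covering both axes."""
    return (min(xlim[0], ylim[0]), max(xlim[1], ylim[1]))


# With matplotlib this would be applied roughly as:
#   lims = shared_limits(ax.get_xlim(), ax.get_ylim())
#   ax.set_xlim(lims)
#   ax.set_ylim(lims)
print(shared_limits((0.9, 3.1), (2.9, 5.1)))
```

With equal limits on both axes, the diagonal of a qq-style plot runs corner to corner, which makes deviations from it easier to judge.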