balance.stats_and_plots.general_stats
- balance.stats_and_plots.general_stats.relative_response_rates(df: DataFrame, df_target: DataFrame | None = None, per_column: bool = True) -> DataFrame
Produces a summary table of number of responses and proportion of completed responses.
- Parameters:
df (pd.DataFrame) – A DataFrame to calculate aggregated response rates for.
df_target (Optional[pd.DataFrame], optional) – Defaults to None. Determines the denominator of the response rate. If None, the denominator is the number of rows in df. If a DataFrame is provided, df is assumed to be a subset of df_target, and the response rate is the number of notnull values in df divided by the number of notnull values in df_target.
per_column (bool, optional) –
Defaults to True. This argument is relevant only if df_target is not None (i.e., when comparing df to some df_target). If per_column is True, the relative response rate of each column in df is calculated by comparing it to the same column in df_target. In this case, the columns in df and df_target must be identical.
If per_column is False, each column in df is compared to the overall number of rows in df_target that contain no null values.
- Returns:
- A DataFrame with one column per column in the original df, and two rows:
the first row (n) holds the number of non-null observations, and the second row (%) holds the proportion of non-null observations, expressed as a percentage.
- Return type:
pd.DataFrame
Examples
    import numpy as np
    import pandas as pd
    from balance.stats_and_plots.general_stats import relative_response_rates

    df = pd.DataFrame({"o1": (7, 8, 9, 10), "o2": (7, 8, 9, np.nan), "id": (1, 2, 3, 4)})

    # No target: the denominator is the number of rows in df
    relative_response_rates(df).to_dict()
    # {'o1': {'n': 4.0, '%': 100.0},
    #  'o2': {'n': 3.0, '%': 75.0},
    #  'id': {'n': 4.0, '%': 100.0}}

    # With a target (per_column=True): each column is divided by the
    # notnull count of the same column in df_target
    df_target = pd.concat([df, df])
    relative_response_rates(df, df_target).to_dict()
    # {'o1': {'n': 4.0, '%': 50.0},
    #  'o2': {'n': 3.0, '%': 50.0},
    #  'id': {'n': 4.0, '%': 50.0}}

    # Dividing by the total number of notnull rows in df_target
    df_target.notnull().all(axis=1).sum()  # == 6
    relative_response_rates(df, df_target, False).to_dict()
    # {'o1': {'n': 4.0, '%': 66.66666666666666},
    #  'o2': {'n': 3.0, '%': 50.0},
    #  'id': {'n': 4.0, '%': 66.66666666666666}}
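The three denominators described under Parameters can be re-derived with plain pandas, which may help when checking results by hand. The following is a minimal sketch under stated assumptions: sketch_response_rates is a hypothetical helper written for illustration, not the implementation in balance, and it skips any input validation the library may perform.

```python
from typing import Optional

import numpy as np
import pandas as pd


def sketch_response_rates(
    df: pd.DataFrame,
    df_target: Optional[pd.DataFrame] = None,
    per_column: bool = True,
) -> pd.DataFrame:
    """Hypothetical re-derivation of the documented behavior (not balance's code)."""
    # Numerator: number of non-null values in each column of df.
    n = df.notnull().sum()
    if df_target is None:
        # No target: divide by the number of rows in df.
        denominator = len(df)
    elif per_column:
        # Per-column: divide by the non-null count of the same column in df_target.
        denominator = df_target[df.columns].notnull().sum()
    else:
        # Overall: divide by the number of df_target rows with no nulls at all.
        denominator = df_target.notnull().all(axis=1).sum()
    # Two rows ("n" and "%"), one column per column of df.
    return pd.DataFrame({"n": n.astype(float), "%": 100.0 * n / denominator}).T


df = pd.DataFrame({"o1": (7, 8, 9, 10), "o2": (7, 8, 9, np.nan), "id": (1, 2, 3, 4)})
df_target = pd.concat([df, df])

rates_self = sketch_response_rates(df)
rates_per_column = sketch_response_rates(df, df_target)
rates_overall = sketch_response_rates(df, df_target, per_column=False)
```

On the example data, the sketch gives the same n and % values as the documented outputs above, which suggests the parameter descriptions and the example are consistent with each other.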