balance.sample_frame

SampleFrame: an explicit-role DataFrame container for the Balance library.

Stores covariate, weight, outcome, predicted-outcome, and ignored columns with explicit role metadata, replacing the inference-by-exclusion pattern used in the legacy Sample class.

class balance.sample_frame.SampleFrame

A DataFrame container with explicit column-role metadata.

SampleFrame stores data as a single internal pd.DataFrame, with explicit metadata tracking which columns belong to which role: covars (X), weights (W), outcomes (Y), predicted_outcomes (Y_hat), and ignored.

Must be constructed via SampleFrame.from_frame() or SampleFrame.from_sample().

Mutability:

SampleFrame is mostly immutable at the data level. The underlying DataFrame and column-role assignments are set at construction time and are not replaced afterwards. All data-access properties (e.g. df_covars, df_weights) return copies, so callers cannot mutate internal state through the returned objects.

Controlled mutation points (methods that intentionally modify the instance in-place):

  • set_active_weight() — changes which weight column is active.

  • add_weight_column() — appends a new weight column to the frame.

  • set_weight_metadata() — updates weight provenance metadata.

These mutations are intentional and expected as part of normal usage (e.g. after calling BalanceFrame.adjust()). Outside of these methods the object behaves as immutable.
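
The copy-on-read pattern with a small set of mutation methods can be sketched in plain pandas. This is a minimal illustration only: the class name RoleFrame and its attributes are hypothetical and are not the balance implementation.

```python
import pandas as pd


class RoleFrame:
    """Hypothetical stand-in for the pattern; not the balance implementation."""

    def __init__(self, df: pd.DataFrame, covar_columns: list[str], weight_column: str):
        self._df = df.copy()  # private copy, never handed out directly
        self._covar_columns = list(covar_columns)
        self._weight_columns = [weight_column]
        self._active_weight = weight_column

    @property
    def df_covars(self) -> pd.DataFrame:
        # Accessors return copies, so edits to the result stay local to the caller.
        return self._df[self._covar_columns].copy()

    def add_weight_column(self, name: str, values: pd.Series) -> None:
        # Controlled mutation point: appends a column and registers it as a weight.
        if name in self._df.columns:
            raise ValueError(f"column {name!r} already exists")
        self._df[name] = values.to_numpy()
        self._weight_columns.append(name)

    def set_active_weight(self, name: str) -> None:
        # Controlled mutation point: switches the active-weight pointer.
        if name not in self._weight_columns:
            raise ValueError(f"{name!r} is not a registered weight column")
        self._active_weight = name


frame = RoleFrame(
    pd.DataFrame({"x": [10, 20], "weight": [1.0, 2.0]}),
    covar_columns=["x"],
    weight_column="weight",
)
view = frame.df_covars
view["x"] = [999, 999]  # mutates only the returned copy
frame.add_weight_column("w_adj", pd.Series([1.5, 1.5]))
frame.set_active_weight("w_adj")
```

In this sketch the caller's edit to view leaves the internal covariates untouched, while the two explicit methods are the only paths that change the object's state.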

add_weight_column(name: str, values: Series, metadata: dict[str, Any] | None = None) -> None

Add a new weight column to the SampleFrame.

The column is appended to the internal DataFrame and registered as a weight column. Optionally associates provenance metadata.

Parameters:
  • name (str) – Name for the new weight column.

  • values (pd.Series) – Weight values. Must match the DataFrame length, unless it is a shorter pd.Series — in which case values are aligned by index and missing rows are filled with NaN (this supports adjustment functions that drop rows internally, e.g., na_action="drop"). Note: this column is a history column, not the active weight — the active weight is set separately via set_weights().

  • metadata (dict, optional) – Provenance metadata for the new column.

Raises:

ValueError – If name is already a registered weight column, if name already exists in the DataFrame as a non-weight column, or if values is longer than the DataFrame or is a non-Series with a different length.

Examples

>>> import pandas as pd
>>> from balance.sample_frame import SampleFrame
>>> df = pd.DataFrame({"id": ["1", "2"], "x": [10, 20],
...                    "weight": [1.0, 2.0]})
>>> sf = SampleFrame.from_frame(df)
>>> sf.add_weight_column("w_adj", pd.Series([1.5, 1.5]),
...                      metadata={"method": "rake"})
>>> sf._column_roles["weights"]
['weight', 'w_adj']
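
The index-alignment rule for a shorter pd.Series follows ordinary pandas semantics. The snippet below is a pandas-only sketch of that behavior, assuming the alignment works like a regular column assignment; it does not call add_weight_column() itself.

```python
import pandas as pd

df = pd.DataFrame({"x": [10, 20, 30]}, index=[0, 1, 2])

# Suppose an adjustment step dropped row 1 and returned weights for rows 0 and 2 only.
shorter = pd.Series([1.5, 2.5], index=[0, 2])

# Assigning a Series to a DataFrame column aligns on the index;
# rows absent from the Series receive NaN.
df["w_adj"] = shorter
```

Row 1 ends up with NaN in w_adj, which is how rows dropped by an adjustment function (for example with na_action="drop") would appear in the history column.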
property covar_columns: list[str]

Names of the covariate columns.

Returns a copy so that callers cannot accidentally mutate the internal column-role registry.

Returns:

Covariate column names.

Return type:

list[str]

Examples

>>> import pandas as pd
>>> from balance.sample_frame import SampleFrame
>>> df = pd.DataFrame({"id": ["1", "2"], "age": [25, 30],
...                    "income": [50000, 60000], "weight": [1.0, 1.0]})
>>> sf = SampleFrame.from_frame(df)
>>> sf.covar_columns
['age', 'income']
covars(formula: str | list[str] | None = None) -> Any

Return a BalanceDFCovars for this SampleFrame.

Creates a covariate analysis view backed by this SampleFrame, inheriting any linked sources set via _links.

Parameters:

formula (str or list of str, optional) – Formula string (or list of formula strings) for model matrix construction. Passed through to BalanceDFCovars.

Returns:

Covariate view backed by this SampleFrame.

Return type:

BalanceDFCovars

Examples

>>> import pandas as pd
>>> from balance.sample_frame import SampleFrame
>>> sf = SampleFrame.from_frame(
...     pd.DataFrame({"id": [1, 2], "x": [10.0, 20.0],
...                   "weight": [1.0, 1.0]}))
>>> sf.covars().df.columns.tolist()
['x']
property df: DataFrame

Full DataFrame reconstruction.

property df_covars: DataFrame

Covariate columns as a DataFrame.

Returns a copy so that callers cannot accidentally mutate the internal data.

Returns:

A copy of the covariate columns.

Return type:

pd.DataFrame

Examples

>>> import pandas as pd
>>> from balance.sample_frame import SampleFrame
>>> df = pd.DataFrame({"id": ["1", "2"], "age": [25, 30],
...                    "income": [50000, 60000], "weight": [1.0, 1.0]})
>>> sf = SampleFrame.from_frame(df)
>>> covars = sf.df_covars
>>> covars["age"] = [999, 999]
>>> list(sf.df_covars["age"])  # internal data unchanged
[25.0, 30.0]
property df_ignored: DataFrame | None

Ignored columns, or None.

Returns a copy so that callers cannot accidentally mutate the internal data.

Returns:

A copy of the ignored columns, or None if no ignored columns are registered.

Return type:

pd.DataFrame | None

Examples

>>> import pandas as pd
>>> from balance.sample_frame import SampleFrame
>>> df = pd.DataFrame({"id": ["1", "2"], "x": [10, 20],
...                    "weight": [1.0, 1.0], "region": ["US", "UK"]})
>>> sf = SampleFrame.from_frame(df, ignored_columns=["region"])
>>> m = sf.df_ignored
>>> m["region"] = ["XX", "XX"]
>>> list(sf.df_ignored["region"])  # internal data unchanged
['US', 'UK']
property df_outcomes: DataFrame | None

Outcome columns, or None if no outcomes.

Returns a copy so that callers cannot accidentally mutate the internal data.

Returns:

A copy of the outcome columns, or None if no outcome columns are registered.

Return type:

pd.DataFrame | None

Examples

>>> import pandas as pd
>>> from balance.sample_frame import SampleFrame
>>> df = pd.DataFrame({"id": ["1", "2"], "x": [10, 20],
...                    "weight": [1.0, 1.0], "y": [5, 6]})
>>> sf = SampleFrame.from_frame(df, outcome_columns=["y"])
>>> out = sf.df_outcomes
>>> out["y"] = [999, 999]
>>> list(sf.df_outcomes["y"])  # internal data unchanged
[5.0, 6.0]
property df_weights: DataFrame

Active weight column as a single-column DataFrame.

Returns a copy so that callers cannot accidentally mutate the internal data.

Returns:

A copy of the active weight column, or an empty DataFrame if no active weight is set.

Return type:

pd.DataFrame

Examples

>>> import pandas as pd
>>> from balance.sample_frame import SampleFrame
>>> df = pd.DataFrame({"id": ["1", "2"], "x": [10, 20],
...                    "weight": [1.0, 2.0]})
>>> sf = SampleFrame.from_frame(df)
>>> w = sf.df_weights
>>> w["weight"] = [999.0, 999.0]
>>> list(sf.df_weights["weight"])  # internal data unchanged
[1.0, 2.0]
classmethod from_frame(df: DataFrame, id_column: str | None = None, covar_columns: list[str] | None = None, weight_column: str | None = None, outcome_columns: list[str] | tuple[str, ...] | str | None = None, predicted_outcome_columns: list[str] | tuple[str, ...] | str | None = None, ignored_columns: list[str] | tuple[str, ...] | str | None = None, check_id_uniqueness: bool = True, standardize_types: bool = True, use_deepcopy: bool = True, id_column_candidates: list[str] | tuple[str, ...] | str | None = None) -> SampleFrame

Create a SampleFrame from a pandas DataFrame with auto-detection.

Infers id, weight, and covariate columns from column names when not explicitly provided. Validates the data (e.g., unique IDs, non-negative weights) and standardizes dtypes (Int64 -> float64, pd.NA -> np.nan).

Parameters:
  • df (pd.DataFrame) – The input DataFrame containing survey or observational data.

  • id_column (str, optional) – Name of the column to use as row identifier. If None, guessed from common names ("id", "ID", etc.).

  • covar_columns (list of str, optional) – Explicit list of covariate column names. If None, inferred by exclusion (all columns minus id, weight, outcome, predicted, and ignored columns).

  • weight_column (str, optional) – Name of the column containing sampling weights. If None, guesses "weight"/"weights" or creates one filled with 1.0.

  • outcome_columns (list of str or str, optional) – Column names to treat as outcome variables.

  • predicted_outcome_columns (list of str or str, optional) – Column names to treat as predicted outcome variables.

  • ignored_columns (list of str or str, optional) – Column names to ignore (excluded from covariates).

  • check_id_uniqueness (bool) – Whether to verify id uniqueness. Defaults to True.

  • standardize_types (bool) – Whether to standardize dtypes. Defaults to True.

  • use_deepcopy (bool) – Whether to deep-copy the input DataFrame. Defaults to True.

  • id_column_candidates (list of str, optional) – Candidate id column names to try when id_column is None.

Returns:

A validated SampleFrame with standardized dtypes.

Return type:

SampleFrame

Raises:

ValueError – If the id column contains nulls or duplicates, if the weight column contains nulls or negative values, or if specified outcome, predicted-outcome, or ignored columns are missing from the DataFrame.

Examples

>>> import pandas as pd
>>> from balance.sample_frame import SampleFrame
>>> df = pd.DataFrame({"id": [1, 2, 3], "weight": [1.0, 2.0, 1.5],
...                    "age": [25, 30, 35], "income": [50000, 60000, 70000]})
>>> sf = SampleFrame.from_frame(df)
>>> list(sf.df_covars.columns)
['age', 'income']
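
The two ideas behind from_frame(), covariate inference by exclusion and dtype standardization, can be illustrated with pandas alone. This is a sketch of the concepts under the assumption that they behave as described above; the variable names are hypothetical and the code does not reproduce the library's internals.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "weight": [1.0, 2.0, 1.5],
        "age": pd.array([25, None, 35], dtype="Int64"),  # nullable integer holding pd.NA
        "y": [0, 1, 1],
    }
)

# Columns that already have a role; everything else is treated as a covariate.
assigned = {"id", "weight", "y"}
covar_columns = [c for c in df.columns if c not in assigned]

# Nullable Int64 -> float64; pd.NA becomes np.nan in the process.
age_float = df["age"].astype("float64")
```

Here covar_columns contains only "age", and age_float is a float64 Series whose missing entry is np.nan rather than pd.NA.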
classmethod from_sample(sample: Any) -> SampleFrame

Convert a Sample to a SampleFrame.

Preserves the Sample’s tabular data and column role assignments: id column, weight column, outcome columns, and ignored columns. Covariate columns are inferred by exclusion, matching the Sample’s own logic.

The internal DataFrame is deep-copied so that the resulting SampleFrame is fully independent of the original Sample.

Warning

Data not preserved in the conversion

The following Sample attributes are not carried over:

  • _adjustment_model — the fitted model dictionary stored by adjust().

  • _links — references to target, unadjusted, and other linked Samples (used by BalanceDF for comparative display).

  • predicted_outcome_columns — Sample has no native concept of predicted-outcome columns, so the resulting SampleFrame will always have an empty predicted role.

  • Column ordering may differ after a round-trip (Sample → SampleFrame → Sample), since SampleFrame stores columns grouped by role rather than preserving the original DataFrame column order.

Parameters:

sample (Sample) – A Sample instance.

Returns:

A new SampleFrame mirroring the Sample’s data and column roles.

Return type:

SampleFrame

Raises:

TypeError – If sample is not a Sample instance.

Examples

>>> import pandas as pd
>>> from balance.sample_class import Sample
>>> from balance.sample_frame import SampleFrame
>>> s = Sample.from_frame(
...     pd.DataFrame({"id": [1, 2], "x": [10.0, 20.0], "weight": [1.0, 2.0]}))
>>> sf = SampleFrame.from_sample(s)
>>> list(sf.df_covars.columns)
['x']
property id_column: str

Name of the ID column.

Note

In balance 0.20.0, id_column was changed from returning ID data (pd.Series) to returning the column name (str), for consistency with weight_column. If you need ID data, use id_series.

Returns:

The ID column name.

Return type:

str

Examples

>>> import pandas as pd
>>> from balance.sample_frame import SampleFrame
>>> df = pd.DataFrame({"id": ["1", "2"], "x": [10, 20],
...                    "weight": [1.0, 1.0]})
>>> sf = SampleFrame.from_frame(df)
>>> sf.id_column
'id'
property id_series: Series

The ID column as a Series.

Returns a copy so that callers cannot accidentally mutate the internal data.

Returns:

A copy of the ID column.

Return type:

pd.Series

Examples

>>> import pandas as pd
>>> from balance.sample_frame import SampleFrame
>>> df = pd.DataFrame({"id": ["1", "2"], "x": [10, 20],
...                    "weight": [1.0, 1.0]})
>>> sf = SampleFrame.from_frame(df)
>>> ids = sf.id_series
>>> ids.iloc[0] = "MUTATED"
>>> sf.id_series.iloc[0]  # internal data unchanged
'1'
property ignored_columns: list[str]

Names of the ignored columns.

Returns a copy so that callers cannot accidentally mutate the internal column-role registry.

Returns:

Ignored column names (empty list if none).

Return type:

list[str]

Examples

>>> import pandas as pd
>>> from balance.sample_frame import SampleFrame
>>> df = pd.DataFrame({"id": ["1", "2"], "x": [10, 20],
...                    "weight": [1.0, 1.0], "region": ["US", "UK"]})
>>> sf = SampleFrame.from_frame(df, ignored_columns=["region"])
>>> sf.ignored_columns
['region']
property outcome_columns: list[str]

Names of the outcome columns.

Returns a copy so that callers cannot accidentally mutate the internal column-role registry.

Returns:

Outcome column names (empty list if none).

Return type:

list[str]

Examples

>>> import pandas as pd
>>> from balance.sample_frame import SampleFrame
>>> df = pd.DataFrame({"id": ["1", "2"], "x": [10, 20],
...                    "weight": [1.0, 1.0], "y": [5, 6]})
>>> sf = SampleFrame.from_frame(df, outcome_columns=["y"])
>>> sf.outcome_columns
['y']
outcomes() -> Any | None

Return a BalanceDFOutcomes, or None.

Returns None if this SampleFrame has no outcome columns.

Returns:

Outcome view backed by this SampleFrame, or None if no outcomes are defined.

Return type:

BalanceDFOutcomes or None

Examples

>>> import pandas as pd
>>> from balance.sample_frame import SampleFrame
>>> sf = SampleFrame.from_frame(
...     pd.DataFrame({"id": [1, 2], "x": [10.0, 20.0],
...                   "y": [1.0, 0.0], "weight": [1.0, 1.0]}),
...     outcome_columns=["y"])
>>> sf.outcomes().df.columns.tolist()
['y']
property predicted_outcome_columns: list[str]

Names of the predicted outcome columns.

Returns a copy so that callers cannot accidentally mutate the internal column-role registry.

Returns:

Predicted outcome column names (empty list if none).

Return type:

list[str]

Examples

>>> import pandas as pd
>>> from balance.sample_frame import SampleFrame
>>> df = pd.DataFrame({"id": ["1", "2"], "x": [10, 20],
...                    "weight": [1.0, 1.0], "p_y": [0.3, 0.7]})
>>> sf = SampleFrame.from_frame(df, predicted_outcome_columns=["p_y"])
>>> sf.predicted_outcome_columns
['p_y']
rename_weight_column(old_name: str, new_name: str) -> None

Rename a weight column in-place.

Renames the column in the DataFrame, updates the column roles list, active weight pointer, and weight metadata.

Parameters:
  • old_name (str) – Current name of the weight column.

  • new_name (str) – New name for the weight column.

Raises:

ValueError – If old_name is not a registered weight column, or if new_name already exists in the DataFrame.

set_active_weight(column_name: str) -> None

Set which weight column is the active one.

The active weight column is the one returned by df_weights.

Parameters:

column_name (str) – Must be a registered weight column.

Raises:

ValueError – If column_name is not a weight column.

Examples

>>> import pandas as pd
>>> from balance.sample_frame import SampleFrame
>>> sf = SampleFrame._create(
...     df=pd.DataFrame({"id": [1], "x": [10], "w1": [1.0], "w2": [2.0]}),
...     id_column="id", covar_columns=["x"],
...     weight_columns=["w1", "w2"])
>>> sf.set_active_weight("w2")
>>> list(sf.df_weights.columns)
['w2']
set_weight_metadata(column: str, metadata: dict[str, Any]) -> None

Store provenance metadata for a weight column.

Metadata is an arbitrary dict that can track adjustment method, hyperparameters, timestamps, or any other provenance information relevant to how the weight column was computed.

Parameters:
  • column (str) – Name of the weight column.

  • metadata (dict) – Arbitrary metadata dict (e.g. method name, hyperparameters, timestamp).

Raises:

ValueError – If column is not a registered weight column.

Examples

>>> import pandas as pd
>>> from balance.sample_frame import SampleFrame
>>> df = pd.DataFrame({"id": ["1", "2"], "x": [10, 20],
...                    "weight": [1.0, 2.0]})
>>> sf = SampleFrame.from_frame(df)
>>> sf.set_weight_metadata("weight", {"method": "ipw"})
>>> sf.weight_metadata()
{'method': 'ipw'}
set_weights(weights: Series | float | None, *, use_index: bool = False) -> None

Replace the active weight column values.

This is the canonical weight-update method for balance objects. Both SampleFrame and BalanceFrame use this implementation (BalanceFrame delegates here). It also satisfies the BalanceDFSource protocol and is used by BalanceDFWeights.trim() to update weight values after trimming.

If weights is a float, all rows are set to that value. If None, all rows are set to 1.0. If a Series, behavior depends on use_index:

  • use_index=False (default): the Series must have the same length as the DataFrame; values are assigned positionally.

  • use_index=True: values are aligned by index. Rows whose index is missing from weights are set to NaN (pandas index-alignment semantics), and a warning is emitted.

All weight values are cast to float64.

Parameters:
  • weights (pd.Series, float, or None) – New weight values: a Series, a scalar, or None.

  • use_index (bool) – If True, align a Series by index instead of requiring an exact length match.

Raises:

ValueError – If no active weight column is set, or if use_index=False and a Series has a different length than the DataFrame.

Examples

>>> import pandas as pd
>>> from balance.sample_frame import SampleFrame
>>> df = pd.DataFrame({"id": [1, 2], "x": [10, 20],
...                    "weight": [1.0, 2.0]})
>>> sf = SampleFrame.from_frame(df)
>>> sf.set_weights(pd.Series([3.0, 4.0]))
>>> sf.weight_series.tolist()
[3.0, 4.0]
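
The difference between the two use_index modes can be shown with plain pandas. The snippet is a sketch of the two alignment strategies described above, assuming they map onto reindexing and positional assignment; it does not call set_weights().

```python
import pandas as pd

df = pd.DataFrame({"weight": [1.0, 2.0, 3.0]}, index=[10, 11, 12])
new = pd.Series([5.0, 6.0], index=[12, 10])

# use_index=True analogue: align on index labels; label 11 has no match, so it gets NaN.
by_index = new.reindex(df.index).astype("float64")

# use_index=False analogue: lengths must match and values are assigned by position.
full = pd.Series([7.0, 8.0, 9.0], index=[99, 98, 97])  # these labels are ignored
by_position = pd.Series(full.to_numpy(dtype="float64"), index=df.index)
```

With index alignment the values follow their labels (label 10 receives 6.0, label 12 receives 5.0), whereas positional assignment keeps the original order of the values and discards the incoming index.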
trim(ratio: float | int | None = None, percentile: float | tuple[float, float] | None = None, keep_sum_of_weights: bool = True, target_sum_weights: float | int | np.floating | None = None, *, inplace: bool = False) -> Self

Trim extreme weights using mean-ratio clipping or percentile winsorization.

Delegates to trim_weights() for the computation, then writes the result back via set_weights(). A weight history column (weight_trimmed_N) is added so the pre-trim values are preserved.

Parameters:
  • ratio – Mean-ratio upper bound. Mutually exclusive with percentile.

  • percentile – Percentile(s) for winsorization. Mutually exclusive with ratio.

  • keep_sum_of_weights – Whether to rescale after trimming to preserve the original sum of weights.

  • target_sum_weights – If provided, rescale trimmed weights so their sum equals this numeric target value. (This is a general-purpose rescaling parameter — not related to the “target population” concept in BalanceFrame.)

  • inplace – If True, mutate this SampleFrame and return it. If False (default), return a new SampleFrame with trimmed weights and the original left untouched.

Returns:

The SampleFrame with trimmed weights (self if inplace, else a new copy).

Examples

>>> import pandas as pd
>>> from balance.sample_frame import SampleFrame
>>> sf = SampleFrame.from_frame(
...     pd.DataFrame({"id": [1, 2, 3], "weight": [1.0, 2.0, 100.0]}))
>>> sf2 = sf.trim(ratio=2)
>>> sf2.weight_series.max() < 100.0
True
>>> "weight_trimmed_1" in sf2._df.columns
True
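
Mean-ratio clipping followed by rescaling can be worked through by hand in NumPy. This sketch assumes that the ratio bound means "cap each weight at ratio times the mean weight" and that keep_sum_of_weights rescales proportionally; the actual computation lives in trim_weights() and may differ in detail.

```python
import numpy as np

weights = np.array([1.0, 2.0, 100.0])
ratio = 2

# Cap each weight at ratio * mean(weights) (assumed reading of "mean-ratio upper bound").
cap = ratio * weights.mean()
clipped = np.minimum(weights, cap)

# Rescale so the trimmed weights keep the original total
# (an analogue of keep_sum_of_weights=True).
trimmed = clipped * (weights.sum() / clipped.sum())
```

Under these assumptions the largest weight drops below 100.0 while the total stays at 103.0, which is consistent with the doctest above.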
property weight_column: str | None

Name of the currently active weight column, or None.

Note

In balance 0.19.0, weight_column was changed from returning weight data (pd.Series) to returning the column name (str). If you need weight data, use weight_series.

Returns:

The active weight column name.

Return type:

str | None

Examples

>>> import pandas as pd
>>> from balance.sample_frame import SampleFrame
>>> df = pd.DataFrame({"id": ["1", "2"], "x": [10, 20],
...                    "weight": [1.0, 2.0]})
>>> sf = SampleFrame.from_frame(df)
>>> sf.weight_column
'weight'
property weight_columns_all: list[str]

Names of all registered weight columns.

Returns a copy so that callers cannot accidentally mutate the internal column-role registry.

Returns:

Weight column names.

Return type:

list[str]

Examples

>>> import pandas as pd
>>> from balance.sample_frame import SampleFrame
>>> sf = SampleFrame._create(
...     df=pd.DataFrame({"id": [1], "x": [10], "w1": [1.0], "w2": [2.0]}),
...     id_column="id", covar_columns=["x"],
...     weight_columns=["w1", "w2"])
>>> sf.weight_columns_all
['w1', 'w2']
weight_metadata(column: str | None = None) -> dict[str, Any]

Retrieve metadata for a weight column.

Parameters:

column (str, optional) – Weight column name. Defaults to the active weight column.

Returns:

The metadata dict, or an empty dict if none was set.

Return type:

dict

Examples

>>> import pandas as pd
>>> from balance.sample_frame import SampleFrame
>>> df = pd.DataFrame({"id": ["1", "2"], "x": [10, 20],
...                    "weight": [1.0, 2.0]})
>>> sf = SampleFrame.from_frame(df)
>>> sf.weight_metadata()
{}
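
The lookup rules for weight metadata (default to the active column, return an empty dict when nothing was stored) can be sketched with a plain dictionary. The module-level names below are hypothetical stand-ins, and returning a copy of the stored dict is a choice made for this sketch rather than documented behavior.

```python
from typing import Any, Optional

registered_weights = ["weight", "w_adj"]   # hypothetical registry of weight columns
active_weight = "weight"                    # hypothetical active-weight pointer
_metadata: dict[str, dict[str, Any]] = {}   # column name -> provenance dict


def set_weight_metadata(column: str, metadata: dict[str, Any]) -> None:
    # Only registered weight columns may carry metadata.
    if column not in registered_weights:
        raise ValueError(f"{column!r} is not a registered weight column")
    _metadata[column] = dict(metadata)


def weight_metadata(column: Optional[str] = None) -> dict[str, Any]:
    # Fall back to the active weight column, then to an empty dict.
    key = active_weight if column is None else column
    return dict(_metadata.get(key, {}))


set_weight_metadata("w_adj", {"method": "rake"})
```

In this sketch weight_metadata() returns an empty dict because nothing was stored for the active column, while weight_metadata("w_adj") returns the stored provenance.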
property weight_series: Series

Active weight column as a Series (BalanceDFSource protocol).

Returns the active weight column values as a pd.Series. This is the thin protocol-level accessor used by BalanceDF and its subclasses. Unlike df_weights which returns a single-column DataFrame, this returns a plain Series.

Returns:

The active weight column values.

Return type:

pd.Series

Raises:

ValueError – If no active weight column is set.

Examples

>>> import pandas as pd
>>> from balance.sample_frame import SampleFrame
>>> df = pd.DataFrame({"id": [1, 2], "x": [10, 20],
...                    "weight": [1.0, 2.0]})
>>> sf = SampleFrame.from_frame(df)
>>> sf.weight_series.tolist()
[1.0, 2.0]
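
The contrast between weight_series and df_weights mirrors the difference between single-bracket and double-bracket selection in pandas. The snippet is an analogy in plain pandas, with a hypothetical variable named active standing in for the active weight column name.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "weight": [1.0, 2.0]})
active = "weight"

as_series = df[active]          # pd.Series, analogous to weight_series
as_frame = df[[active]].copy()  # single-column pd.DataFrame, analogous to df_weights
```

Both objects hold the same values; the Series form is convenient for arithmetic, while the DataFrame form keeps the column name as a column label.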
weights() -> Any

Return a BalanceDFWeights for this SampleFrame.

Creates a weight analysis view backed by this SampleFrame, inheriting any linked sources set via _links.

Returns:

Weight view backed by this SampleFrame.

Return type:

BalanceDFWeights

Examples

>>> import pandas as pd
>>> from balance.sample_frame import SampleFrame
>>> sf = SampleFrame.from_frame(
...     pd.DataFrame({"id": [1, 2], "x": [10.0, 20.0],
...                   "weight": [1.0, 2.0]}))
>>> sf.weights().df.columns.tolist()
['weight']