balance.sample_frame
SampleFrame: an explicit-role DataFrame container for the Balance library.
Stores covariates, weights, outcomes, predicted, and ignored columns with explicit role metadata, replacing the inference-by-exclusion pattern used in the legacy Sample class.
- class balance.sample_frame.SampleFrame
A DataFrame container with explicit column-role metadata.
SampleFrame stores data as a single internal pd.DataFrame but with explicit metadata tracking which columns belong to which role: covars (X), weights (W), outcomes (Y), predicted_outcomes (Y_hat), ignored.
Must be constructed via SampleFrame.from_frame() or SampleFrame.from_sample().
- Mutability:
SampleFrame is mostly immutable at the data level. The underlying DataFrame and column-role assignments are set at construction time and are not replaced afterwards. All data-access properties (e.g. df_covars, df_weights) return copies, so callers cannot mutate internal state through the returned objects.
Controlled mutation points (methods that intentionally modify the instance in-place):
- set_active_weight(): changes which weight column is active.
- add_weight_column(): appends a new weight column to the frame.
- set_weight_metadata(): updates weight provenance metadata.
Other methods documented on this page, namely set_weights(), rename_weight_column(), and trim(inplace=True), also modify the instance in-place.
These mutations are intentional and expected as part of normal usage (e.g. after calling BalanceFrame.adjust()). Outside of these methods the object behaves as immutable.
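The copy-on-read contract described above can be illustrated with a minimal, pure-Python sketch. `RoleFrame` and its attributes are hypothetical names chosen for illustration; they are not part of balance, and plain dicts and lists stand in for the internal DataFrame.

```python
import copy


class RoleFrame:
    """Hypothetical stand-in for a container that returns copies on read."""

    def __init__(self, columns, weight_columns, active_weight):
        # Internal state is copied in at construction and never handed out.
        self._columns = copy.deepcopy(columns)
        self._weight_columns = list(weight_columns)
        self._active_weight = active_weight

    @property
    def weights(self):
        # Return a copy, so mutating the result cannot touch internal state.
        return list(self._columns[self._active_weight])

    def set_active_weight(self, name):
        # Controlled mutation point: an explicit, validated state change.
        if name not in self._weight_columns:
            raise ValueError(f"{name!r} is not a registered weight column")
        self._active_weight = name


frame = RoleFrame({"w1": [1.0, 2.0], "w2": [3.0, 4.0]}, ["w1", "w2"], "w1")
leaked = frame.weights
leaked[0] = 999.0            # mutates only the returned copy
unchanged = frame.weights    # internal values are intact
frame.set_active_weight("w2")
switched = frame.weights     # now reads from the other weight column
```

Reads are cheap to reason about because they never alias internal state; every state change goes through a named method that can validate its input.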
- add_weight_column(name: str, values: Series, metadata: dict[str, Any] | None = None) -> None
Add a new weight column to the SampleFrame.
The column is appended to the internal DataFrame and registered as a weight column. Optionally associates provenance metadata.
- Parameters:
name (str) – Name for the new weight column.
values (pd.Series) – Weight values. Must match the DataFrame length, unless it is a shorter pd.Series, in which case values are aligned by index and missing rows are filled with NaN (this supports adjustment functions that drop rows internally, e.g., na_action="drop"). Note: this column is a history column, not the active weight; the active weight is set separately via set_weights().
metadata (dict, optional) – Provenance metadata for the new column.
- Raises:
ValueError – If name is already a registered weight column, if name already exists in the DataFrame as a non-weight column, or if values is longer than the DataFrame or is a non-Series with a different length.
Examples
>>> import pandas as pd
>>> from balance.sample_frame import SampleFrame
>>> df = pd.DataFrame({"id": ["1", "2"], "x": [10, 20],
...                    "weight": [1.0, 2.0]})
>>> sf = SampleFrame.from_frame(df)
>>> sf.add_weight_column("w_adj", pd.Series([1.5, 1.5]),
...                      metadata={"method": "rake"})
>>> sf._column_roles["weights"]
['weight', 'w_adj']
- property covar_columns: list[str]
Names of the covariate columns.
Returns a copy so that callers cannot accidentally mutate the internal column-role registry.
- Returns:
Covariate column names.
- Return type:
list[str]
Examples
>>> import pandas as pd
>>> from balance.sample_frame import SampleFrame
>>> df = pd.DataFrame({"id": ["1", "2"], "age": [25, 30],
...                    "income": [50000, 60000], "weight": [1.0, 1.0]})
>>> sf = SampleFrame.from_frame(df)
>>> sf.covar_columns
['age', 'income']
- covars(formula: str | list[str] | None = None) -> Any
Return a BalanceDFCovars for this SampleFrame.
Creates a covariate analysis view backed by this SampleFrame, inheriting any linked sources set via _links.
- Parameters:
formula – Optional formula string (or list) for model matrix construction. Passed through to BalanceDFCovars.
- Returns:
Covariate view backed by this SampleFrame.
- Return type:
BalanceDFCovars
Examples
>>> import pandas as pd
>>> from balance.sample_frame import SampleFrame
>>> sf = SampleFrame.from_frame(
...     pd.DataFrame({"id": [1, 2], "x": [10.0, 20.0],
...                   "weight": [1.0, 1.0]}))
>>> sf.covars().df.columns.tolist()
['x']
- property df: DataFrame
Full DataFrame reconstruction.
- property df_covars: DataFrame
Covariate columns as a DataFrame.
Returns a copy so that callers cannot accidentally mutate the internal data.
- Returns:
A copy of the covariate columns.
- Return type:
pd.DataFrame
Examples
>>> import pandas as pd
>>> from balance.sample_frame import SampleFrame
>>> df = pd.DataFrame({"id": ["1", "2"], "age": [25, 30],
...                    "income": [50000, 60000], "weight": [1.0, 1.0]})
>>> sf = SampleFrame.from_frame(df)
>>> covars = sf.df_covars
>>> covars["age"] = [999, 999]
>>> list(sf.df_covars["age"])  # internal data unchanged
[25.0, 30.0]
- property df_ignored: DataFrame | None
Ignored columns, or None.
Returns a copy so that callers cannot accidentally mutate the internal data.
- Returns:
A copy of the ignored columns, or None if no ignored columns are registered.
- Return type:
pd.DataFrame | None
Examples
>>> import pandas as pd
>>> from balance.sample_frame import SampleFrame
>>> df = pd.DataFrame({"id": ["1", "2"], "x": [10, 20],
...                    "weight": [1.0, 1.0], "region": ["US", "UK"]})
>>> sf = SampleFrame.from_frame(df, ignored_columns=["region"])
>>> m = sf.df_ignored
>>> m["region"] = ["XX", "XX"]
>>> list(sf.df_ignored["region"])  # internal data unchanged
['US', 'UK']
- property df_outcomes: DataFrame | None
Outcome columns, or None if no outcomes.
Returns a copy so that callers cannot accidentally mutate the internal data.
- Returns:
A copy of the outcome columns, or None if no outcome columns are registered.
- Return type:
pd.DataFrame | None
Examples
>>> import pandas as pd
>>> from balance.sample_frame import SampleFrame
>>> df = pd.DataFrame({"id": ["1", "2"], "x": [10, 20],
...                    "weight": [1.0, 1.0], "y": [5, 6]})
>>> sf = SampleFrame.from_frame(df, outcome_columns=["y"])
>>> out = sf.df_outcomes
>>> out["y"] = [999, 999]
>>> list(sf.df_outcomes["y"])  # internal data unchanged
[5.0, 6.0]
- property df_weights: DataFrame
Active weight column as a single-column DataFrame.
Returns a copy so that callers cannot accidentally mutate the internal data.
- Returns:
A copy of the active weight column, or an empty DataFrame if no active weight is set.
- Return type:
pd.DataFrame
Examples
>>> import pandas as pd
>>> from balance.sample_frame import SampleFrame
>>> df = pd.DataFrame({"id": ["1", "2"], "x": [10, 20],
...                    "weight": [1.0, 2.0]})
>>> sf = SampleFrame.from_frame(df)
>>> w = sf.df_weights
>>> w["weight"] = [999.0, 999.0]
>>> list(sf.df_weights["weight"])  # internal data unchanged
[1.0, 2.0]
- classmethod from_frame(df: DataFrame, id_column: str | None = None, covar_columns: list[str] | None = None, weight_column: str | None = None, outcome_columns: list[str] | tuple[str, ...] | str | None = None, predicted_outcome_columns: list[str] | tuple[str, ...] | str | None = None, ignored_columns: list[str] | tuple[str, ...] | str | None = None, check_id_uniqueness: bool = True, standardize_types: bool = True, use_deepcopy: bool = True, id_column_candidates: list[str] | tuple[str, ...] | str | None = None) -> SampleFrame
Create a SampleFrame from a pandas DataFrame with auto-detection.
Infers id, weight, and covariate columns from column names when not explicitly provided. Validates the data (e.g., unique IDs, non-negative weights) and standardizes dtypes (Int64 -> float64, pd.NA -> np.nan).
- Parameters:
df (pd.DataFrame) – The input DataFrame containing survey or observational data.
id_column (str, optional) – Name of the column to use as row identifier. If None, guessed from common names ("id", "ID", etc.).
covar_columns (list of str, optional) – Explicit list of covariate column names. If None, inferred by exclusion (all columns minus id, weight, outcome, predicted, and ignored columns).
weight_column (str, optional) – Name of the column containing sampling weights. If None, guesses "weight"/"weights" or creates one filled with 1.0.
outcome_columns (list of str or str, optional) – Column names to treat as outcome variables.
predicted_outcome_columns (list of str or str, optional) – Column names to treat as predicted outcome variables.
ignored_columns (list of str or str, optional) – Column names to ignore (excluded from covariates).
check_id_uniqueness (bool) – Whether to verify id uniqueness. Defaults to True.
standardize_types (bool) – Whether to standardize dtypes. Defaults to True.
use_deepcopy (bool) – Whether to deep-copy the input DataFrame. Defaults to True.
id_column_candidates (list of str, optional) – Candidate id column names to try when id_column is None.
- Returns:
A validated SampleFrame with standardized dtypes.
- Return type:
SampleFrame
- Raises:
ValueError – If the id column contains nulls or duplicates, if the weight column contains nulls or negative values, or if specified outcome/predicted/ignored columns are missing from the DataFrame.
Examples
>>> import pandas as pd
>>> from balance.sample_frame import SampleFrame
>>> df = pd.DataFrame({"id": [1, 2, 3], "weight": [1.0, 2.0, 1.5],
...                    "age": [25, 30, 35], "income": [50000, 60000, 70000]})
>>> sf = SampleFrame.from_frame(df)
>>> list(sf.df_covars.columns)
['age', 'income']
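The inference-by-exclusion rule for covar_columns, together with the "list of str or str" convention of the role parameters, can be sketched in plain Python. This is an illustration only, not balance's implementation; `as_list` and `infer_covar_columns` are hypothetical helper names.

```python
def as_list(value):
    """Normalize None, a single name, or a list/tuple of names to a list."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)


def infer_covar_columns(columns, id_column, weight_column,
                        outcome_columns=None, predicted_outcome_columns=None,
                        ignored_columns=None):
    """Return the columns that hold no other role, preserving input order."""
    taken = {id_column, weight_column}
    taken.update(as_list(outcome_columns))
    taken.update(as_list(predicted_outcome_columns))
    taken.update(as_list(ignored_columns))
    return [name for name in columns if name not in taken]


columns = ["id", "age", "income", "y", "region", "weight"]
covars = infer_covar_columns(
    columns, id_column="id", weight_column="weight",
    outcome_columns="y",            # a bare string is accepted
    ignored_columns=["region"],     # so is a list
)
```

Normalizing a bare string first matters: without it, a name such as "region" would be treated as a sequence of single characters.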
- classmethod from_sample(sample: Any) -> SampleFrame
Convert a Sample to a SampleFrame.
Preserves the Sample’s tabular data and column role assignments: id column, weight column, outcome columns, and ignored columns. Covariate columns are inferred by exclusion, matching the Sample’s own logic.
The internal DataFrame is deep-copied so that the resulting SampleFrame is fully independent of the original Sample.
Warning
Data not preserved in the conversion
The following Sample attributes are not carried over:
- _adjustment_model: the fitted model dictionary stored by adjust().
- _links: references to target, unadjusted, and other linked Samples (used by BalanceDF for comparative display).
- predicted_outcome_columns: Sample has no native concept of predicted-outcome columns, so the resulting SampleFrame will always have an empty predicted role.
Column ordering may differ after a round-trip (Sample → SampleFrame → Sample), since SampleFrame stores columns grouped by role rather than preserving the original DataFrame column order.
- Parameters:
sample – A Sample instance.
- Returns:
A new SampleFrame mirroring the Sample’s data and column roles.
- Return type:
SampleFrame
- Raises:
TypeError – If sample is not a Sample instance.
Examples
>>> import pandas as pd
>>> from balance.sample_class import Sample
>>> from balance.sample_frame import SampleFrame
>>> s = Sample.from_frame(
...     pd.DataFrame({"id": [1, 2], "x": [10.0, 20.0], "weight": [1.0, 2.0]}))
>>> sf = SampleFrame.from_sample(s)
>>> list(sf.df_covars.columns)
['x']
- property id_column: str
Name of the ID column.
Note
In balance 0.20.0, id_column was changed from returning ID data (pd.Series) to returning the column name (str), for consistency with weight_column. If you need ID data, use id_series.
- Returns:
The ID column name.
- Return type:
str
Examples
>>> import pandas as pd
>>> from balance.sample_frame import SampleFrame
>>> df = pd.DataFrame({"id": ["1", "2"], "x": [10, 20],
...                    "weight": [1.0, 1.0]})
>>> sf = SampleFrame.from_frame(df)
>>> sf.id_column
'id'
- property id_series: Series
The ID column as a Series.
Returns a copy so that callers cannot accidentally mutate the internal data.
- Returns:
A copy of the ID column.
- Return type:
pd.Series
Examples
>>> import pandas as pd
>>> from balance.sample_frame import SampleFrame
>>> df = pd.DataFrame({"id": ["1", "2"], "x": [10, 20],
...                    "weight": [1.0, 1.0]})
>>> sf = SampleFrame.from_frame(df)
>>> ids = sf.id_series
>>> ids.iloc[0] = "MUTATED"
>>> sf.id_series.iloc[0]  # internal data unchanged
'1'
- property ignored_columns: list[str]
Names of the ignored columns.
Returns a copy so that callers cannot accidentally mutate the internal column-role registry.
- Returns:
Ignored column names (empty list if none).
- Return type:
list[str]
Examples
>>> import pandas as pd
>>> from balance.sample_frame import SampleFrame
>>> df = pd.DataFrame({"id": ["1", "2"], "x": [10, 20],
...                    "weight": [1.0, 1.0], "region": ["US", "UK"]})
>>> sf = SampleFrame.from_frame(df, ignored_columns=["region"])
>>> sf.ignored_columns
['region']
- property outcome_columns: list[str]
Names of the outcome columns.
Returns a copy so that callers cannot accidentally mutate the internal column-role registry.
- Returns:
Outcome column names (empty list if none).
- Return type:
list[str]
Examples
>>> import pandas as pd
>>> from balance.sample_frame import SampleFrame
>>> df = pd.DataFrame({"id": ["1", "2"], "x": [10, 20],
...                    "weight": [1.0, 1.0], "y": [5, 6]})
>>> sf = SampleFrame.from_frame(df, outcome_columns=["y"])
>>> sf.outcome_columns
['y']
- outcomes() -> Any | None
Return a BalanceDFOutcomes, or None.
Returns None if this SampleFrame has no outcome columns.
- Returns:
Outcome view backed by this SampleFrame, or None if no outcomes are defined.
- Return type:
BalanceDFOutcomes or None
Examples
>>> import pandas as pd
>>> from balance.sample_frame import SampleFrame
>>> sf = SampleFrame.from_frame(
...     pd.DataFrame({"id": [1, 2], "x": [10.0, 20.0],
...                   "y": [1.0, 0.0], "weight": [1.0, 1.0]}),
...     outcome_columns=["y"])
>>> sf.outcomes().df.columns.tolist()
['y']
- property predicted_outcome_columns: list[str]
Names of the predicted outcome columns.
Returns a copy so that callers cannot accidentally mutate the internal column-role registry.
- Returns:
Predicted outcome column names (empty list if none).
- Return type:
list[str]
Examples
>>> import pandas as pd
>>> from balance.sample_frame import SampleFrame
>>> df = pd.DataFrame({"id": ["1", "2"], "x": [10, 20],
...                    "weight": [1.0, 1.0], "p_y": [0.3, 0.7]})
>>> sf = SampleFrame.from_frame(df, predicted_outcome_columns=["p_y"])
>>> sf.predicted_outcome_columns
['p_y']
- rename_weight_column(old_name: str, new_name: str) -> None
Rename a weight column in-place.
Renames the column in the DataFrame, updates the column roles list, active weight pointer, and weight metadata.
- Parameters:
old_name – Current name of the weight column.
new_name – New name for the weight column.
- Raises:
ValueError – If old_name is not a registered weight column, or if new_name already exists in the DataFrame.
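The bookkeeping that a rename has to keep in sync (data, role list, active pointer, metadata) can be sketched with a plain dict. The `state` layout and the function below are hypothetical and only mirror the description above; they are not balance's internals.

```python
def rename_weight_column(state, old_name, new_name):
    """Rename a weight column and keep every piece of bookkeeping in sync."""
    if old_name not in state["weight_columns"]:
        raise ValueError(f"{old_name!r} is not a registered weight column")
    if new_name in state["data"]:
        raise ValueError(f"{new_name!r} already exists")
    # 1. The data itself.
    state["data"][new_name] = state["data"].pop(old_name)
    # 2. The role list, keeping the column's position among the weights.
    position = state["weight_columns"].index(old_name)
    state["weight_columns"][position] = new_name
    # 3. The active-weight pointer, if it pointed at the old name.
    if state["active_weight"] == old_name:
        state["active_weight"] = new_name
    # 4. Provenance metadata, if any was recorded for the old name.
    if old_name in state["metadata"]:
        state["metadata"][new_name] = state["metadata"].pop(old_name)


state = {
    "data": {"x": [10, 20], "weight": [1.0, 2.0]},
    "weight_columns": ["weight"],
    "active_weight": "weight",
    "metadata": {"weight": {"method": "ipw"}},
}
rename_weight_column(state, "weight", "w_ipw")
```

Validating both names before touching any state means a failed rename leaves the object exactly as it was.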
- set_active_weight(column_name: str) -> None
Set which weight column is the active one.
The active weight column is the one returned by df_weights.
- Parameters:
column_name (str) – Must be a registered weight column.
- Raises:
ValueError – If column_name is not a weight column.
Examples
>>> import pandas as pd
>>> from balance.sample_frame import SampleFrame
>>> sf = SampleFrame._create(
...     df=pd.DataFrame({"id": [1], "x": [10], "w1": [1.0], "w2": [2.0]}),
...     id_column="id", covar_columns=["x"],
...     weight_columns=["w1", "w2"])
>>> sf.set_active_weight("w2")
>>> list(sf.df_weights.columns)
['w2']
- set_weight_metadata(column: str, metadata: dict[str, Any]) -> None
Store provenance metadata for a weight column.
Metadata is an arbitrary dict that can track adjustment method, hyperparameters, timestamps, or any other provenance information relevant to how the weight column was computed.
- Parameters:
column (str) – Name of the weight column.
metadata (dict) – Arbitrary metadata dict (e.g. method name, hyperparameters, timestamp).
- Raises:
ValueError – If column is not a registered weight column.
Examples
>>> import pandas as pd
>>> from balance.sample_frame import SampleFrame
>>> df = pd.DataFrame({"id": ["1", "2"], "x": [10, 20],
...                    "weight": [1.0, 2.0]})
>>> sf = SampleFrame.from_frame(df)
>>> sf.set_weight_metadata("weight", {"method": "ipw"})
>>> sf.weight_metadata()
{'method': 'ipw'}
- set_weights(weights: Series | float | None, *, use_index: bool = False) -> None
Replace the active weight column values.
This is the canonical weight-update method for balance objects. Both SampleFrame and BalanceFrame use this implementation (BalanceFrame delegates here). It also satisfies the BalanceDFSource protocol and is used by BalanceDFWeights.trim() to update weight values after trimming.
If weights is a float, all rows are set to that value. If None, all rows are set to 1.0. If a Series, behavior depends on use_index:
- use_index=False (default): the Series must have the same length as the DataFrame; values are assigned positionally.
- use_index=True: values are aligned by index. Rows whose index is missing from weights are set to NaN (pandas index-alignment semantics), and a warning is emitted.
All weight values are cast to float64.
- Parameters:
weights – New weight values — a Series, scalar, or None.
use_index – If True, align a Series by index instead of requiring an exact length match.
- Raises:
ValueError – If no active weight column is set, or if use_index=False and a Series has a different length than the DataFrame.
Examples
>>> import pandas as pd
>>> from balance.sample_frame import SampleFrame
>>> df = pd.DataFrame({"id": [1, 2], "x": [10, 20],
...                    "weight": [1.0, 2.0]})
>>> sf = SampleFrame.from_frame(df)
>>> sf.set_weights(pd.Series([3.0, 4.0]))
>>> sf.weight_series.tolist()
[3.0, 4.0]
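The four input cases (None, scalar, positional Series, index-aligned Series) can be sketched without pandas. In this sketch a dict mapping index label to value stands in for a Series, and `resolve_weights` is a hypothetical helper; it is not balance's code and it omits the warning emitted for missing rows.

```python
import math


def resolve_weights(weights, frame_index, use_index=False):
    """Return one float per row of frame_index.

    `weights` may be None, a number, or a dict {index label: value}
    (an insertion-ordered stand-in for a pandas Series).
    """
    n_rows = len(frame_index)
    if weights is None:
        return [1.0] * n_rows
    if isinstance(weights, (int, float)):
        return [float(weights)] * n_rows
    if use_index:
        # Align by label; rows missing from `weights` become NaN.
        return [float(weights.get(label, math.nan)) for label in frame_index]
    if len(weights) != n_rows:
        raise ValueError("weights must have the same length as the frame")
    # Positional assignment: labels are ignored, only order matters.
    return [float(value) for value in weights.values()]


frame_index = [10, 11, 12]
ones = resolve_weights(None, frame_index)
constant = resolve_weights(2, frame_index)
positional = resolve_weights({0: 3.0, 1: 4.0, 2: 5.0}, frame_index)
by_label = resolve_weights({12: 5.0, 10: 3.0}, frame_index, use_index=True)
```

In the positional call the labels 0, 1, 2 do not match the frame's labels at all, yet the values are accepted because only the length is checked. In the aligned call, row 11 has no matching label and ends up as NaN.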
- trim(ratio: float | int | None = None, percentile: float | tuple[float, float] | None = None, keep_sum_of_weights: bool = True, target_sum_weights: float | int | np.floating | None = None, *, inplace: bool = False) -> Self
Trim extreme weights using mean-ratio clipping or percentile winsorization.
Delegates to trim_weights() for the computation, then writes the result back via set_weights(). A weight history column (weight_trimmed_N) is added so the pre-trim values are preserved.
- Parameters:
ratio – Mean-ratio upper bound. Mutually exclusive with percentile.
percentile – Percentile(s) for winsorization. Mutually exclusive with ratio.
keep_sum_of_weights – Whether to rescale after trimming to preserve the original sum of weights.
target_sum_weights – If provided, rescale trimmed weights so their sum equals this numeric target value. (This is a general-purpose rescaling parameter — not related to the “target population” concept in BalanceFrame.)
inplace – If True, mutate this SampleFrame and return it. If False (default), return a new SampleFrame with trimmed weights and the original left untouched.
- Returns:
The SampleFrame with trimmed weights (self if inplace, else a new copy).
Examples
>>> import pandas as pd
>>> from balance.sample_frame import SampleFrame
>>> sf = SampleFrame.from_frame(
...     pd.DataFrame({"id": [1, 2, 3], "weight": [1.0, 2.0, 100.0]}))
>>> sf2 = sf.trim(ratio=2)
>>> sf2.weight_series.max() < 100.0
True
>>> "weight_trimmed_1" in sf2._df.columns
True
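As a rough illustration of mean-ratio clipping, the sketch below assumes the cap is ratio times the mean weight and that keep_sum_of_weights rescales the clipped weights back to the original total. Both are assumptions made for this example; trim_weights() may differ in its details, and `trim_by_mean_ratio` is a hypothetical name.

```python
def trim_by_mean_ratio(weights, ratio, keep_sum_of_weights=True):
    """Clip weights above ratio * mean, optionally restoring the original sum."""
    original_sum = sum(weights)
    cap = ratio * original_sum / len(weights)   # assumed cap: ratio * mean
    trimmed = [min(w, cap) for w in weights]
    if keep_sum_of_weights:
        # Rescale so the total weight is unchanged by the clipping.
        scale = original_sum / sum(trimmed)
        trimmed = [w * scale for w in trimmed]
    return trimmed


weights = [1.0, 2.0, 100.0]
clipped = trim_by_mean_ratio(weights, ratio=2, keep_sum_of_weights=False)
rescaled = trim_by_mean_ratio(weights, ratio=2)
```

Under these assumptions, rescaling multiplies every weight by the same factor, so the largest rescaled weight can end up above the cap while still being smaller than it was before trimming.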
- property weight_column: str | None
Name of the currently active weight column, or None.
Note
In balance 0.19.0, weight_column was changed from returning weight data (pd.Series) to returning the column name (str). If you need weight data, use weight_series.
- Returns:
The active weight column name.
- Return type:
str | None
Examples
>>> import pandas as pd
>>> from balance.sample_frame import SampleFrame
>>> df = pd.DataFrame({"id": ["1", "2"], "x": [10, 20],
...                    "weight": [1.0, 2.0]})
>>> sf = SampleFrame.from_frame(df)
>>> sf.weight_column
'weight'
- property weight_columns_all: list[str]
Names of all registered weight columns.
Returns a copy so that callers cannot accidentally mutate the internal column-role registry.
- Returns:
Weight column names.
- Return type:
list[str]
Examples
>>> import pandas as pd
>>> from balance.sample_frame import SampleFrame
>>> sf = SampleFrame._create(
...     df=pd.DataFrame({"id": [1], "x": [10], "w1": [1.0], "w2": [2.0]}),
...     id_column="id", covar_columns=["x"],
...     weight_columns=["w1", "w2"])
>>> sf.weight_columns_all
['w1', 'w2']
- weight_metadata(column: str | None = None) -> dict[str, Any]
Retrieve metadata for a weight column.
- Parameters:
column (str, optional) – Weight column name. Defaults to the active weight column.
- Returns:
The metadata dict, or an empty dict if none was set.
- Return type:
dict
Examples
>>> import pandas as pd
>>> from balance.sample_frame import SampleFrame
>>> df = pd.DataFrame({"id": ["1", "2"], "x": [10, 20],
...                    "weight": [1.0, 2.0]})
>>> sf = SampleFrame.from_frame(df)
>>> sf.weight_metadata()
{}
- property weight_series: Series
Active weight column as a Series (BalanceDFSource protocol).
Returns the active weight column values as a pd.Series. This is the thin protocol-level accessor used by BalanceDF and its subclasses. Unlike df_weights, which returns a single-column DataFrame, this returns a plain Series.
- Returns:
The active weight column values.
- Return type:
pd.Series
- Raises:
ValueError – If no active weight column is set.
Examples
>>> import pandas as pd
>>> from balance.sample_frame import SampleFrame
>>> df = pd.DataFrame({"id": [1, 2], "x": [10, 20],
...                    "weight": [1.0, 2.0]})
>>> sf = SampleFrame.from_frame(df)
>>> sf.weight_series.tolist()
[1.0, 2.0]
- weights() -> Any
Return a BalanceDFWeights for this SampleFrame.
Creates a weight analysis view backed by this SampleFrame, inheriting any linked sources set via _links.
- Returns:
Weight view backed by this SampleFrame.
- Return type:
BalanceDFWeights
Examples
>>> import pandas as pd
>>> from balance.sample_frame import SampleFrame
>>> sf = SampleFrame.from_frame(
...     pd.DataFrame({"id": [1, 2], "x": [10.0, 20.0],
...                   "weight": [1.0, 2.0]}))
>>> sf.weights().df.columns.tolist()
['weight']